WO2019183747A1 - Voice detection method and apparatus - Google Patents

Voice detection method and apparatus

Info

Publication number
WO2019183747A1
WO2019183747A1 PCT/CN2018/080447 CN2018080447W WO2019183747A1 WO 2019183747 A1 WO2019183747 A1 WO 2019183747A1 CN 2018080447 W CN2018080447 W CN 2018080447W WO 2019183747 A1 WO2019183747 A1 WO 2019183747A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
initial
candidate
data block
energy
Prior art date
Application number
PCT/CN2018/080447
Other languages
French (fr)
Chinese (zh)
Inventor
郭红敬
李国梁
王鑫山
杨柯
朱虎
Original Assignee
深圳市汇顶科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市汇顶科技股份有限公司
Priority to PCT/CN2018/080447 (WO2019183747A1)
Priority to CN201880000470.4A (CN110537223B)
Publication of WO2019183747A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • The present application relates to the field of speech detection and, more particularly, to a method and apparatus for voice detection.
  • Voice activity detection (VAD), or speech endpoint detection, usually detects the start point and end point of an actual speech segment in a continuous audio signal against a complex noise background, based on the characteristics of speech and noise, so as to extract the valid speech segment and eliminate interference from non-speech signals such as noise.
  • Existing voice activity detection algorithms fall into three types. The first type makes decisions based on the statistical characteristics of speech and noise, mostly using the maximum likelihood criterion; its calculation amount is relatively small, but its detection performance is mediocre. The second type is based on statistical models and pattern classification; it has high computational complexity and good performance. The third type is based on neural networks and deep learning; its performance is better, but its calculation amount is relatively large and it requires a large amount of training data.
  • The embodiments of the present application provide a method and an apparatus for voice detection that can ensure good detection performance with low complexity and a low calculation amount.
  • a method for voice detection comprising:
  • The method for voice detection in the embodiments of the present application uses the energy of a packet as the feature parameter, which smooths the noise and reduces the false alarm probability. Compared with existing voice detection that uses single-frame energy, this helps improve the accuracy of voice detection; compared with voice detection that uses other parameters, it helps reduce the computational complexity.
  • the method further includes:
  • the method further includes:
  • the candidate speech set processed by the kth iteration is the target speech set
  • the candidate noise set processed by the kth iteration is the target noise set
  • the method further includes:
  • the candidate noise set processed by the kth iteration is the target noise set.
  • the method further includes:
  • Determining the updated target voice set according to the time interval between adjacent packets in the target voice set includes:
  • if the time interval between two adjacent packets in the target voice set is less than a preset threshold, determining that the other packets between the two adjacent packets are also voice signals, and adding those packets to the target voice set to obtain the updated target voice set.
  • Determining the initial noise threshold according to the energy of each packet in the initial candidate noise set includes:
  • the result of multiplying the initial noise power by a threshold factor is determined as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
  • If the first data block is the first data block in the to-be-processed data, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:
  • the average of the energy of each packet in the initial candidate noise set is determined as the initial noise power.
  • If the first data block is not the first data block in the to-be-processed data, and the previous data block of the first data block is a second data block, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:
  • determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energy of each packet in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of each packet in the target noise set of the second data block.
  • The noise threshold for voice detection in the embodiments of the present application is determined according to a threshold factor and a noise power. On the one hand, the noise power is iteratively updated within each data block, so that voice detection within each data block is robust. On the other hand, the noise power can be smoothed between data blocks so that changes in the environmental noise are adaptively tracked, which gives the noise threshold better adaptability across data blocks and makes voice detection robust for every data block of the data to be processed.
  • Determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block includes:
  • the initial noise power of the first data block is determined by the following formula:
  • where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and the weighting coefficient in the formula takes a value between 0 and 1.
  • Determining the initial candidate noise set and the initial candidate voice set according to the energy of the N packets includes:
  • a certain number of packets with the lowest energy among the N packets are determined as the initial candidate noise set, and the other packets among the N packets are determined as the initial candidate speech set.
  • An apparatus for voice detection comprises a determining module for performing the method of the first aspect or any possible implementation of the first aspect.
  • A computer readable medium stores program code for execution by an electronic device, the program code comprising instructions for performing the method of the first aspect.
  • A computer program product comprises computer program code that, when executed by a processor of an electronic device, causes the electronic device to perform the method of the first aspect or any possible implementation of the first aspect.
  • FIG. 1 is a schematic flowchart of a method for voice detection according to an embodiment of the present application.
  • FIG. 2 is an overall flow chart of a method of voice detection according to an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of an apparatus for voice detection according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a method for voice detection according to an embodiment of the present application.
  • The method for voice detection according to an embodiment of the present application is described below with an apparatus for voice detection as the execution subject.
  • The audio signal may be sampled at a certain sampling frequency (for example, 8 kHz, 16 kHz, or 32 kHz) to obtain the data to be processed, which may include a noise signal and/or a voice signal. The apparatus for voice detection may process the sampled data to obtain the voice signal therein.
  • The apparatus for voice detection may divide the data to be processed into a plurality of data blocks and determine the voice signal and the noise signal in each data block. The method is described below taking a first data block in the data to be processed as an example.
  • the method 100 includes:
  • S120: Determine, according to the energy of the N packets, an initial candidate noise set and an initial candidate voice set, where the maximum energy of the packets in the initial candidate noise set is smaller than the energy of any packet in the initial candidate voice set.
  • S140: Determine, according to the initial candidate voice set and the initial noise threshold, the candidate noise set processed by the first iteration and the candidate voice set processed by the first iteration, where the energy of the packets in the candidate noise set processed by the first iteration is less than or equal to the initial noise threshold, and the energy of the packets in the candidate voice set processed by the first iteration is greater than the initial noise threshold.
  • The apparatus for voice detection may divide the first data block into N packets and calculate energy in units of packets, which reduces the calculation amount of voice detection. Because the energy of a packet is determined from multiple frames of sampled data rather than from each single frame, this also helps smooth the noise and improve the accuracy of voice detection.
  • The average of the power of each frame of sampled data in a packet may be used as the energy of the packet, or the sum of the power of each frame of sampled data in the packet may be used as the energy of the packet, or the power of the sampled data in the packet may be smoothed to obtain the energy of the packet. The calculation method of the energy of the packet is not specifically limited in the embodiments of the present application.
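  • The packet-energy step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `packet_energies`, the `mode` parameter, the use of the squared sample value as the power, and the dropping of a trailing partial packet are all assumptions made for the example.

```python
def packet_energies(block, samples_per_packet, mode="mean"):
    """Split one data block into packets and return one energy value per packet.

    Assumptions (not from the source): power of a sample is its squared value,
    and a trailing partial packet is dropped.
    """
    if mode not in ("mean", "sum"):
        raise ValueError("mode must be 'mean' or 'sum'")
    n_packets = len(block) // samples_per_packet
    energies = []
    for p in range(n_packets):
        packet = block[p * samples_per_packet:(p + 1) * samples_per_packet]
        total = sum(x * x for x in packet)  # summed power of the packet
        # "mean" averages the power over the packet; "sum" keeps the total.
        energies.append(total / samples_per_packet if mode == "mean" else total)
    return energies
```

  • Either mode gives one value per packet, so later steps compare N values per data block instead of one value per frame.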
  • An initial candidate noise set and an initial candidate speech set may be determined according to the energy of the N packets, where the packets in the initial candidate noise set can be regarded as noise signals and the packets in the initial candidate speech set can be regarded as speech signals.
  • A certain proportion of the N packets with the lowest energy may be determined as the initial candidate noise set, with the remaining packets determined as the initial candidate speech set; alternatively, a certain number of the N packets with the lowest energy may be determined as the initial candidate noise set, with the remaining packets determined as the initial candidate speech set. The manner of determining the initial candidate speech set and the initial candidate noise set is not specifically limited in the embodiments of the present application.
  • The N packets may be arranged in ascending order of energy, so that the packets with the lowest energy are ranked first. A certain proportion (e.g., 20%) or a certain number (e.g., 20) of the first-ranked packets then constitute the initial candidate noise set, and the remaining packets are determined as the initial candidate speech set. The energy of the packets in the initial candidate noise set is thus less than the energy of the packets in the initial candidate speech set.
  • For example, with N = 100 packets sorted in ascending order of energy, the initial candidate noise set includes packet 1 to packet 20, and the 80 packets with larger energy constitute the initial candidate speech set, i.e., the initial candidate speech set includes packet 21 to packet 100.
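  • The sort-and-split step described above might look like the sketch below. The function name `initial_candidate_sets` and the `noise_fraction` parameter are hypothetical; the 20% default mirrors the example proportion in the text, and keeping at least one packet in the noise set is an assumption added so that a noise power can always be computed.

```python
def initial_candidate_sets(energies, noise_fraction=0.2):
    """Return (noise_indices, speech_indices) from per-packet energies.

    Packets are ranked in ascending order of energy; the lowest-energy
    fraction forms the initial candidate noise set, the rest the initial
    candidate speech set. Indices refer to the original (time) order.
    """
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    # Assumption: keep at least one packet in the noise set.
    n_noise = max(1, int(round(len(energies) * noise_fraction)))
    return order[:n_noise], order[n_noise:]
```

  • Returning indices rather than energies keeps the link to the sampling order, which is needed later when the decision result is corrected in time order.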
  • The apparatus for voice detection may determine an initial noise threshold according to the energy of each packet in the initial candidate noise set. The initial noise threshold may be used to determine whether noise signals are still present in the initial candidate voice set: a packet whose energy is less than the initial noise threshold can be regarded as a noise signal.
  • Determining the initial noise threshold according to the energy of each packet in the initial candidate noise set includes:
  • the result of multiplying the initial noise power by a threshold factor is determined as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
  • The threshold factor T may be determined according to the data length M of the packet and the target false alarm probability P_fa, where the target false alarm probability is the maximum false alarm probability allowed by the system, that is, the maximum probability, allowed by the system, that a noise signal is judged to be a voice signal.
  • The threshold factor T can be determined according to the following formula.
  • The initial noise power may be the average of the power of each packet in the initial candidate noise set, or the sum of the power of each packet in the initial candidate noise set, and so on; this is not limited in the embodiments of the present application.
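  • As a small sketch of the threshold computation, assuming the averaging variant of the noise power: the function name `noise_threshold` is hypothetical, and the threshold factor is passed in as a plain number because the formula that derives T from M and P_fa is not reproduced in this text.

```python
def noise_threshold(noise_energies, threshold_factor):
    """Noise threshold = noise power * threshold factor.

    noise_energies: energies of the packets in the candidate noise set.
    threshold_factor: the factor T, supplied by the caller (how T is derived
    from the packet length and the false alarm probability is not shown here).
    """
    noise_power = sum(noise_energies) / len(noise_energies)  # averaging variant
    return noise_power * threshold_factor
```

  • The same function applies to later iterations, with the candidate noise set of that iteration as input.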
  • The energy of each packet in the initial candidate speech set may be compared in turn with the initial noise threshold. If the energy of a packet is less than the initial noise threshold, the packet may be regarded as a noise signal. After the comparison is completed, the packets of the initial candidate speech set whose energy is less than the initial noise threshold may be added to the initial candidate noise set to obtain the candidate noise set processed by the first iteration, and the packets among the N packets other than those in the candidate noise set processed by the first iteration are determined as the candidate speech set processed by the first iteration. In other words, the candidate noise set and the candidate speech set processed by the first iteration are the sets obtained by the first update.
  • For example, if the initial candidate noise set includes packet 1 to packet 20 and the initial candidate speech set includes packet 21 to packet 100, the initial noise threshold can be compared with the energy of each of packet 21 to packet 100. If the energy of packet 21 to packet 40 is less than the initial noise threshold, packet 21 to packet 40 may be added to the initial candidate noise set to obtain the candidate noise set processed by the first iteration, which includes packet 1 to packet 40; the candidate speech set processed by the first iteration then includes packet 41 to packet 100.
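  • One reclassification pass can be sketched as below. The function name `first_iteration` is hypothetical. The comparison uses "less than or equal to" so that the resulting sets satisfy the condition stated for S140; treating a packet exactly at the threshold as noise is therefore a choice made for this example.

```python
def first_iteration(energies, noise_idx, speech_idx, threshold):
    """Move low-energy packets from the candidate speech set to the noise set.

    energies: per-packet energies; noise_idx / speech_idx: index lists.
    Returns the candidate noise set and candidate speech set after one pass.
    """
    # Packets at or below the threshold are treated as noise (assumption for ties).
    moved = [i for i in speech_idx if energies[i] <= threshold]
    new_noise = list(noise_idx) + moved
    new_speech = [i for i in speech_idx if energies[i] > threshold]
    return new_noise, new_speech
```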
  • The method for voice detection in the embodiments of the present application uses the energy of a packet as the feature parameter, which smooths the noise and reduces the false alarm probability. Compared with existing voice detection that uses single-frame energy, this helps improve the accuracy of voice detection; compared with voice detection that uses other parameters, it helps reduce the computational complexity.
  • the method 100 may further include:
  • The candidate noise set processed by the (k+1)th iteration and the candidate speech set processed by the (k+1)th iteration are determined according to the candidate speech set processed by the kth iteration and the noise threshold of the kth iteration.
  • The noise threshold of the first iteration may likewise be determined according to the energy of each packet in the candidate noise set processed by the first iteration, and used to determine whether the candidate speech set processed by the first iteration still includes noise signals. If the energy of every packet in the candidate speech set processed by the first iteration is greater than the noise threshold of the first iteration, that set contains no noise signal; it can therefore be confirmed as the target speech set, and the candidate noise set processed by the first iteration as the target noise set.
  • Otherwise, the above iterative operation may continue until the energy of every packet in the candidate speech set processed by the kth iteration is greater than the noise threshold of the kth iteration, or until the number of iterations k reaches an upper limit. At that point, the candidate speech set processed by the kth iteration is determined to be the target speech set, i.e., its packets are all voice signals, and the candidate noise set processed by the kth iteration is determined to be the target noise set, i.e., its packets are all noise signals. This yields a decision result for each of the N packets of the first data block: which packets are voice signals and which are noise signals.
  • Determining the noise threshold of the kth iteration according to the energy of each packet in the candidate noise set processed by the kth iteration may include:
  • the product of the noise power of the kth iteration and the threshold factor is determined as the noise threshold of the kth iteration.
  • The noise threshold for voice detection is thus determined according to a threshold factor and a noise power. The iterative process above updates the noise power, and with it the noise threshold, within each data block, which improves the robustness of voice detection within each data block.
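  • Putting the steps together for a single data block, the iteration could be sketched as follows. The function name `detect_block` and the parameters `noise_fraction` and `max_iter` are hypothetical, the noise power is taken as the average energy of the candidate noise set, and the threshold factor is an input because its formula is not reproduced in this text.

```python
def detect_block(energies, threshold_factor, noise_fraction=0.2, max_iter=10):
    """Iteratively split one block's packets into noise and speech index lists."""
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    n_noise = max(1, int(round(len(energies) * noise_fraction)))
    noise, speech = order[:n_noise], order[n_noise:]
    for _ in range(max_iter):  # upper limit on the number of iterations
        noise_power = sum(energies[i] for i in noise) / len(noise)
        threshold = noise_power * threshold_factor
        moved = [i for i in speech if energies[i] <= threshold]
        if not moved:
            break  # every remaining speech packet exceeds the threshold
        noise = noise + moved
        speech = [i for i in speech if energies[i] > threshold]
    # Return indices in time order: target noise set, target speech set.
    return sorted(noise), sorted(speech)
```

  • With ten packets of which three are loud, for instance `detect_block([1.0, 1.2, 0.9, 1.1, 1.0, 20.0, 25.0, 1.3, 22.0, 1.0], 3.0)`, the loop stops on the second pass because no further packet falls below the updated threshold.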
  • The apparatus for voice detection may further perform smoothing processing on the current data block according to the noise power of the adjacent data block.
  • Case 1: If the first data block is the first data block in the to-be-processed data, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:
  • the average of the energy of each packet in the initial candidate noise set is determined as the initial noise power.
  • In this case, the average of the power of each packet in the initial candidate noise set of the first data block may be directly determined as the initial noise power of the first data block, or the sum of the power of each packet in that set may be determined as the initial noise power of the first data block, and so on.
  • For the manner of determining the initial candidate noise set of the first data block, refer to the related description of the foregoing embodiment; details are not described herein again.
  • Case 2: If the first data block is not the first data block in the to-be-processed data, and the previous data block of the first data block is a second data block, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:
  • determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energy of each packet in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of each packet in the target noise set of the second data block.
  • That is, the initial noise power of the first data block may be determined according to the estimated noise power of the first data block and the target noise power of its previous data block, i.e., the second data block.
  • The estimated noise power of the first data block may be determined according to the energy of each packet in the initial candidate noise set of the first data block; for example, it may be the average of the power of each packet in that set, or the sum of the power of each packet in that set.
  • The target noise power of the second data block may be determined according to the energy of each packet in the target noise set of the second data block; for example, it may be the average of the power of each packet in that set, or the sum of the power of each packet in that set.
  • For the manner of determining the target noise set of the second data block, refer to the manner of determining the target noise set of the first data block; details are not described herein again.
  • The initial noise power of the first data block may be determined according to the following formula (4):
  • where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and the weighting coefficient in the formula takes a value between 0 and 1.
  • That is, the estimated noise power of the first data block is smoothed with the stable noise power, namely the target noise power of the previous data block (the second data block), to obtain the initial noise power of the first data block. The initial noise threshold of the first data block can then be determined according to the initial noise power of the first data block and the threshold factor.
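  • The inter-block smoothing can be illustrated with a weighted average. Formula (4) itself is not reproduced in this text, so the convex-combination form below, the name `alpha`, and the choice of which term `alpha` weights are all assumptions for illustration only; the source states only that the inputs are the estimated noise power of the current block and the target noise power of the previous block, with a coefficient between 0 and 1.

```python
def smoothed_initial_noise_power(estimated_power, previous_target_power, alpha=0.5):
    """Blend the current block's estimate with the previous block's target power.

    ASSUMED form: alpha * estimate + (1 - alpha) * previous target.
    The actual formula in the source may weight the terms differently.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * estimated_power + (1.0 - alpha) * previous_target_power
```

  • Under this assumed form, `alpha = 1` ignores the previous block entirely and `alpha = 0` reuses the previous block's target noise power unchanged; intermediate values trade tracking speed against stability.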
  • The noise threshold for voice detection in the embodiments of the present application is determined according to a threshold factor and a noise power. On the one hand, the noise power is iteratively updated within each data block, so that voice detection within each data block is robust. On the other hand, the noise power can be smoothed between data blocks so that changes in the environmental noise are adaptively tracked, which gives the noise threshold better adaptability across data blocks and makes voice detection robust for every data block of the data to be processed.
  • The method for voice detection in the embodiments of the present application can therefore adaptively adjust the noise threshold according to the ambient noise while keeping the computational complexity low, and remains robust while ensuring detection performance.
  • the method 100 may further include:
  • A speech signal is continuous over a short time, so the decision result should also be continuous over a short time. Abrupt changes in the energy of the original speech, or the influence of noise, may cause the decision result to switch between voice signal and noise signal relatively frequently; the decision result can be corrected on this basis.
  • After the above processing, the signal type of each of the N packets is known, that is, whether it is a voice signal or a noise signal. The N packets can then be arranged in order of sampling time, that is, restored to their original order.
  • The decision result of the voice detection may be corrected according to the time interval between two adjacent packets that belong to the voice signal, so as to determine the updated target voice set. For example, if the time interval between two adjacent packets that belong to the voice signal is less than a preset threshold, the other packets between the two packets can be determined to be voice signals as well and added to the target voice set.
  • The corrected target noise set may be determined in a similar manner; details are not described herein for brevity.
  • For example, if packet 21 and packet 30 are two adjacent packets that belong to the voice signal and the time interval between them is 10 ms, the interval is short; the packets between packet 21 and packet 30, i.e., packets 22 to 29, can then also be determined to be voice signals, which gives the updated target voice set.
  • The method for voice detection in the embodiments of the present application can thus also correct the decision result according to the non-abrupt nature of the voice signal, thereby improving the accuracy of the voice detection.
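  • The correction step can be sketched as gap filling over the time-ordered labels. The function name `fill_short_gaps` and its parameters are hypothetical, and measuring the time interval as the number of intervening packets multiplied by a fixed packet duration is an assumption; the source does not say how the interval is measured.

```python
def fill_short_gaps(labels, packet_duration_ms, max_gap_ms):
    """Relabel short noise gaps between speech packets as speech.

    labels: list of 0/1 per packet in sampling-time order (1 = voice, 0 = noise).
    A gap is filled when its duration is below max_gap_ms (the preset threshold).
    """
    corrected = list(labels)
    speech_positions = [i for i, v in enumerate(labels) if v == 1]
    # Walk over consecutive pairs of packets already labeled as voice.
    for a, b in zip(speech_positions, speech_positions[1:]):
        gap_ms = (b - a - 1) * packet_duration_ms  # assumed interval measure
        if b - a > 1 and gap_ms < max_gap_ms:
            for i in range(a + 1, b):
                corrected[i] = 1
    return corrected
```

  • The input list is left unchanged and a corrected copy is returned, so the original decision result stays available for comparison.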
  • The method for voice detection according to an embodiment of the present application is described in detail below with reference to the overall flowchart shown in FIG. 2. As shown in FIG. 2, the method may include the following:
  • The data to be processed may be divided into multiple data blocks for processing.
  • The energy of each packet of a data block (for example, the j-th packet) may be the sum of the power of the sample points in the packet, or the average of the power of the sample points; this is not limited in the embodiments of the present application.
  • The N packets may be sorted by energy, for example, in ascending order.
  • S203 is executed to determine whether the data block is the first data block in the to-be-processed data; if yes, S204 is executed; otherwise, S205 is executed.
  • In S204, an initial noise power of the data block is determined.
  • The implementation of S204 may correspond to Case 1 in the foregoing embodiment; for brevity, details are not described herein again.
  • In S205, an initial noise power of the data block is determined according to the estimated noise power of the data block and the target noise power of the previous data block of the data block.
  • The implementation of S205 may correspond to Case 2 in the foregoing embodiment; for brevity, details are not described herein again.
  • S206 may then be performed to determine a noise threshold according to the noise power determined in S204 or S205 and the threshold factor determined in S201; for example, the product of the noise power and the threshold factor can be determined as the noise threshold.
  • S207 is executed to re-determine the noise set and the voice set in the data block according to the noise threshold: packets in the data block whose energy is greater than the noise threshold may be assigned to the voice set, and packets whose energy is less than or equal to the noise threshold to the noise set.
  • If the initial noise power of the data block is determined according to the energy of packet 1 to packet k_1 in the data block, packet 1 to packet k_1 can be considered to constitute the initial candidate noise set, and packet k_1 + 1 to packet N the initial candidate speech set.
  • In S208, it is determined whether any new packet has joined the noise set of the data block; if so, S209 is performed; otherwise, S210 is performed.
  • When a new packet has joined, the updated noise power is determined according to the re-determined noise set, and execution jumps back to S206, where the updated noise threshold is determined according to the updated noise power. S207 may then be performed again to re-determine the noise set and the voice set in the data block according to the updated noise threshold, until the preset number of iterations is reached or the noise power, and hence the noise threshold, stabilizes. Packets whose energy is greater than the noise threshold can then be determined to be voice signals, and packets whose energy is less than or equal to the noise threshold to be noise signals.
  • The decision result of the voice detection may also be corrected. Each packet may be given an identifier; for example, the identifier of a packet belonging to the voice signal may be set to 1 and that of a packet belonging to the noise signal to 0. The packets are then sorted by sampling time, that is, restored to their original order, and the decision result is corrected according to the time interval between adjacent packets belonging to the voice signal.
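  • The overall flow across several data blocks might be sketched as below. This is an illustrative outline, not the patent's implementation: the function name `detect_stream`, the parameters `noise_fraction`, `alpha`, and `max_iter`, the assumed weighted-average smoothing, and the externally supplied threshold factor are all assumptions, and the final time-domain correction of the decision result is left out.

```python
def detect_stream(block_energies, threshold_factor, noise_fraction=0.2,
                  alpha=0.5, max_iter=10):
    """block_energies: list of per-block lists of packet energies (time order).

    Returns one 0/1 label list per block (1 = voice, 0 = noise).
    """
    results = []
    prev_target_power = None
    for energies in block_energies:
        # Sort by energy and take the lowest-energy packets as initial noise.
        order = sorted(range(len(energies)), key=lambda i: energies[i])
        n_noise = max(1, int(round(len(energies) * noise_fraction)))
        noise = set(order[:n_noise])
        noise_power = sum(energies[i] for i in noise) / len(noise)
        if prev_target_power is not None:
            # Non-first block: smooth with the previous block (assumed form).
            noise_power = alpha * noise_power + (1 - alpha) * prev_target_power
        for _ in range(max_iter):
            threshold = noise_power * threshold_factor
            below = {i for i, e in enumerate(energies) if e <= threshold}
            if below <= noise:
                break  # no new packet joined the noise set
            noise |= below
            noise_power = sum(energies[i] for i in noise) / len(noise)
        # Target noise power of this block: average energy of its noise set.
        prev_target_power = sum(energies[i] for i in noise) / len(noise)
        results.append([0 if i in noise else 1 for i in range(len(energies))])
    return results
```

  • Each block therefore starts from a threshold informed by the previous block and refines it with its own packets, which is the combination of inter-block smoothing and intra-block iteration described above.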
  • The method embodiments of the present application are described in detail above with reference to FIG. 1 and FIG. 2.
  • The apparatus embodiment of the present application is described in detail below with reference to FIG. 3. It should be understood that the apparatus embodiment corresponds to the method embodiments; for similar descriptions, refer to the method embodiments.
  • FIG. 3 is a schematic structural diagram of an apparatus for voice detection according to an embodiment of the present application.
  • the apparatus 300 includes a determination module 310.
  • the determining module 310 is configured to:
  • the determining module 310 is further configured to:
  • the determining module 310 is further configured to:
  • the candidate speech set processed by the kth iteration is the target speech set
  • the candidate noise set processed by the kth iteration is the target noise set
  • the determining module 310 is further configured to:
  • the candidate noise set processed by the kth iteration is the target noise set.
  • the determining module 310 is further configured to:
  • the determining module 310 is specifically configured to:
  • if the time interval between two adjacent packets in the target voice set is less than a preset threshold, determine that the other packets between the two adjacent packets are also voice signals, and add those packets to the target voice set to obtain the updated target voice set.
  • the determining module 310 is specifically configured to:
  • a result obtained by multiplying the initial noise power by a threshold factor is determined as the initial noise threshold, wherein the threshold factor is determined according to a target false alarm probability.
  • If the first data block is the first data block in the to-be-processed data, the determining module 310 is specifically configured to:
  • determine the average of the energy of each packet in the initial candidate noise set as the initial noise power.
  • If the first data block is not the first data block in the to-be-processed data, and the previous data block of the first data block is a second data block, the determining module 310 is specifically configured to:
  • determine the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energy of each packet in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of each packet in the target noise set of the second data block.
  • the determining module 310 is specifically configured to:
  • the initial noise power of the first data block is determined according to the following formula:
  • where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and the weighting coefficient in the formula takes a value between 0 and 1.
  • the determining module 310 is further configured to:
  • determine a certain number of lower-energy groups among the N groups as the initial candidate noise set, and determine the other groups among the N groups as the initial candidate speech set.
  • the determining module 310 may be a processor with specific processing capability. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like, which is not limited in this embodiment.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the apparatus 300 for voice detection may further include a memory, which may include a read only memory and a random access memory, and provides instructions and data to the processor.
  • a portion of the memory may also include a non-volatile random access memory.
  • the memory can also store information of the device type.
  • the memory may also be used to store the collected audio data.
  • the embodiment of the present application further provides a computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by a portable electronic device including a plurality of applications, cause the portable electronic device to perform the method of the embodiments shown in Figures 1 to 2.
  • the embodiment of the present application also proposes a computer program comprising instructions which, when executed by a computer, cause the computer to perform the corresponding flow of the method of the embodiment shown in Figures 1 to 2.
  • the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function.
  • in actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A voice detection method and apparatus. The method comprises: determining the energy of each group from among N groups of a first data block in data to be processed, wherein N is a positive integer (S110); according to the energy of the N groups, determining an initial candidate noise set and an initial candidate voice set, wherein the maximum energy of groups in the initial candidate noise set is less than the minimum energy of groups in the initial candidate voice set (S120); according to the energy of each group in the initial candidate noise set, determining an initial noise threshold (S130); and according to the initial candidate voice set and the initial noise threshold, determining a candidate noise set iteratively processed for the first time and a candidate voice set iteratively processed for the first time, wherein the energy of groups in the candidate noise set iteratively processed for the first time is less than or equal to the initial noise threshold; and the energy of groups in the candidate voice set iteratively processed for the first time is greater than the initial noise threshold (S140).

Description

Method and apparatus for voice detection
Technical Field
The present application relates to the field of voice detection and, more particularly, to a method and apparatus for voice detection.
Background
With the rapid development of mobile Internet of Things, human-computer interaction, artificial intelligence, and related technologies, smart speakers, smart wearable devices, and voice assistant products of all kinds keep emerging, and users expect ever higher speech quality and a better product experience. This in turn poses a major challenge for speech recognition, speech enhancement, and voice interaction.
Voice activity detection (VAD), also known as speech endpoint detection, typically exploits the differing characteristics of speech and noise to detect the start point and end point of actual speech segments in a continuous audio signal against a complex noise background, thereby extracting the valid speech segments and excluding interference from noise and other non-speech signals.
Existing voice activity detection algorithms fall into three categories. The first category comprises decision methods based on the statistical characteristics of speech and noise, which mostly use the maximum likelihood criterion; these methods require relatively little computation, but their detection performance is mediocre. The second category comprises methods based on statistical models and pattern classification, which have high computational complexity and relatively good performance. The third category comprises methods based on neural networks and deep learning, which perform relatively well but require a large amount of computation and a large amount of training data.
Therefore, a voice detection algorithm is needed that achieves good detection performance with low complexity and a low computational load.
Summary
Embodiments of the present application provide a method and apparatus for voice detection that achieve good detection performance with low complexity and a low computational load.
In a first aspect, a method for voice detection is provided, comprising:
determining the energy of each of N groups of a first data block in data to be processed, where N is a positive integer;
determining an initial candidate noise set and an initial candidate speech set according to the energies of the N groups, where the maximum energy of the groups in the initial candidate noise set is less than the minimum energy of the groups in the initial candidate speech set;
determining an initial noise threshold according to the energy of each group in the initial candidate noise set; and
determining, according to the initial candidate speech set and the initial noise threshold, the candidate noise set of the first iteration and the candidate speech set of the first iteration, where the energy of every group in the candidate noise set of the first iteration is less than or equal to the initial noise threshold, and the energy of every group in the candidate speech set of the first iteration is greater than the initial noise threshold.
Therefore, the method for voice detection of the embodiments of the present application uses group energy as the feature parameter, which smooths the noise and reduces the false alarm probability. Compared with existing voice detection based on single-frame energy, this helps improve detection accuracy; compared with voice detection based on other parameters, it helps reduce computational complexity.
In a possible implementation, the method further comprises:
determining the noise threshold of the kth iteration according to the energy of each group in the candidate noise set of the kth iteration, where k = 1, 2, ...; and
determining, according to the candidate speech set of the kth iteration and the noise threshold of the kth iteration, the candidate noise set of the (k+1)th iteration and the candidate speech set of the (k+1)th iteration.
In a possible implementation, the method further comprises:
when the iteration count k reaches the iteration upper limit, determining that the candidate speech set of the kth iteration is the target speech set and that the candidate noise set of the kth iteration is the target noise set.
In a possible implementation, the method further comprises:
if the energy of every group in the candidate speech set of the kth iteration is greater than the noise threshold of the kth iteration, determining that the candidate speech set of the kth iteration is the target speech set and that the candidate noise set of the kth iteration is the target noise set.
In a possible implementation, the method further comprises:
arranging the groups in the target speech set in chronological order; and
determining the updated target speech set according to the time interval between adjacent groups in the target speech set.
In a possible implementation, determining the updated target speech set according to the time interval between adjacent groups in the target speech set comprises:
if the time interval between two adjacent groups in the target speech set is less than a preset threshold, determining that the other groups between the two adjacent groups are also speech signals, and adding those groups to the target speech set to obtain the updated target speech set.
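The gap-filling rule above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: it assumes that each group is identified by its integer position within the data block and that the time interval between two groups is the difference of those positions. The names `fill_gaps` and `max_gap` are hypothetical.

```python
def fill_gaps(speech_groups, max_gap):
    """Add the groups lying between two nearby speech groups to the speech set.

    speech_groups: integer positions of the groups judged to be speech.
    max_gap: hypothetical preset threshold, expressed in group positions.
    """
    ordered = sorted(speech_groups)              # arrange in chronological order
    updated = set(ordered)
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier < max_gap:            # interval below the preset threshold
            updated.update(range(earlier + 1, later))  # groups in between become speech
    return sorted(updated)


print(fill_gaps([2, 3, 6, 20], max_gap=4))  # [2, 3, 4, 5, 6, 20]
```

Groups 4 and 5 are added because they lie between speech groups 3 and 6, which are closer than the threshold; the long gap between 6 and 20 is left untouched.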
In a possible implementation, determining the initial noise threshold according to the energy of each group in the initial candidate noise set comprises:
determining an initial noise power according to the energy of each group in the initial candidate noise set; and
determining the product of the initial noise power and a threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
In a possible implementation, the first data block is the first data block in order in the data to be processed, and determining the initial noise power according to the energy of each group in the initial candidate noise set comprises:
determining the average of the energy of each group in the initial candidate noise set as the initial noise power.
In a possible implementation, the first data block is not the first data block in order in the data to be processed, the data block preceding the first data block is a second data block, and determining the initial noise power according to the energy of each group in the initial candidate noise set comprises:
determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energy of each group in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energy of each group in the target noise set of the second data block.
Therefore, the noise threshold used for voice detection in the embodiments of the present application is determined according to the threshold factor and the noise power. On the one hand, the noise power is iteratively updated within each data block, which makes voice detection within each data block more robust. On the other hand, the noise power can be smoothed across data blocks, which adaptively tracks changes in the ambient noise, so that the noise threshold adapts well from one data block to the next and every data block of the data to be processed is detected robustly.
In a possible implementation, determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block comprises determining the initial noise power of the first data block according to the following formula:
$P_1 = \alpha P_1' + (1 - \alpha) P_2''$
where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and $0 < \alpha < 1$.
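The smoothing formula can be illustrated with a short sketch. The function name, the argument names, and the value α = 0.5 are assumptions chosen for illustration; the source only requires 0 < α < 1 and, for the first data block in order, uses the estimated noise power directly.

```python
def initial_noise_power(estimated_power, previous_target_power=None, alpha=0.5):
    """Return P1 = alpha * P1' + (1 - alpha) * P2''.

    estimated_power: P1', mean energy of the block's initial candidate noise set.
    previous_target_power: P2'', target noise power of the preceding block,
        or None when the block is the first one in the data to be processed.
    alpha: smoothing weight; 0.5 is only an illustrative value.
    """
    if previous_target_power is None:            # first block: nothing to smooth with
        return estimated_power
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return alpha * estimated_power + (1.0 - alpha) * previous_target_power


print(initial_noise_power(2.0, 4.0, alpha=0.5))  # 3.0
```

A larger α lets the current block's estimate dominate, so the threshold follows changes in ambient noise faster; a smaller α gives more weight to the previous block and a steadier threshold.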
In a possible implementation, determining the initial candidate noise set and the initial candidate speech set according to the energies of the N groups comprises:
determining a certain proportion of the lower-energy groups among the N groups as the initial candidate noise set, and determining the other groups among the N groups as the initial candidate speech set; or
determining a certain number of the lower-energy groups among the N groups as the initial candidate noise set, and determining the other groups among the N groups as the initial candidate speech set.
In a second aspect, an apparatus for voice detection is provided, comprising a determining module configured to perform the method of the first aspect or of any possible implementation of the first aspect.
In a third aspect, a computer-readable medium is provided, the computer-readable medium storing program code for execution by an electronic device, the program code comprising instructions for performing the method of the first aspect.
In a fourth aspect, a computer program product is provided, comprising computer program code that, when run by a processor of an electronic device, causes the electronic device to perform the method of the first aspect or of any possible implementation of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for voice detection according to an embodiment of the present application.
FIG. 2 is an overall flowchart of a method for voice detection according to an embodiment of the present application.
FIG. 3 is a schematic block diagram of an apparatus for voice detection according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a method for voice detection according to an embodiment of the present application. The method is described below with the apparatus for voice detection as the executing entity.
Optionally, an audio signal may be sampled at a certain sampling frequency (for example, 8 kHz, 16 kHz, or 32 kHz) to obtain the data to be processed, which may include noise signals and/or speech signals. The apparatus for voice detection may process the sampled data to be processed to obtain the speech signals in it. In the embodiments of the present application, the apparatus may divide the data to be processed into multiple data blocks, process them separately, and determine the speech signals and noise signals in each data block. The method for voice detection according to the embodiments of the present application is described in detail below, taking a first data block in the data to be processed as an example.
As shown in FIG. 1, the method 100 includes:
S110: determining the energy of each of N groups of a first data block in data to be processed, where N is a positive integer;
S120: determining an initial candidate noise set and an initial candidate speech set according to the energies of the N groups, where the maximum energy of the groups in the initial candidate noise set is less than the minimum energy of the groups in the initial candidate speech set;
S130: determining an initial noise threshold according to the energy of each group in the initial candidate noise set;
S140: determining, according to the initial candidate speech set and the initial noise threshold, the candidate noise set of the first iteration and the candidate speech set of the first iteration, where the energy of every group in the candidate noise set of the first iteration is less than or equal to the initial noise threshold, and the energy of every group in the candidate speech set of the first iteration is greater than the initial noise threshold.
Specifically, the apparatus for voice detection may divide the first data block into N groups and compute the energy group by group, which reduces the computational load of voice detection. In addition, the energy of a group is determined from the multiple frames of sampled data in that group rather than detecting speech from the energy of each individual frame, which helps smooth the noise and improve detection accuracy.
Optionally, in some embodiments, the energy of a group may be the average of the power of each frame of sampled data in the group, the sum of the power of each frame of sampled data in the group, or a smoothed value of the power of each frame of sampled data in the group. The embodiments of the present application do not specifically limit how the energy of a group is computed.
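The first of these options, the average of per-frame power, might be computed as in the sketch below. The source does not define frame power or frame length; taking frame power as the mean of squared samples, and dropping any incomplete trailing group, are assumptions of this sketch, as are the names `group_energies`, `frame_len`, and `frames_per_group`.

```python
def group_energies(samples, frame_len, frames_per_group):
    """Split one data block into groups and return one energy value per group.

    Assumption: frame power is the mean of squared samples in the frame, and
    group energy is the average of the frame powers in the group.
    """
    group_len = frame_len * frames_per_group
    energies = []
    # Any incomplete trailing group is dropped (a choice made for this sketch).
    for start in range(0, len(samples) - group_len + 1, group_len):
        frame_powers = []
        for f in range(frames_per_group):
            frame = samples[start + f * frame_len : start + (f + 1) * frame_len]
            frame_powers.append(sum(x * x for x in frame) / frame_len)
        energies.append(sum(frame_powers) / frames_per_group)
    return energies


print(group_energies([1, 1, 1, 1, 2, 2, 2, 2], frame_len=2, frames_per_group=2))
# [1.0, 4.0]
```

Because each group energy averages over several frames, a single noisy frame has less influence on the decision than it would in a per-frame detector.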
After the energy of each of the N groups is obtained, the initial candidate noise set and the initial candidate speech set may be determined according to the energies of the N groups, where the groups in the initial candidate noise set may be regarded as noise signals and the groups in the initial candidate speech set may be regarded as speech signals.
For example, a certain proportion of the lower-energy groups among the N groups may be determined as the initial candidate noise set and the remaining groups as the initial candidate speech set; or a certain number of the lower-energy groups among the N groups may be determined as the initial candidate noise set and the remaining groups as the initial candidate speech set. The embodiments of the present application do not specifically limit how the initial candidate speech set and the initial candidate noise set are divided.
In a specific implementation, the N groups may be sorted in ascending order of energy, so that the lower-energy groups come first. A certain proportion (for example, 20%) or a certain number (for example, 20) of the first-ranked groups may then be selected to form the initial candidate noise set, and the remaining groups are determined as the initial candidate speech set. The energy of every group in the initial candidate noise set is therefore less than the energy of every group in the initial candidate speech set.
By way of example and not limitation, if N = 100, the 20 lowest-energy groups among the 100 groups may be selected to form the initial candidate noise set, that is, the initial candidate noise set includes group 1 to group 20 (numbered in ascending order of energy), and the 80 higher-energy groups form the initial candidate speech set, that is, group 21 to group 100.
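The sort-and-split step can be sketched as follows, using the 20% proportion from the example. The function name `initial_split` and the parameter `noise_fraction` are hypothetical, and the sets are represented as lists of group indices into the energy list.

```python
def initial_split(energies, noise_fraction=0.2):
    """Return (noise, speech) as lists of group indices.

    The groups are ranked by ascending energy; the lowest `noise_fraction`
    of them form the initial candidate noise set, the rest the speech set.
    """
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    n_noise = max(1, int(len(energies) * noise_fraction))  # keep at least one noise group
    return order[:n_noise], order[n_noise:]


energies = [9.0, 0.5, 7.0, 0.3, 8.0, 6.0, 0.9, 5.0, 4.0, 3.0]
noise, speech = initial_split(energies)
print(noise)   # [3, 1]
print(speech)  # [6, 9, 8, 7, 5, 2, 4, 0]
```

If two groups on either side of the cut have exactly equal energy, the strict inequality between the two sets stated above does not hold; this sketch does not treat that case specially.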
Further, the apparatus for voice detection may determine an initial noise threshold according to the energy of each group in the initial candidate noise set. The initial noise threshold may be used to determine whether noise signals remain in the initial candidate speech set, where a group whose energy is less than the initial noise threshold may be regarded as a noise signal.
Optionally, in some embodiments, determining the initial noise threshold according to the energy of each group in the initial candidate noise set includes:
determining an initial noise power according to the energy of each group in the initial candidate noise set; and
determining the product of the initial noise power and a threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
By way of example and not limitation, the threshold factor T may be determined according to the data length M of a group and the target false alarm probability $P_{fa}$, where the target false alarm probability is the maximum false alarm probability the system allows, that is, the maximum allowed probability of misjudging a noise signal as a speech signal. For example, the threshold factor T may be determined according to the following formula:
$T = F^{-1}(1 - P_{fa})$    Formula (1)
Figure PCTCN2018080447-appb-000001
Figure PCTCN2018080447-appb-000002
Optionally, in the embodiments of the present application, the initial noise power may be the average of the power of each group in the initial candidate noise set, or the sum of the power of each group in the initial candidate noise set, which is not limited in the embodiments of the present application.
After the initial noise threshold of the first data block is determined, the energy of each group in the initial candidate speech set may be compared in turn with the initial noise threshold; a group whose energy is less than the initial noise threshold may be regarded as a noise signal. Once the comparison is complete, the groups in the initial candidate speech set whose energy is less than the initial noise threshold are added to the initial candidate noise set to obtain the candidate noise set of the first iteration, and the groups among the N groups other than those in the candidate noise set of the first iteration are determined as the candidate speech set of the first iteration. In other words, the candidate noise set of the first iteration is the candidate noise set after the first update, and the candidate speech set of the first iteration is the candidate speech set after the first update.
Continuing the example above, if the initial candidate noise set includes group 1 to group 20 and the initial candidate speech set includes group 21 to group 100, the initial noise threshold may be compared with the energy of each of group 21 to group 100. If the energies of group 21 to group 40 in the initial candidate speech set are all less than the initial noise threshold, group 21 to group 40 may be added to the initial candidate noise set to obtain the candidate noise set of the first iteration, which includes group 1 to group 40, and the candidate speech set of the first iteration, which includes group 41 to group 100.
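A single update of this kind might look like the sketch below. It takes the noise power as the average energy of the noise set and treats the threshold factor as a given input, since its defining formulas are not reproduced in the text. Following step S140, a group whose energy equals the threshold is placed in the noise set. The function name and the value 3.0 are illustrative assumptions.

```python
def first_iteration(energies, noise, speech, threshold_factor):
    """Move low-energy groups from the speech set to the noise set once.

    noise, speech: lists of group indices into `energies`.
    threshold_factor: taken as given here; 3.0 below is only illustrative.
    """
    noise_power = sum(energies[i] for i in noise) / len(noise)  # mean noise energy
    threshold = noise_power * threshold_factor
    new_noise = list(noise) + [i for i in speech if energies[i] <= threshold]
    new_speech = [i for i in speech if energies[i] > threshold]
    return new_noise, new_speech, threshold


energies = [1.0, 1.0, 2.0, 2.0, 4.0, 5.0, 50.0, 60.0]
noise, speech, threshold = first_iteration(energies, [0, 1, 2, 3], [4, 5, 6, 7], 3.0)
print(threshold)  # 4.5
print(noise)      # [0, 1, 2, 3, 4]
print(speech)     # [5, 6, 7]
```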
Therefore, the method for voice detection of the embodiments of the present application uses group energy as the feature parameter, which smooths the noise and reduces the false alarm probability. Compared with existing voice detection based on single-frame energy, this helps improve detection accuracy; compared with voice detection based on other parameters, it helps reduce computational complexity.
Optionally, in the embodiments of the present application, the method 100 may further include:
determining the noise threshold of the kth iteration according to the energy of each group in the candidate noise set of the kth iteration, where k = 1, 2, ...; and
determining, according to the candidate speech set of the kth iteration and the noise threshold of the kth iteration, the candidate noise set of the (k+1)th iteration and the candidate speech set of the (k+1)th iteration.
In the embodiments of the present application, after the candidate noise set of the first iteration is determined, the noise threshold of the first iteration may be determined according to the energy of each group in that set, and this threshold may then be used to determine whether the candidate speech set of the first iteration still contains noise signals. If the energy of every group in the candidate speech set of the first iteration is greater than the noise threshold of the first iteration, the candidate speech set of the first iteration contains no noise signal and can be confirmed as the target speech set, and the candidate noise set of the first iteration can be determined as the target noise set.
Otherwise, the iteration continues until the energy of every group in the candidate speech set of the kth iteration is greater than the noise threshold of the kth iteration, or until the iteration count k reaches the upper limit. At that point, the candidate speech set of the kth iteration is determined as the target speech set, that is, its groups are all speech signals, and the candidate noise set of the kth iteration is determined as the target noise set, that is, its groups are all noise signals. This yields a decision for each of the N groups; in other words, it determines which of the N groups in the first data block are speech signals and which are noise signals.
Similar to the manner of determining the initial noise threshold described above, determining the noise threshold of the k-th iteration according to the energy of each packet in the candidate noise set of the k-th iteration may include:
determining the noise power of the k-th iteration according to the energy of each packet in the candidate noise set of the k-th iteration;
determining the product of the noise power of the k-th iteration and the threshold factor as the noise threshold of the k-th iteration.
Therefore, in this embodiment, the noise threshold used for speech detection is determined according to the threshold factor and the noise power. Through the iterative process above, the noise power is updated iteratively within each data block, and the noise threshold is updated along with it, which improves the robustness of speech detection within each data block.
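The iteration described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the function name `classify_packets`, the parameters `initial_noise_ratio` and `max_iter`, and their default values are hypothetical; the noise power is taken as the average energy of the candidate noise set (the text also allows the sum); and the threshold factor is passed in as a given value rather than derived from a false alarm probability.

```python
def classify_packets(energies, threshold_factor, initial_noise_ratio=0.25, max_iter=10):
    """Split packets into noise and speech index lists by iterative thresholding.

    energies: per-packet energies of one data block.
    threshold_factor: multiplier applied to the noise power (assumed given).
    initial_noise_ratio, max_iter: hypothetical tuning parameters.
    """
    n = len(energies)
    # Sort packet indices by ascending energy.
    order = sorted(range(n), key=lambda i: energies[i])
    # Initial candidate sets: the lowest-energy packets are assumed to be noise.
    n_noise = max(1, int(n * initial_noise_ratio))
    noise = set(order[:n_noise])
    speech = set(order[n_noise:])

    for _ in range(max_iter):
        # Noise power of this iteration: average energy of the candidate noise set.
        noise_power = sum(energies[i] for i in noise) / len(noise)
        threshold = threshold_factor * noise_power
        # Candidate speech packets at or below the threshold are reclassified as noise.
        moved = {i for i in speech if energies[i] <= threshold}
        if not moved:
            break  # every remaining speech packet exceeds the threshold
        noise |= moved
        speech -= moved

    return sorted(noise), sorted(speech)


noise_idx, speech_idx = classify_packets(
    [1.0, 1.2, 0.9, 1.1, 3.0, 20.0, 25.0, 1.0], threshold_factor=4.0
)
print(noise_idx, speech_idx)
```

In this example the loop stops on the second pass, because no packet left in the candidate speech set falls at or below the updated threshold.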
Optionally, in this embodiment, the apparatus for speech detection may further smooth the noise power of the current data block according to the noise power of the adjacent data block. Specifically, the following two cases are possible:
Case 1: if the first data block is the first data block in the to-be-processed data, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:
determining the average of the energies of the packets in the initial candidate noise set as the initial noise power.
That is, when the first data block is the first data block of the to-be-processed data, in other words, when no other data block precedes it, the average of the energies of the packets in its initial candidate noise set may be used directly as its initial noise power; alternatively, the sum of the energies of those packets may be used. For the manner of determining the initial candidate noise set of the first data block, refer to the related description of the foregoing embodiments; details are not repeated here.
Case 2: if the first data block is not the first data block in the to-be-processed data, and the data block preceding the first data block is a second data block, determining the initial noise power according to the energy of each packet in the initial candidate noise set includes:
determining the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energies of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energies of the packets in the target noise set of the second data block.
That is, when the first data block is not the first data block of the to-be-processed data, in other words, when other data blocks precede it, the initial noise power of the first data block may be determined according to the estimated noise power of the first data block and the target noise power of the preceding data block, namely the second data block.
Optionally, the estimated noise power of the first data block may be determined according to the energy of each packet in the initial candidate noise set of the first data block. For example, it may be the average of the energies of those packets, or it may be their sum. For the manner of determining the initial candidate noise set of the first data block, refer to the related description of the foregoing embodiments; details are not repeated here.
Optionally, the target noise power of the second data block may be determined according to the energy of each packet in the target noise set of the second data block. For example, it may be the average of the energies of those packets, or it may be their sum. The target noise set of the second data block may be determined in the same manner as the target noise set of the first data block; details are not repeated here.
In a specific implementation, the initial noise power of the first data block may be determined according to the following formula (4):
$$P_1 = \alpha P_1' + (1 - \alpha) P_2'' \tag{4}$$
where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and $0 < \alpha < 1$.
That is, the estimate of the noise power of the first data block (the estimated noise power) and the stabilized noise power of the preceding data block, namely the second data block (the target noise power), may be smoothed to obtain the initial noise power of the first data block. The initial noise threshold of the first data block may then be determined according to the initial noise power of the first data block and the threshold factor.
Therefore, the noise threshold used for speech detection in this embodiment is determined according to the threshold factor and the noise power. On the one hand, the noise power is updated iteratively within each data block, which makes speech detection within each data block more robust. On the other hand, the noise power is smoothed across data blocks, so that changes in ambient noise can be tracked adaptively and the noise threshold adapts well from block to block. As a result, detection remains robust for every data block of the to-be-processed data.
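As a minimal sketch of Case 1, Case 2, and formula (4), assuming the average (rather than the sum) is used for the estimated noise power; the function name `initial_noise_power` and the default `alpha=0.5` are hypothetical choices made for illustration only:

```python
def initial_noise_power(candidate_noise_energies, prev_target_noise_power=None, alpha=0.5):
    """Initial noise power of a data block.

    candidate_noise_energies: energies of the packets in the block's
        initial candidate noise set.
    prev_target_noise_power: target noise power P2'' of the preceding block,
        or None if this is the first block (Case 1).
    alpha: smoothing weight, 0 < alpha < 1 (default value is an assumption).
    """
    # Estimated noise power P1': average energy of the initial candidate noise set.
    estimated = sum(candidate_noise_energies) / len(candidate_noise_energies)
    if prev_target_noise_power is None:
        return estimated  # Case 1: no preceding block, no smoothing
    # Case 2, formula (4): P1 = alpha * P1' + (1 - alpha) * P2''
    return alpha * estimated + (1 - alpha) * prev_target_noise_power


print(initial_noise_power([1.0, 3.0]))                   # first block
print(initial_noise_power([1.0, 3.0], 4.0, alpha=0.25))  # later block
```

A smaller `alpha` gives more weight to the preceding block's stabilized noise power, so the threshold reacts more slowly to changes in ambient noise; a larger `alpha` tracks the current block more closely.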
In summary, the speech detection method of this embodiment keeps computational complexity low while adaptively adjusting the noise threshold to the ambient noise, so that it maintains detection performance and also achieves good robustness.
In this embodiment, after the decision results of the N packets in the first data block are determined, the method 100 may further include:
arranging the packets in the target speech set in chronological order;
determining the updated target speech set according to the time interval between adjacent packets in the target speech set.
Generally, a speech signal is continuous over a short period of time, so the decision results should also be continuous over a short period. However, abrupt changes in the energy of the original speech, or the influence of noise, may cause the decision results to switch frequently between speech and noise within a short time. For this reason, the decision results may be corrected.
In this embodiment, once the target speech set and the target noise set of the first data block are determined, the signal type of each of the N packets is known, that is, whether it is a speech signal or a noise signal. The N packets may then be arranged in order of sampling time, that is, restored to their original order.
In this case, the decision results of speech detection may be corrected according to the time interval between two adjacent packets that are speech signals, so as to determine the updated target speech set. For example, if the time interval between two adjacent speech packets is less than a preset threshold, the packets between them may also be determined to be speech signals and added to the target speech set, yielding the updated (in other words, corrected) target speech set.
Optionally, in this embodiment, the corrected target noise set may be determined in a similar manner; for brevity, details are not repeated here.
For example, suppose packet 21 and packet 30 are two adjacent packets that are speech signals and the time interval between them is 10 ms. Because this interval is short, the packets between packet 21 and packet 30 may also be determined to be speech signals; that is, packets 22 to 29 are also determined to be speech signals, which yields the updated target speech set.
Therefore, the speech detection method of this embodiment can also correct the decision results based on the fact that speech signals do not change abruptly, which improves the accuracy of speech detection.
The speech detection method according to an embodiment of the present application is described in detail below with reference to the overall flowchart shown in FIG. 2. As shown in FIG. 2, the method may include the following:
In this embodiment, the to-be-processed data may be divided into multiple data blocks for processing. Optionally, the length of a data block may be determined according to the application scenario or the processing capability. Assume that each data block includes L sampling points. The data length of a packet may be determined according to the processing capability and the detection precision; if the L sampling points are divided into N packets, the data length of each packet is M = [L/N].
S201: determine the threshold factor according to a preset false alarm probability.
For the specific implementation of S201, refer to the related description of the foregoing embodiments; details are not repeated here.
In S202, the energy of each packet in the data block is determined.
For example, the energies of the packets in the i-th data block of the to-be-processed data may be written as $P_i = [p_{i1}, p_{i2}, \ldots, p_{iN}]$, where $p_{ij}$ is the energy of the j-th packet of the i-th data block. The energy of a packet may be the sum of the powers of the sampling points in that packet, or it may be the average of the powers of those sampling points; this is not limited in the embodiments of the present application.
Further, the N packets may be sorted by energy, for example in ascending order of energy.
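Step S202 and the sorting step can be illustrated with the short Python sketch below. It rests on assumptions that the text does not state: the power of a sampling point is taken to be its squared amplitude, M is taken as the integer part of L/N with any leftover samples discarded, and the function names `packet_energies` and `ascending_order` are hypothetical.

```python
def packet_energies(block, n_packets):
    """Return the energy of each of n_packets equal-length packets in block.

    block: sequence of L sample values.
    Each packet holds M = L // n_packets samples (assumption: floor division,
    leftover samples at the end of the block are ignored).
    """
    m = len(block) // n_packets
    energies = []
    for j in range(n_packets):
        packet = block[j * m:(j + 1) * m]
        # Packet energy as the sum of per-sample powers (squared amplitudes).
        energies.append(sum(x * x for x in packet))
    return energies


def ascending_order(energies):
    """Return packet indices sorted by ascending energy."""
    return sorted(range(len(energies)), key=lambda j: energies[j])


energies = packet_energies([1, 1, 2, 2, 0, 0, 3], n_packets=3)
print(energies, ascending_order(energies))
```

Keeping the sorted indices, rather than sorting the energies themselves, makes it straightforward to restore the packets to their original time order when the decision results are corrected later.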
Then S203 is performed to determine whether the data block is the first data block in the to-be-processed data. If it is, S204 is performed; otherwise, S205 is performed.
In S204, the initial noise power of the data block is determined.
The implementation of S204 corresponds to Case 1 in the foregoing embodiment; for brevity, details are not repeated here.
In S205, the initial noise power of the data block is determined according to the estimated noise power of the data block and the target noise power of the preceding data block.
The implementation of S205 corresponds to Case 2 in the foregoing embodiment; for brevity, details are not repeated here.
Further, S206 may be performed to determine the noise threshold according to the noise power determined in S204 or S205 and the threshold factor determined in S201.
For example, the product of the noise power and the threshold factor may be determined as the noise threshold.
Then S207 is performed to re-determine the noise set and the speech set of the data block according to the noise threshold.
For example, the packets in the data block whose energy is greater than the noise threshold may be assigned to the speech set, and the packets whose energy is less than or equal to the noise threshold may be assigned to the noise set.
For example, if the initial noise power of the data block is determined from the energies of packet 1 to packet $k_1$ of the data block, then packets 1 to $k_1$ can be regarded as the initial candidate noise set, and packets $k_1 + 1$ to N as the initial candidate speech set. In S207, the noise threshold may be used to re-determine whether any of packets $k_1 + 1$ to N is a noise signal; a packet whose energy is less than or equal to the noise threshold may be determined to be a noise signal.
In S208, it is determined whether any new packet has been added to the noise set of the data block. If so, S209 is performed; otherwise, S210 is performed.
In S209, the updated noise power is determined from the re-determined noise set, and the process returns to S206 to determine an updated noise threshold from the updated noise power. S207 may then be performed again to re-determine the noise set and the speech set of the data block with the updated noise threshold. This repeats until a preset number of iterations is reached, or until the noise power, and therefore the noise threshold, stabilizes. At that point, packets whose energy is greater than the noise threshold may be determined to be speech signals, and packets whose energy is less than or equal to the noise threshold may be determined to be noise signals.
In S210, the decision result of each packet of the data block is output.
As described above, in this embodiment the decision results of speech detection may also be corrected. In one implementation, a label is set for the decision result of each packet; for example, packets that are speech signals are labeled 1 and packets that are noise signals are labeled 0. The packets are then sorted by sampling time, that is, restored to their original order, and the decision results are corrected according to the time interval between adjacent packets that are speech signals.
For example, let the label vector of the decision results be $V = [v_1, v_2, \ldots, v_N]$, $v_i \in \{0, 1\}$, where a label of 1 indicates that the packet at the corresponding position is a speech signal and a label of 0 indicates that it is a noise signal. From the label vector, the position vector of the speech signals in the data block can be determined as $W = (w_1, w_2, \ldots, w_k)$, $k < L$, $1 \le w_i \le N$, where $w_i$ identifies the time information of the i-th speech packet. Taking the difference between adjacent entries of the position vector gives $\Delta = (\Delta_1, \Delta_2, \ldots, \Delta_{k-1})$, where $\Delta_{k-1}$ is the time difference between $w_{k-1}$ and $w_k$. Because the interval between adjacent speech signals is not expected to be large, if $\Delta_1$ is less than a preset threshold, the packets between $w_1$ and $w_2$ may also be regarded as speech signals. Applying this to each difference yields the updated label vector $V'$, which is the final speech detection result of the data block.
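The correction of the label vector can be sketched as follows. The function name `fill_short_gaps` and the parameters `packet_duration_ms` and `max_gap_ms` are hypothetical, and the sketch assumes that all packets have the same duration, so that the time difference between two speech packets is their index difference multiplied by that duration.

```python
def fill_short_gaps(labels, packet_duration_ms, max_gap_ms):
    """Return a corrected copy of labels (1 = speech, 0 = noise).

    Packets lying between two adjacent speech packets are relabeled as
    speech when the time difference between those two speech packets is
    less than max_gap_ms (the preset threshold, an assumed parameter).
    """
    # Position vector W: indices of packets labeled as speech, in time order.
    positions = [i for i, v in enumerate(labels) if v == 1]
    corrected = list(labels)
    # Difference vector: time gap between each pair of adjacent speech packets.
    for a, b in zip(positions, positions[1:]):
        gap_ms = (b - a) * packet_duration_ms
        if gap_ms < max_gap_ms:
            for i in range(a + 1, b):
                corrected[i] = 1  # treat the short gap as speech
    return corrected


v = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
print(fill_short_gaps(v, packet_duration_ms=1.0, max_gap_ms=4.0))
```

Here the gap between the first two speech packets (3 ms) is below the threshold and is filled, while the gap between the last two (6 ms) is left unchanged.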
The method embodiments of the present application are described in detail above with reference to FIG. 1 and FIG. 2. The apparatus embodiments of the present application are described in detail below with reference to FIG. 3. It should be understood that the apparatus embodiments correspond to the method embodiments, and for similar descriptions reference may be made to the method embodiments.
FIG. 3 is a schematic structural diagram of an apparatus for speech detection according to an embodiment of the present application. As shown in FIG. 3, the apparatus 300 includes a determining module 310. The determining module 310 is configured to:
determine the energy of each of the N packets of a first data block in the to-be-processed data, where N is a positive integer;
determine an initial candidate noise set and an initial candidate speech set according to the energies of the N packets, where the maximum energy of the packets in the initial candidate noise set is less than the minimum energy of the packets in the initial candidate speech set;
determine an initial noise threshold according to the energy of each packet in the initial candidate noise set;
determine the candidate noise set of the first iteration and the candidate speech set of the first iteration according to the initial candidate speech set and the initial noise threshold, where the energies of the packets in the candidate noise set of the first iteration are all less than or equal to the initial noise threshold, and the energies of the packets in the candidate speech set of the first iteration are all greater than the initial noise threshold.
Optionally, in some embodiments, the determining module 310 is further configured to:
determine the noise threshold of the k-th iteration according to the energy of each packet in the candidate noise set of the k-th iteration, where k = 1, 2, ...;
determine the candidate noise set of the (k+1)-th iteration and the candidate speech set of the (k+1)-th iteration according to the candidate speech set of the k-th iteration and the noise threshold of the k-th iteration.
Optionally, in some embodiments, the determining module 310 is further configured to:
when the iteration count k reaches the iteration upper limit, determine the candidate speech set of the k-th iteration as the target speech set and the candidate noise set of the k-th iteration as the target noise set.
Optionally, in some embodiments, the determining module 310 is further configured to:
if the energies of the packets in the candidate speech set of the k-th iteration are all greater than the noise threshold of the k-th iteration, determine the candidate speech set of the k-th iteration as the target speech set and the candidate noise set of the k-th iteration as the target noise set.
Optionally, in some embodiments, the determining module 310 is further configured to:
arrange the packets in the target speech set in chronological order;
determine the updated target speech set according to the time interval between adjacent packets in the target speech set.
Optionally, in some embodiments, the determining module 310 is specifically configured to:
if the time interval between two adjacent packets in the target speech set is less than a preset threshold, determine that the packets between the two adjacent packets are also speech signals, and add those packets to the target speech set to obtain the updated target speech set.
Optionally, in some embodiments, the determining module 310 is specifically configured to:
determine the initial noise power according to the energy of each packet in the initial candidate noise set;
determine the product of the initial noise power and the threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
Optionally, in some embodiments, the first data block is the first data block in the to-be-processed data, and the determining module 310 is specifically configured to:
determine the average of the energies of the packets in the initial candidate noise set as the initial noise power.
Optionally, in some embodiments, the first data block is not the first data block in the to-be-processed data, the data block preceding the first data block is a second data block, and the determining module 310 is specifically configured to:
determine the initial noise power of the first data block according to the target noise power of the second data block and the estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energies of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energies of the packets in the target noise set of the second data block.
Optionally, in some embodiments, the determining module 310 is specifically configured to:
determine the initial noise power of the first data block according to the following formula:
$$P_1 = \alpha P_1' + (1 - \alpha) P_2''$$
where $P_1$ is the initial noise power of the first data block, $P_1'$ is the estimated noise power of the first data block, $P_2''$ is the target noise power of the second data block, and $0 < \alpha < 1$.
Optionally, in some embodiments, the determining module 310 is further configured to:
determine a certain proportion of the N packets having the lowest energies as the initial candidate noise set, and determine the other packets of the N packets as the initial candidate speech set; or
determine a certain number of the N packets having the lowest energies as the initial candidate noise set, and determine the other packets of the N packets as the initial candidate speech set.
Optionally, the determining module 310 may be a processor with processing capability. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like; this is not limited in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
Optionally, the apparatus 300 for speech detection may further include a memory. The memory may include a read-only memory and a random access memory, and provides instructions and data to the processor. A portion of the memory may also include a non-volatile random access memory. For example, the memory may also store information about the device type.
Optionally, in this embodiment, the memory may also be used to store the collected audio data.
本申请实施例还提出了一种计算机可读存储介质,该计算机可读存储介质存储一个或多个程序,该一个或多个程序包括指令,该指令当被包括多个应用程序的便携式电子设备执行时,能够使该便携式电子设备执行图1至图2所示实施例的方法。The embodiment of the present application further provides a computer readable storage medium storing one or more programs, the one or more programs including instructions, when the portable electronic device is included in a plurality of applications When executed, the portable electronic device can be caused to perform the method of the embodiment shown in Figures 1-2.
本申请实施例还提出了一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行图1至图2所示实施例的方法的相应流程。The embodiment of the present application also proposes a computer program comprising instructions which, when executed by a computer, cause the computer to perform the corresponding flow of the method of the embodiment shown in Figures 1 to 2.
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。It should be understood that the term "and/or" herein is merely an association relationship describing an associated object, indicating that there may be three relationships, for example, A and/or B, which may indicate that A exists separately, and A and B exist simultaneously. There are three cases of B alone. In addition, the character "/" in this article generally indicates that the contextual object is an "or" relationship.
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。It should be understood that, in the various embodiments of the present application, the size of the sequence numbers of the foregoing processes does not mean the order of execution sequence, and the order of execution of each process should be determined by its function and internal logic, and should not be applied to the embodiment of the present application. The implementation process constitutes any limitation.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing describes only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A voice detection method, characterized by comprising:
    determining the energy of each of N packets of a first data block in the data to be processed, where N is a positive integer;
    determining an initial candidate noise set and an initial candidate speech set according to the energies of the N packets, where the maximum energy of the packets in the initial candidate noise set is less than the minimum energy of the packets in the initial candidate speech set;
    determining an initial noise threshold according to the energy of each packet in the initial candidate noise set; and
    determining, according to the initial candidate speech set and the initial noise threshold, a candidate noise set of a first iteration and a candidate speech set of the first iteration, where the energy of every packet in the candidate noise set of the first iteration is less than or equal to the initial noise threshold, and the energy of every packet in the candidate speech set of the first iteration is greater than the initial noise threshold.
  2. The method according to claim 1, characterized in that the method further comprises:
    determining a noise threshold of a kth iteration according to the energy of each packet in the candidate noise set of the kth iteration, where k is 1, 2, ...; and
    determining a candidate noise set of a (k+1)th iteration and a candidate speech set of the (k+1)th iteration according to the candidate speech set of the kth iteration and the noise threshold of the kth iteration.
  3. The method according to claim 2, characterized in that the method further comprises:
    if the energy of every packet in the candidate speech set of the kth iteration is greater than the noise threshold of the kth iteration, determining the candidate speech set of the kth iteration as a target speech set and the candidate noise set of the kth iteration as a target noise set.
  4. The method according to claim 2, characterized in that the method further comprises:
    when the number of iterations k reaches an iteration upper limit, determining the candidate speech set of the kth iteration as a target speech set and the candidate noise set of the kth iteration as a target noise set.
  5. The method according to claim 3 or 4, characterized in that the method further comprises:
    arranging the packets in the target speech set in chronological order; and
    determining an updated target speech set according to the time intervals between adjacent packets in the target speech set.
  6. The method according to claim 5, characterized in that determining the updated target speech set according to the time intervals between adjacent packets in the target speech set comprises:
    if the time interval between two adjacent packets in the target speech set is less than a preset threshold, determining that the other packets lying between the two adjacent packets are also speech signals, and adding those packets to the target speech set to obtain the updated target speech set.
  7. The method according to any one of claims 1 to 6, characterized in that determining the initial noise threshold according to the energy of each packet in the initial candidate noise set comprises:
    determining an initial noise power according to the energy of each packet in the initial candidate noise set; and
    determining the product of the initial noise power and a threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
  8. The method according to claim 7, characterized in that the first data block is the first data block in the data to be processed, and determining the initial noise power according to the energy of each packet in the initial candidate noise set comprises:
    determining the average of the energies of the packets in the initial candidate noise set as the initial noise power.
  9. The method according to claim 7, characterized in that the first data block is not the first data block in the data to be processed, the data block preceding the first data block is a second data block, and determining the initial noise power according to the energy of each packet in the initial candidate noise set comprises:
    determining the initial noise power of the first data block according to a target noise power of the second data block and an estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energies of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energies of the packets in the target noise set of the second data block.
  10. The method according to any one of claims 1 to 9, characterized in that determining the initial candidate noise set and the initial candidate speech set according to the energies of the N packets comprises:
    determining a certain proportion of the N packets having the lowest energies as the initial candidate noise set, and determining the other packets of the N packets as the initial candidate speech set; or
    determining a certain number of the N packets having the lowest energies as the initial candidate noise set, and determining the other packets of the N packets as the initial candidate speech set.
  11. A voice detection apparatus, characterized by comprising a determining module configured to:
    determine the energy of each of N packets of a first data block in the data to be processed, where N is a positive integer;
    determine an initial candidate noise set and an initial candidate speech set according to the energies of the N packets, where the maximum energy of the packets in the initial candidate noise set is less than the minimum energy of the packets in the initial candidate speech set;
    determine an initial noise threshold according to the energy of each packet in the initial candidate noise set; and
    determine, according to the initial candidate speech set and the initial noise threshold, a candidate noise set of a first iteration and a candidate speech set of the first iteration, where the energy of every packet in the candidate noise set of the first iteration is less than or equal to the initial noise threshold, and the energy of every packet in the candidate speech set of the first iteration is greater than the initial noise threshold.
  12. The apparatus according to claim 11, characterized in that the determining module is further configured to:
    determine a noise threshold of a kth iteration according to the energy of each packet in the candidate noise set of the kth iteration, where k is 1, 2, ...; and
    determine a candidate noise set of a (k+1)th iteration and a candidate speech set of the (k+1)th iteration according to the candidate speech set of the kth iteration and the noise threshold of the kth iteration.
  13. The apparatus according to claim 12, characterized in that the determining module is further configured to:
    when the number of iterations k reaches an iteration upper limit, determine the candidate speech set of the kth iteration as a target speech set and the candidate noise set of the kth iteration as a target noise set.
  14. The apparatus according to claim 12, characterized in that the determining module is further configured to:
    if the energy of every packet in the candidate speech set of the kth iteration is greater than the noise threshold of the kth iteration, determine the candidate speech set of the kth iteration as a target speech set and the candidate noise set of the kth iteration as a target noise set.
  15. The apparatus according to claim 13 or 14, characterized in that the determining module is further configured to:
    arrange the packets in the target speech set in chronological order; and
    determine an updated target speech set according to the time intervals between adjacent packets in the target speech set.
  16. The apparatus according to claim 15, characterized in that the determining module is specifically configured to:
    if the time interval between two adjacent packets in the target speech set is less than a preset threshold, determine that the other packets lying between the two adjacent packets are also speech signals, and add those packets to the target speech set to obtain the updated target speech set.
  17. The apparatus according to any one of claims 11 to 16, characterized in that the determining module is specifically configured to:
    determine an initial noise power according to the energy of each packet in the initial candidate noise set; and
    determine the product of the initial noise power and a threshold factor as the initial noise threshold, where the threshold factor is determined according to a target false alarm probability.
  18. The apparatus according to claim 17, characterized in that the first data block is the first data block in the data to be processed, and the determining module is specifically configured to:
    determine the average of the energies of the packets in the initial candidate noise set as the initial noise power.
  19. The apparatus according to claim 17, characterized in that the first data block is not the first data block in the data to be processed, the data block preceding the first data block is a second data block, and the determining module is specifically configured to:
    determine the initial noise power of the first data block according to a target noise power of the second data block and an estimated noise power of the first data block, where the estimated noise power of the first data block is the average of the energies of the packets in the initial candidate noise set of the first data block, and the target noise power of the second data block is the average of the energies of the packets in the target noise set of the second data block.
  20. The apparatus according to any one of claims 11 to 19, characterized in that the determining module is further configured to:
    determine a certain proportion of the N packets having the lowest energies as the initial candidate noise set, and determine the other packets of the N packets as the initial candidate speech set; or
    determine a certain number of the N packets having the lowest energies as the initial candidate noise set, and determine the other packets of the N packets as the initial candidate speech set.
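
The iterative procedure recited in the claims can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the parameter names and the default values (`noise_fraction`, `threshold_factor`, `max_iterations`, `max_gap`) are hypothetical. The sketch also makes several assumptions: every data block is treated as the first data block (noise power is the plain average of the candidate noise energies), the threshold factor is passed in directly rather than derived from a target false alarm probability, packets that fall at or below the threshold are accumulated into the noise set, and packet indices stand in for time.

```python
from typing import List, Sequence, Set, Tuple


def detect_speech_packets(
    energies: Sequence[float],
    noise_fraction: float = 0.25,   # hypothetical: share of lowest-energy packets used as initial noise
    threshold_factor: float = 3.0,  # hypothetical: stands in for the false-alarm-derived factor
    max_iterations: int = 10,       # hypothetical iteration upper limit
) -> Tuple[Set[int], Set[int]]:
    """Return (speech_indices, noise_indices) for one data block."""
    if not energies:
        raise ValueError("need at least one packet")
    n = len(energies)
    # Initial split: the lowest-energy packets form the candidate noise set.
    # With tied energies the strict "max noise < min speech" condition may not hold.
    order = sorted(range(n), key=lambda i: energies[i])
    n_noise = max(1, min(n - 1, int(n * noise_fraction))) if n > 1 else 1
    noise = set(order[:n_noise])
    speech = set(order[n_noise:])
    for _ in range(max_iterations):
        # Noise power is the mean energy of the current candidate noise set.
        noise_power = sum(energies[i] for i in noise) / len(noise)
        threshold = threshold_factor * noise_power
        # Packets at or below the threshold are reclassified as noise.
        moved = {i for i in speech if energies[i] <= threshold}
        if not moved:
            break  # every remaining speech packet exceeds the threshold
        speech -= moved
        noise |= moved
    return speech, noise


def fill_short_gaps(speech: Set[int], max_gap: int) -> List[int]:
    """Add packets lying between two speech packets closer than max_gap."""
    ordered = sorted(speech)  # chronological order (index used as time)
    filled = set(ordered)
    for a, b in zip(ordered, ordered[1:]):
        if b - a < max_gap:
            filled.update(range(a + 1, b))
    return sorted(filled)


if __name__ == "__main__":
    block = [1.0, 1.0, 1.0, 9.0, 1.0, 9.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    speech, noise = detect_speech_packets(block)
    print(sorted(speech), fill_short_gaps(speech, max_gap=3))
```

In this sketch the loop stops either when no candidate speech packet is at or below the current threshold or when the iteration cap is reached, mirroring the two stopping conditions in the claims. The gap-filling step then treats a short low-energy stretch between two speech packets as speech.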
PCT/CN2018/080447 2018-03-26 2018-03-26 Voice detection method and apparatus WO2019183747A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/080447 WO2019183747A1 (en) 2018-03-26 2018-03-26 Voice detection method and apparatus
CN201880000470.4A CN110537223B (en) 2018-03-26 2018-03-26 Voice detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/080447 WO2019183747A1 (en) 2018-03-26 2018-03-26 Voice detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2019183747A1 true WO2019183747A1 (en) 2019-10-03

Family

ID=68059408

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/080447 WO2019183747A1 (en) 2018-03-26 2018-03-26 Voice detection method and apparatus

Country Status (2)

Country Link
CN (1) CN110537223B (en)
WO (1) WO2019183747A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226741A (en) * 2007-12-28 2008-07-23 无敌科技(西安)有限公司 Method for detecting movable voice endpoint
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN102201231A (en) * 2010-03-23 2011-09-28 创杰科技股份有限公司 Voice sensing method
CN103716470A (en) * 2012-09-29 2014-04-09 华为技术有限公司 Method and device for speech quality monitoring
CN105810201A (en) * 2014-12-31 2016-07-27 展讯通信(上海)有限公司 Voice activity detection method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7146314B2 (en) * 2001-12-20 2006-12-05 Renesas Technology Corporation Dynamic adjustment of noise separation in data handling, particularly voice activation
CN1540623A (en) * 2003-11-04 2004-10-27 清华大学 Threshold self-adaptive speech sound detection system
CN101599269B (en) * 2009-07-02 2011-07-20 中国农业大学 Phonetic end point detection method and device therefor
US20150287406A1 (en) * 2012-03-23 2015-10-08 Google Inc. Estimating Speech in the Presence of Noise
CN103730110B (en) * 2012-10-10 2017-03-01 北京百度网讯科技有限公司 A kind of method and apparatus of detection sound end
CN105513614B (en) * 2015-12-03 2019-05-03 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of area You Yin detection method based on noise power spectrum Gamma statistical distribution model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475671A (en) * 2020-03-12 2020-07-31 支付宝(杭州)信息技术有限公司 Voice document processing method and device and server
CN111475671B (en) * 2020-03-12 2023-09-26 支付宝(杭州)信息技术有限公司 Voice document processing method and device and server

Also Published As

Publication number Publication date
CN110537223A (en) 2019-12-03
CN110537223B (en) 2022-07-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18912742

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18912742

Country of ref document: EP

Kind code of ref document: A1