WO2018068636A1 - Voice signal detection method and apparatus - Google Patents
Voice signal detection method and apparatus
- Publication number
- WO2018068636A1 WO2018068636A1 PCT/CN2017/103489 CN2017103489W WO2018068636A1 WO 2018068636 A1 WO2018068636 A1 WO 2018068636A1 CN 2017103489 W CN2017103489 W CN 2017103489W WO 2018068636 A1 WO2018068636 A1 WO 2018068636A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio signal
- short
- energy
- signal
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Definitions
- the present application relates to the field of computer technologies, and in particular, to a voice signal detecting method and apparatus.
- the smart device needs to record all the time or record according to a preset period, and judge whether the obtained audio signal contains a voice signal; if a voice signal is included, the voice signal is extracted, then processed and sent out, thus completing the transmission of the voice message.
- a dual-threshold method, a detection method based on the autocorrelation maximum, or a detection method based on the wavelet transform is generally used to detect whether the acquired audio signal contains a voice signal.
- these methods basically obtain the frequency characteristics of the audio signal through complex calculations such as the Fourier transform and then determine whether a voice signal is included according to those frequency characteristics; they need to buffer and process large amounts of data, so memory occupancy is high, the amount of calculation is large, the processing speed is slow, and the power consumption is large.
- the embodiment of the present application provides a voice signal detection method and apparatus, which are used to solve the problem that voice signal detection methods in the prior art have a slow processing speed and consume a large amount of resources.
- a method for detecting a voice signal comprising:
- detecting whether the audio signal contains a voice signal according to the energy of each short-term energy frame.
- a voice signal detecting device comprising:
- a dividing module, configured to divide the audio signal into a plurality of short-term energy frames according to the frequency of a preset voice signal; and
- a detecting module, configured to detect whether the audio signal contains a voice signal according to the energy of each short-term energy frame.
- the voice signal detection method used in the embodiment of the present application does not need to perform complex calculation such as Fourier transform.
- the voice signal detecting method provided by the embodiment of the present invention can solve the problem that the processing method of the voice signal detecting method in the prior art has a slow processing speed and consumes a large amount of resources.
- FIG. 1 is a specific flowchart of a method for detecting a voice signal according to an embodiment of the present application
- FIG. 2 is a specific flowchart of another method for detecting a voice signal according to an embodiment of the present application
- FIG. 3 is a diagram showing an audio signal display of a preset duration according to an embodiment of the present application
- FIG. 4 is a schematic structural diagram of a voice signal detecting apparatus according to an embodiment of the present disclosure.
- the embodiment of the present application provides a voice signal detection method.
- the execution body of the method may be, but is not limited to, a user terminal such as a mobile phone, a tablet computer, or a personal computer (PC), an application (APP) running on such user terminals, or a device such as a server.
- a schematic flowchart of the method is shown in FIG. 1, and the method includes the following steps:
- Step 101 Acquire an audio signal.
- the above-mentioned audio signal may be an audio signal collected by the APP through an audio collection device, or an audio signal received by the APP, such as an audio signal transmitted by another APP or device; this is not limited in the embodiment of the present application. After the APP acquires the audio signal, it can save the audio signal locally.
- the present application also does not impose any limitation on the sampling rate, duration, format or channel corresponding to the above audio signal.
- the above APP can be any type of APP, such as a chat APP or a payment APP, as long as the APP can acquire an audio signal; it can then detect a voice signal in the acquired audio signal by using the voice signal detection method provided by the embodiment of the present application.
- Step 102 Divide the audio signal into a plurality of short-term energy frames according to a frequency of the preset voice signal.
- the short-term energy frame described above is actually a segment of the audio signal acquired in step 101.
- specifically, the period of the preset voice signal may be determined according to the frequency of the preset voice signal, and the audio signal acquired in step 101 is divided, according to the determined period, into short-term energy frames of the corresponding duration. For example, if the period of the preset voice signal is 0.01 s, the audio signal may be divided, according to its duration, into a number of short-term energy frames each lasting 0.01 s. It should be noted that, when dividing the audio signal acquired in step 101, the audio signal may in practice be divided into at least two short-term energy frames according to the frequency of the preset voice signal. For convenience, the following description takes dividing the audio signal into multiple short-term energy frames as an example.
- when the audio signal is collected by the APP itself through an audio collection device in step 101, the audio signal, which is actually an analog signal, is generally sampled into a digital signal at a certain sampling rate, that is, an audio signal in pulse code modulation (PCM) format; therefore, the audio signal can also be divided into a plurality of short-term energy frames according to the sampling rate of the audio signal and the frequency of the preset voice signal.
- specifically, a ratio m of the sampling rate of the audio signal to the frequency of the preset voice signal may be determined, and the collected sampling points of the digital signal are then grouped into short-term energy frames of m sampling points each. If m is a positive integer, the audio signal may be divided into the maximum possible number of such short-term energy frames; if m is not a positive integer, m may first be rounded to a positive integer, and the audio signal is then divided into the maximum possible number of short-term energy frames accordingly.
- the remaining sampling points may be discarded.
- alternatively, the remaining sampling points can also be treated as one additional short-term energy frame for subsequent processing.
- the above m indicates the number of sampling points of the audio signal acquired in step 101 that fall within one period of the preset voice signal.
- for example, suppose the frequency of the preset voice signal is 82 Hz, the duration of the audio signal acquired in step 101 is 1 s, and the sampling rate is 16000 Hz. Then m = 16000 / 82 ≈ 195.1, which is not a positive integer, and is rounded to the positive integer 195. From the duration of the audio signal and the sampling rate, the audio signal contains 16000 sampling points, so it can be divided into 82 short-term energy frames of 195 sampling points each, and the remaining 10 sampling points are discarded, as illustrated in the sketch below.
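As an illustration of this frame-division step, here is a minimal sketch in Python. It is not the patented implementation; the function name `split_into_frames`, the use of rounding, and the `keep_remainder` option for treating the leftover samples as one extra frame are assumptions made for illustration.

```python
def split_into_frames(samples, sample_rate_hz, preset_voice_freq_hz=82, keep_remainder=False):
    """Divide PCM samples into short-term energy frames of m samples each,
    where m = sampling rate / preset voice frequency, rounded to an integer."""
    m = round(sample_rate_hz / preset_voice_freq_hz)  # e.g. 16000 / 82 ≈ 195.1 -> 195
    frames = [samples[i:i + m] for i in range(0, len(samples) - m + 1, m)]
    leftover = samples[len(frames) * m:]
    if keep_remainder and len(leftover) > 0:
        frames.append(leftover)  # optionally keep the remaining samples as one extra frame
    return frames

# With a 1 s signal at 16000 Hz and an 82 Hz preset frequency this yields
# 82 frames of 195 samples each; the remaining 10 samples are discarded by default.
```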
- when the audio signal acquired in step 101 is an audio signal received from another APP or device, the audio signal may likewise be divided into a plurality of short-term energy frames by any of the above methods.
- however, the format of such a received audio signal may not be PCM. If the short-term energy frames are to be divided according to the sampling rate of the audio signal and the frequency of the preset voice signal as described above, the received audio signal is first converted into PCM format, and the sampling rate of the audio signal needs to be identified when it is received; the sampling rate can be identified by existing methods, which are not repeated here.
- Step 103 Determine the energy of each short-term energy frame.
- specifically, the amplitude of the audio signal corresponding to each sampling point in the short-term energy frame may be used to determine the energy of the short-term energy frame. The energy of each sampling point may be determined according to the amplitude of the audio signal corresponding to that sampling point, and these energies are then summed; the resulting sum is taken as the energy of the short-term energy frame.
- the following formula can be used to determine the energy E of a short-term energy frame: E = Σ_{i=1}^{n} (A_i[t])^2, where i denotes the i-th sampling point of the audio signal, n is the number of sampling points contained in the short-term energy frame, and A_i[t] is the amplitude of the audio signal corresponding to the i-th sampling point; the amplitude of the short-term energy frame ranges from -32768 to 32767.
- in addition, the amplitude obtained when the audio signal is collected may be divided by 32768 and the result used as the normalized amplitude of the short-term energy frame; in that case, the normalized amplitude of the short-term energy frame ranges from -1 to 1 (see the sketch below).
- when the audio signal is not in PCM format, an amplitude function can be determined from the amplitude of the short-term energy frame at each moment, the square of that function is integrated, and the final result of the integration is taken as the energy of the short-term energy frame.
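A sketch of the per-frame energy computation described above is given below: the energy is the sum of squared sample amplitudes, with optional normalization by 32768 for 16-bit PCM. The function name and the `normalize` flag are illustrative assumptions, not part of the original filing.

```python
def frame_energy(frame, normalize=False):
    """Energy of one short-term energy frame: E = sum over i of (A_i)^2.
    For 16-bit PCM the raw amplitudes lie in [-32768, 32767]; dividing by
    32768 first gives normalized amplitudes in roughly [-1, 1]."""
    scale = 32768.0 if normalize else 1.0
    return sum((a / scale) ** 2 for a in frame)

# Example: energies of all frames produced by split_into_frames(...)
# energies = [frame_energy(f, normalize=True) for f in frames]
```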
- Step 104 Detect whether a voice signal is included in the audio signal according to the energy of each short-term energy frame.
- the following two methods may be used to determine whether a voice signal is detected in the audio signal:
- Method 1: determine the ratio of the number of short-term energy frames whose energy is greater than a preset threshold to the total number of all short-term energy frames (hereinafter referred to as the high-energy frame ratio), and determine whether this high-energy frame ratio is greater than a preset ratio. If so, it is determined that the audio signal contains a voice signal; if not, it is determined that no voice signal is detected in the audio signal.
- a high-energy frame ratio: the ratio of the number of short-term energy frames whose energy is greater than the preset threshold to the total number of all short-term energy frames.
- the preset threshold and the preset ratio may be set according to actual needs; for example, the preset threshold may be set to 2 and the preset ratio to 20%, in which case the audio signal is considered to contain a voice signal when the high-energy frame ratio is greater than 20%.
- Method 1 can be used to determine whether a voice signal is detected in the audio signal because, in real life, when people talk there is more or less noise in the external environment, and this noise generally has lower energy than what people say. Therefore, if an audio signal contains short-term energy frames whose energy is higher than the preset threshold, and these short-term energy frames account for a certain proportion of the audio signal, the audio signal may be considered to contain a voice signal.
- Method 2: to make the final detection result more accurate, the high-energy frame ratio can be determined as in Method 1 and compared with the preset ratio. If it is not greater than the preset ratio, it is determined that no voice signal is detected in the audio signal. If it is greater, then when there are at least N consecutive short-term energy frames whose energy is greater than the preset threshold, it is determined that the audio signal contains a voice signal; when there are no such N consecutive short-term energy frames, it is determined that no voice signal is detected in the audio signal.
- N can be any positive integer. In the embodiment of the present application, N can be set to 10.
- compared with Method 1, Method 2 adds a condition for determining whether the audio signal contains a voice signal: whether there are at least N consecutive short-term energy frames whose energy is greater than the preset threshold. This effectively reduces the influence of noise: in real life, noise generally has lower energy than human speech and occurs randomly, so Method 2 can effectively exclude occasional loud noise in the audio signal, reduce the influence of noise in the external environment, and thus play a noise-reduction role. A sketch of this decision logic follows below.
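The two decision methods can be sketched together as follows; the default values mirror the examples in the text (threshold 2, ratio 20%, N = 10), but the exact thresholds and the function interface are assumptions for illustration. Setting `min_consecutive=1` effectively reduces Method 2 to Method 1.

```python
def contains_voice(energies, threshold=2.0, min_ratio=0.20, min_consecutive=10):
    """Method 2: the audio is judged to contain voice only if the high-energy
    frame ratio exceeds min_ratio AND there is a run of at least
    min_consecutive consecutive frames whose energy exceeds the threshold."""
    if not energies:
        return False
    high = [e > threshold for e in energies]
    if sum(high) / len(high) <= min_ratio:      # high-energy frame ratio check (Method 1)
        return False
    run = longest = 0
    for is_high in high:                        # longest run of consecutive high-energy frames
        run = run + 1 if is_high else 0
        longest = max(longest, run)
    return longest >= min_consecutive
```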
- the above-mentioned voice signal detecting method can be applied to detecting a mono audio signal, a two-channel audio signal, or a multi-channel audio signal.
- the audio signal collected through one channel is a mono audio signal; the audio signal collected through two channels is a two-channel audio signal; and the audio signal collected through multiple channels is a multi-channel audio signal.
- for a two-channel or multi-channel audio signal, the audio signal of each channel may be detected according to the operations mentioned in steps 101 to 104; finally, based on the detection result for the audio signal of each channel, it is judged whether the acquired audio signal contains a voice signal.
- for a mono audio signal, the operations mentioned in steps 101 to 104 can be performed on the audio signal directly, and that detection result is used as the final detection result.
- specifically, the audio signal of each channel is processed according to the operations in steps 101 to 104. If it is detected that the audio signal of no channel contains a voice signal, it is determined that the audio signal acquired in step 101 does not contain a voice signal; if it is detected that the audio signal of at least one channel contains a voice signal, it is determined that the audio signal acquired in step 101 contains a voice signal (see the sketch below).
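For the multi-channel case, the per-channel rule above amounts to an OR over channels; a sketch (building on the hypothetical helpers from the earlier sketches) might look like this.

```python
def contains_voice_multichannel(channels, sample_rate_hz):
    """The acquired audio is judged to contain a voice signal if the audio
    signal of at least one channel does (steps 101-104 applied per channel)."""
    return any(
        contains_voice([frame_energy(f, normalize=True)
                        for f in split_into_frames(ch, sample_rate_hz)])
        for ch in channels
    )
```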
- the frequency of the preset voice signal mentioned in step 102 may be the frequency of any voice; this is not limited in the present application. In an actual application, different frequencies of the preset voice signal may be set for different audio signals acquired in step 101 according to actual conditions. It should be specially noted that the preset voice signal may have any vocal frequency, such as the frequency of a soprano or of a bass, as long as the resulting short-term energy frames satisfy the following condition: the duration of a short-term energy frame is not less than the period corresponding to the audio signal acquired in step 101.
- generally, the frequency of the preset voice signal can be set to the minimum vocal frequency, that is, 82 Hz. Since the period is the reciprocal of the frequency, if the frequency of the preset voice signal is the minimum vocal frequency, the period of the preset voice signal is the maximum vocal period; therefore, regardless of the period of the audio signal acquired in step 101, the duration of the short-term energy frames is not less than the period of the acquired audio signal.
- the duration of the short-term energy frame is required to be not less than the period of the audio signal acquired in step 101 because the detection method provided by the embodiment of the present application detects whether the audio signal contains a voice signal based on the characteristics of human speech, which has higher energy and is more stable and more continuous than noise. If the duration of the short-term energy frame were smaller than the period of the audio signal acquired in step 101, the waveform corresponding to the short-term energy frame would not contain a complete period of the waveform, and the short-term energy frame would be too short to reflect these characteristics reliably.
- the duration of the audio signal acquired in step 101 should be greater than the maximum period of a human voice.
- the voice signal detection method provided by the embodiment of the present application is particularly suitable for an application scenario in which a chat APP completes the sending of a voice message without any click operation by the user. The voice signal detection method provided by the embodiment of the present application is therefore described in detail below for this scenario. In this scenario, a schematic flowchart of the method is shown in FIG. 2, and the method includes the following steps:
- Step 201 An audio signal is collected in real time.
- in this scenario, so that a voice message can be sent without any click operation, the APP can record the external environment continuously once the user starts it, collecting the audio signal in real time to avoid missing the user's speech; the collected audio signal can be saved locally in real time until the APP stops recording.
- Step 202 The audio signal of the preset duration is intercepted from the collected audio signal in real time.
- the APP can intercept the audio signal of the preset duration in the audio signal collected in step 201 in real time, and perform subsequent detection on the audio signal of the preset duration.
- for ease of description, the audio signal of the preset duration intercepted this time is referred to as the current audio signal, and the audio signal of the preset duration intercepted last time is referred to as the previously acquired audio signal.
- Step 203 The audio signal of the preset duration is divided into a plurality of short-term energy frames according to the frequency of the preset voice signal.
- Step 204 The energy of each short-term energy frame is determined.
- Step 205 Detect whether a voice signal is included in the audio signal of the preset duration according to the energy of each short-term energy frame.
- if it is detected that the current audio signal contains a voice signal, it may be determined whether the previously acquired audio signal contains a voice signal; if the previously acquired audio signal does not contain a voice signal, the starting point of the current audio signal may be determined as the starting point of the voice signal; if it is determined that the previously acquired audio signal contains a voice signal, the starting point of the current audio signal is not the starting point of the voice signal.
- similarly, if it is detected that the current audio signal does not contain a voice signal, it may be determined whether the previously acquired audio signal contains a voice signal; if the previously acquired audio signal contains a voice signal, the end point of the previously acquired audio signal is determined as the end point of the voice signal; if the previously acquired audio signal does not contain a voice signal, neither the end point of the current audio signal nor that of the previously acquired audio signal is the end point of the voice signal.
- for example, suppose A, B, C, and D are four adjacent audio signals of the preset duration, where A and D do not contain a voice signal and B and C do. Then the starting point of B can be determined as the starting point of the voice signal, and the end point of C can be determined as the end point of the voice signal.
- in practice, the current audio signal may cover just the beginning or the end of a sentence spoken by the user, so that it contains only a small amount of voice signal.
- in that case, the APP may incorrectly determine that the audio signal does not contain a voice signal. To avoid such a misjudgment causing part of the user's speech to be missed, after it is detected that the current audio signal contains a voice signal, it is determined whether the previously acquired audio signal contains a voice signal; if it is determined that the previously acquired audio signal does not contain a voice signal, the starting point of the previously acquired audio signal can be determined as the starting point of the voice signal.
- similarly, after it is detected that the current audio signal does not contain a voice signal, it may be determined whether the previously acquired audio signal contains a voice signal; if it is determined that the previously acquired audio signal contains a voice signal, the end point of the current audio signal is determined as the end point of the voice signal.
- in the above example, the starting point of A can then be determined as the starting point of the voice signal, and the end point of D can be determined as the end point of the voice signal (a sketch of this bookkeeping follows below).
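The start/end bookkeeping across consecutive fixed-duration windows, including the tolerance just described (placing the start at the beginning of the previous, silent window and the end at the end of the current, silent window), can be sketched as follows; the window-index arithmetic and the returned event labels are assumptions made for illustration.

```python
def update_endpoints(prev_has_voice, curr_has_voice, window_index, window_duration_s):
    """Returns ('start', time) when voice appears after a silent window,
    ('end', time) when voice disappears after a voiced window, else (None, None).
    The start is placed at the beginning of the previous window and the end at
    the end of the current window, so boundary speech is not clipped."""
    if curr_has_voice and not prev_has_voice:
        return "start", max(window_index - 1, 0) * window_duration_s
    if prev_has_voice and not curr_has_voice:
        return "end", (window_index + 1) * window_duration_s
    return None, None
```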
- after the starting point and the end point of the voice signal are determined, the audio signal may be sent to a voice recognition device, so that the voice recognition device can perform voice processing on the audio signal to obtain a voice result; the voice recognition device then sends the audio signal to a subsequent processing device, which ultimately sends the audio signal out as a voice message.
- specifically, the APP may send all of the audio signal between the determined starting point and end point of the voice signal to the voice recognition device, and then send the voice recognition device an audio termination signal informing it that the phrase the user is currently speaking has been completed, so that the voice recognition device sends these audio signals together to the subsequent processing device, which finally sends them out as a voice message.
- alternatively, a sub-signal of a preset time period may be intercepted from the previously acquired audio signal, and the current audio signal and the intercepted sub-signal may be spliced together; the audio signal obtained in this way (hereinafter referred to as the spliced audio signal) is then used for the subsequent voice signal detection.
- specifically, the sub-signal can be spliced before the current audio signal. The preset time period can be the tail period of the previously acquired audio signal, and the duration corresponding to that period may be any length of time; for example, the duration corresponding to the preset time period may be set to be no more than the product of the duration corresponding to the spliced audio signal and the preset ratio.
- if it is detected that the spliced audio signal contains a voice signal, it may be determined whether the previously obtained spliced audio signal contains a voice signal; if it does not, the starting point of the spliced audio signal may be taken as the starting point of the voice signal. If it is detected that the spliced audio signal does not contain a voice signal, it may be determined whether the previously obtained spliced audio signal contains a voice signal; if it does, the end point of the spliced audio signal may be taken as the end point of the voice signal. A sketch of this splicing follows below.
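The splicing variant can be sketched as below: the tail of the previously acquired window is prepended to the current window so that speech falling across the window boundary is not split. The `tail_ratio` parameter and the conversion from "no more than the preset ratio of the spliced signal" to a sample count are illustrative assumptions.

```python
def splice_with_previous_tail(prev_window, curr_window, tail_ratio=0.2):
    """Prepend the tail of the previous window to the current one. The tail length
    is chosen so that it is at most tail_ratio of the spliced signal's length:
    tail <= (tail + len(curr)) * tail_ratio  =>  tail <= len(curr) * r / (1 - r)."""
    tail_len = int(len(curr_window) * tail_ratio / (1.0 - tail_ratio))
    tail = prev_window[-tail_len:] if tail_len > 0 else []
    return list(tail) + list(curr_window)

# spliced = splice_with_previous_tail(previous_samples, current_samples)
# The spliced signal is then divided into frames and checked for voice as before.
```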
- in addition, the APP may record continuously as described above, or it may instead perform recording periodically.
- the voice signal detecting method provided by the embodiment of the present application can also be implemented by a voice signal detecting device.
- the specific structure of the apparatus is shown in FIG. 4, and it mainly includes the following modules:
- an obtaining module 41, configured to acquire an audio signal;
- a dividing module 42, configured to divide the audio signal into a plurality of short-term energy frames according to the frequency of a preset voice signal;
- a determining module 43, configured to determine the energy of each short-term energy frame; and
- a detecting module 44, configured to detect whether the audio signal contains a voice signal according to the energy of each short-term energy frame.
- optionally, the obtaining module 41 is configured to acquire the current audio signal, intercept a sub-signal of a preset time period from the previously acquired audio signal, and splice the current audio signal and the intercepted sub-signal together as the acquired audio signal.
- optionally, the dividing module 42 is configured to determine the period of the preset voice signal according to the frequency of the preset voice signal, and to divide the audio signal into a plurality of short-term energy frames of the same duration according to the determined period.
- optionally, the detecting module 44 is configured to determine the ratio of the number of short-term energy frames whose energy is greater than the preset threshold to the total number of all short-term energy frames; when the ratio is greater than the preset ratio and there are at least N consecutive short-term energy frames whose energy is greater than the preset threshold, it is determined that the audio signal contains a voice signal; when there are no such at least N consecutive short-term energy frames, it is determined that no voice signal is detected in the audio signal.
- the voice signal detection method used in the embodiment of the present application does not need to perform complex calculation such as Fourier transform.
- the voice signal detecting method provided by the embodiment of the present invention can solve the problem that the processing method of the voice signal detecting method in the prior art has a slow processing speed and consumes a large amount of resources.
- these computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
- these computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
- a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
- the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
- Memory is an example of a computer readable medium.
- Computer readable media includes both permanent and non-persistent, removable and non-removable media.
- Information storage can be implemented by any method or technology.
- the information can be computer readable instructions, data structures, modules of programs, or other data.
- Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
- as defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
- embodiments of the present application can be provided as a method, system, or computer program product.
- the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
- the application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephone Function (AREA)
- Circuits Of Receivers In General (AREA)
- Mobile Radio Communication Systems (AREA)
- Electric Clocks (AREA)
- Time-Division Multiplex Systems (AREA)
- Signal Processing For Digital Recording And Reproducing (AREA)
Priority Applications (7)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020197013519A KR102214888B1 (ko) | 2016-10-12 | 2017-09-26 | 오디오 신호를 검출하기 위한 방법 및 디바이스 |
| JP2019520035A JP6859499B2 (ja) | 2016-10-12 | 2017-09-26 | 音声信号検出方法及び装置 |
| MYPI2019001999A MY201634A (en) | 2016-10-12 | 2017-09-26 | Voice signal detection method and apparatus |
| SG11201903320XA SG11201903320XA (en) | 2016-10-12 | 2017-09-26 | Voice signal detection method and apparatus |
| EP17860814.7A EP3528251B1 (en) | 2016-10-12 | 2017-09-26 | Method and device for detecting audio signal |
| PH1/2019/500784A PH12019500784B1 (en) | 2016-10-12 | 2017-09-26 | Voice signal detection method and apparatus |
| US16/380,609 US10706874B2 (en) | 2016-10-12 | 2019-04-10 | Voice signal detection method and apparatus |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610890946.9 | 2016-10-12 | ||
| CN201610890946.9A CN106887241A (zh) | 2016-10-12 | 2016-10-12 | 一种语音信号检测方法与装置 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/380,609 Continuation US10706874B2 (en) | 2016-10-12 | 2019-04-10 | Voice signal detection method and apparatus |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018068636A1 true WO2018068636A1 (zh) | 2018-04-19 |
Family
ID=59176496
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/103489 Ceased WO2018068636A1 (zh) | 2016-10-12 | 2017-09-26 | 一种语音信号检测方法与装置 |
Country Status (10)
| Country | Link |
|---|---|
| US (1) | US10706874B2 (en) |
| EP (1) | EP3528251B1 (en) |
| JP (2) | JP6859499B2 (en) |
| KR (1) | KR102214888B1 (en) |
| CN (1) | CN106887241A (en) |
| MY (1) | MY201634A (en) |
| PH (1) | PH12019500784B1 (en) |
| SG (1) | SG11201903320XA (en) |
| TW (1) | TWI654601B (en) |
| WO (1) | WO2018068636A1 (en) |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106887241A (zh) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | 一种语音信号检测方法与装置 |
| CN107957918B (zh) * | 2016-10-14 | 2019-05-10 | 腾讯科技(深圳)有限公司 | 数据恢复方法和装置 |
| CN108257616A (zh) * | 2017-12-05 | 2018-07-06 | 苏州车萝卜汽车电子科技有限公司 | 人机对话的检测方法以及装置 |
| CN108305639B (zh) * | 2018-05-11 | 2021-03-09 | 南京邮电大学 | 语音情感识别方法、计算机可读存储介质、终端 |
| CN108682432B (zh) * | 2018-05-11 | 2021-03-16 | 南京邮电大学 | 语音情感识别装置 |
| CN108847217A (zh) * | 2018-05-31 | 2018-11-20 | 平安科技(深圳)有限公司 | 一种语音切分方法、装置、计算机设备及存储介质 |
| CN109545193B (zh) * | 2018-12-18 | 2023-03-14 | 百度在线网络技术(北京)有限公司 | 用于生成模型的方法和装置 |
| CN110225444A (zh) * | 2019-06-14 | 2019-09-10 | 四川长虹电器股份有限公司 | 一种麦克风阵列系统的故障检测方法及其检测系统 |
| CN111724783B (zh) * | 2020-06-24 | 2023-10-17 | 北京小米移动软件有限公司 | 智能设备的唤醒方法、装置、智能设备及介质 |
| CN113270118B (zh) * | 2021-05-14 | 2024-02-13 | 杭州网易智企科技有限公司 | 语音活动侦测方法及装置、存储介质和电子设备 |
| CN116612775A (zh) * | 2022-02-09 | 2023-08-18 | 宸芯科技股份有限公司 | 一种杂音消除方法、装置、电子设备及介质 |
| CN114792530B (zh) * | 2022-04-26 | 2025-07-04 | 美的集团(上海)有限公司 | 语音数据处理方法、装置、电子设备和存储介质 |
| CN114898774B (zh) * | 2022-05-06 | 2025-06-13 | 钉钉(中国)信息技术有限公司 | 一种音频掉点的检测方法及装置 |
| CN116863947A (zh) * | 2023-07-27 | 2023-10-10 | 海纳科德(湖北)科技有限公司 | 一种利用宠物语音信号识别情绪的方法及系统 |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101494049A (zh) * | 2009-03-11 | 2009-07-29 | 北京邮电大学 | 一种用于音频监控系统中的音频特征参数的提取方法 |
| CN101625860A (zh) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | 语音端点检测中的背景噪声自适应调整方法 |
| CN103198838A (zh) * | 2013-03-29 | 2013-07-10 | 苏州皓泰视频技术有限公司 | 一种用于嵌入式系统的异常声音监控方法和监控装置 |
| CN103544961A (zh) * | 2012-07-10 | 2014-01-29 | 中兴通讯股份有限公司 | 语音信号处理方法及装置 |
| CN103646649A (zh) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | 一种高效的语音检测方法 |
| CN106887241A (zh) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | 一种语音信号检测方法与装置 |
Family Cites Families (24)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP3297346B2 (ja) * | 1997-04-30 | 2002-07-02 | 沖電気工業株式会社 | 音声検出装置 |
| TW333610B (en) | 1997-10-16 | 1998-06-11 | Winbond Electronics Corp | The phonetic detecting apparatus and its detecting method |
| US6480823B1 (en) | 1998-03-24 | 2002-11-12 | Matsushita Electric Industrial Co., Ltd. | Speech detection for noisy conditions |
| JP3266124B2 (ja) * | 1999-01-07 | 2002-03-18 | ヤマハ株式会社 | アナログ信号中の類似波形検出装置及び同信号の時間軸伸長圧縮装置 |
| KR100463657B1 (ko) * | 2002-11-30 | 2004-12-29 | 삼성전자주식회사 | 음성구간 검출 장치 및 방법 |
| US7715447B2 (en) | 2003-12-23 | 2010-05-11 | Intel Corporation | Method and system for tone detection |
| JP5459220B2 (ja) | 2008-11-27 | 2014-04-02 | 日本電気株式会社 | 発話音声検出装置 |
| ES2371619B1 (es) | 2009-10-08 | 2012-08-08 | Telefónica, S.A. | Procedimiento de detección de segmentos de voz. |
| BR112012008671A2 (pt) | 2009-10-19 | 2016-04-19 | Ericsson Telefon Ab L M | método para detectar atividade de voz de um sinal de entrada recebido, e, detector de atividade de voz |
| KR101666521B1 (ko) * | 2010-01-08 | 2016-10-14 | 삼성전자 주식회사 | 입력 신호의 피치 주기 검출 방법 및 그 장치 |
| US20130090926A1 (en) | 2011-09-16 | 2013-04-11 | Qualcomm Incorporated | Mobile device context information using speech detection |
| CN102568457A (zh) * | 2011-12-23 | 2012-07-11 | 深圳市万兴软件有限公司 | 一种基于哼唱输入的乐曲合成方法及装置 |
| US9351089B1 (en) * | 2012-03-14 | 2016-05-24 | Amazon Technologies, Inc. | Audio tap detection |
| JP5772739B2 (ja) * | 2012-06-21 | 2015-09-02 | ヤマハ株式会社 | 音声処理装置 |
| HUE038398T2 (hu) * | 2012-08-31 | 2018-10-29 | Ericsson Telefon Ab L M | Eljárás és eszköz hang aktivitás észlelésére |
| CN103117067B (zh) * | 2013-01-19 | 2015-07-15 | 渤海大学 | 一种低信噪比下语音端点检测方法 |
| CN103177722B (zh) * | 2013-03-08 | 2016-04-20 | 北京理工大学 | 一种基于音色相似度的歌曲检索方法 |
| CN103247293B (zh) * | 2013-05-14 | 2015-04-08 | 中国科学院自动化研究所 | 一种语音数据的编码及解码方法 |
| WO2014194273A2 (en) * | 2013-05-30 | 2014-12-04 | Eisner, Mark | Systems and methods for enhancing targeted audibility |
| US9502028B2 (en) | 2013-10-18 | 2016-11-22 | Knowles Electronics, Llc | Acoustic activity detection apparatus and method |
| CN104916288B (zh) | 2014-03-14 | 2019-01-18 | 深圳Tcl新技术有限公司 | 一种音频中人声突出处理的方法及装置 |
| CN104934032B (zh) * | 2014-03-17 | 2019-04-05 | 华为技术有限公司 | 根据频域能量对语音信号进行处理的方法和装置 |
| US9406313B2 (en) * | 2014-03-21 | 2016-08-02 | Intel Corporation | Adaptive microphone sampling rate techniques |
| CN106328168B (zh) * | 2016-08-30 | 2019-10-18 | 成都普创通信技术股份有限公司 | 一种语音信号相似度检测方法 |
-
2016
- 2016-10-12 CN CN201610890946.9A patent/CN106887241A/zh active Pending
-
2017
- 2017-09-12 TW TW106131148A patent/TWI654601B/zh active
- 2017-09-26 PH PH1/2019/500784A patent/PH12019500784B1/en unknown
- 2017-09-26 JP JP2019520035A patent/JP6859499B2/ja active Active
- 2017-09-26 KR KR1020197013519A patent/KR102214888B1/ko active Active
- 2017-09-26 MY MYPI2019001999A patent/MY201634A/en unknown
- 2017-09-26 SG SG11201903320XA patent/SG11201903320XA/en unknown
- 2017-09-26 WO PCT/CN2017/103489 patent/WO2018068636A1/zh not_active Ceased
- 2017-09-26 EP EP17860814.7A patent/EP3528251B1/en active Active
-
2019
- 2019-04-10 US US16/380,609 patent/US10706874B2/en active Active
-
2020
- 2020-12-04 JP JP2020201829A patent/JP6999012B2/ja active Active
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101625860A (zh) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | 语音端点检测中的背景噪声自适应调整方法 |
| CN101494049A (zh) * | 2009-03-11 | 2009-07-29 | 北京邮电大学 | 一种用于音频监控系统中的音频特征参数的提取方法 |
| CN103544961A (zh) * | 2012-07-10 | 2014-01-29 | 中兴通讯股份有限公司 | 语音信号处理方法及装置 |
| CN103198838A (zh) * | 2013-03-29 | 2013-07-10 | 苏州皓泰视频技术有限公司 | 一种用于嵌入式系统的异常声音监控方法和监控装置 |
| CN103646649A (zh) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | 一种高效的语音检测方法 |
| CN106887241A (zh) * | 2016-10-12 | 2017-06-23 | 阿里巴巴集团控股有限公司 | 一种语音信号检测方法与装置 |
Also Published As
| Publication number | Publication date |
|---|---|
| PH12019500784A1 (en) | 2019-11-11 |
| JP2019535039A (ja) | 2019-12-05 |
| KR102214888B1 (ko) | 2021-02-15 |
| JP2021071729A (ja) | 2021-05-06 |
| US20190237097A1 (en) | 2019-08-01 |
| PH12019500784B1 (en) | 2024-02-28 |
| SG11201903320XA (en) | 2019-05-30 |
| TWI654601B (zh) | 2019-03-21 |
| CN106887241A (zh) | 2017-06-23 |
| EP3528251A1 (en) | 2019-08-21 |
| US10706874B2 (en) | 2020-07-07 |
| KR20190061076A (ko) | 2019-06-04 |
| EP3528251A4 (en) | 2019-08-21 |
| JP6999012B2 (ja) | 2022-01-18 |
| EP3528251B1 (en) | 2022-02-23 |
| JP6859499B2 (ja) | 2021-04-14 |
| MY201634A (en) | 2024-03-06 |
| TW201814692A (zh) | 2018-04-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2018068636A1 (zh) | 一种语音信号检测方法与装置 | |
| US11670325B2 (en) | Voice activity detection using a soft decision mechanism | |
| CN108766418B (zh) | 语音端点识别方法、装置及设备 | |
| US11037560B2 (en) | Method, apparatus and storage medium for wake up processing of application | |
| CN109065044B (zh) | 唤醒词识别方法、装置、电子设备及计算机可读存储介质 | |
| CN108986822A (zh) | 语音识别方法、装置、电子设备及非暂态计算机存储介质 | |
| CN108877779B (zh) | 用于检测语音尾点的方法和装置 | |
| CN111667843B (zh) | 终端设备的语音唤醒方法、系统、电子设备、存储介质 | |
| CN109741753A (zh) | 一种语音交互方法、装置、终端及服务器 | |
| CN109767784B (zh) | 鼾声识别的方法及装置、存储介质和处理器 | |
| CN114283818A (zh) | 语音交互方法、装置、设备、存储介质和程序产品 | |
| CN114333912B (zh) | 语音激活检测方法、装置、电子设备和存储介质 | |
| US20250252969A1 (en) | Audio processing method and apparatus, storage medium, and electronic device | |
| CN109994129A (zh) | 语音处理系统、方法和设备 | |
| CN108093356B (zh) | 一种啸叫检测方法及装置 | |
| CN112542157B (zh) | 语音处理方法、装置、电子设备及计算机可读存储介质 | |
| CN110085264B (zh) | 语音信号检测方法、装置、设备及存储介质 | |
| CN113436641A (zh) | 一种音乐转场时间点检测方法、设备及介质 | |
| HK1237986A (en) | Voice signal detection method and apparatus | |
| HK1237986A1 (en) | Voice signal detection method and apparatus | |
| CN113129904A (zh) | 声纹判定方法、装置、系统、设备和存储介质 | |
| TW201828285A (zh) | 音頻識別方法和系統 | |
| US11790931B2 (en) | Voice activity detection using zero crossing detection | |
| CN111883159B (zh) | 语音的处理方法及装置 | |
| US20220130405A1 (en) | Low Complexity Voice Activity Detection Algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17860814; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2019520035; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 20197013519; Country of ref document: KR; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2017860814; Country of ref document: EP; Effective date: 20190513 |