WO2020253073A1 - Speech endpoint detection method, apparatus and device, and storage medium - Google Patents


Info

Publication number
WO2020253073A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
frame
detection
speech
endpoint
Prior art date
Application number
PCT/CN2019/118699
Other languages
French (fr)
Chinese (zh)
Inventor
魏韬
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2020253073A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a voice endpoint detection method, device, equipment and storage medium.
  • voice endpoint detection, that is, detecting the start and end positions of the voice.
  • current voice endpoint detection algorithms are usually only suitable for voice recognition in relatively quiet scenes. They work well for relatively stable noise (such as white noise or sirens) but poorly in noisy environments (such as public places where many people are speaking). The reason is that noise in such situations also has the characteristics of speech, so it is difficult to accurately distinguish noise from speech, which leads to a low speech recognition rate.
  • the main purpose of this application is to provide a voice endpoint detection method, apparatus, device and storage medium, aiming to solve the technical problem that poor voice endpoint detection leads to low voice recognition accuracy.
  • the present application provides a voice endpoint detection method, which includes the following steps:
  • the voice start endpoint and voice end endpoint of the input voice are determined.
  • the voice frame detection model includes: a voice model and a noise model; before the step of acquiring the input voice to be detected and a preset voice frame detection model, it further includes:
  • a preset second machine learning algorithm is used for training, and a noise model is constructed for use in detecting noise frames.
  • the sequentially inputting each voice frame of the input voice into the voice frame detection model for detection, and outputting the first detection result corresponding to each voice frame includes:
  • the first detection result corresponding to each speech frame is output, wherein if the first probability value of the speech frame being a valid speech frame is greater than the second probability value of it being a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.
  • the performing harmonic energy detection on each voice frame of the input voice in sequence to obtain the second detection result corresponding to each voice frame includes:
  • the i-th speech frame is a valid speech frame, otherwise it is a noise frame.
  • the calculation formula of the short-term speech energy is as follows:
  • M(i) represents the short-term speech energy of the i-th speech frame
  • x(n) represents the time domain signal of the speech waveform
  • w(n) represents the window function
  • yi(n) represents the i-th frame of the speech signal obtained after framing with w(n)
  • b represents the frame shift length
  • n = 1, 2, ..., L
  • i = 1, 2, ..., fn
  • L represents the frame length
  • fn represents the total number of frames after framing.
  • the determining the frame category corresponding to each speech frame based on the first detection result and the second detection result includes:
  • if the first detection result is that the voice frame is a valid voice frame and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a valid voice frame;
  • if the first detection result is that the voice frame is a valid voice frame and the second detection result is that the voice frame is a noise frame, or the first detection result is that the voice frame is a noise frame and the second detection result is that the voice frame is a valid voice frame, determining that the frame category corresponding to the voice frame is a noise frame;
  • if the first detection result is that the voice frame is a noise frame and the second detection result is that the voice frame is a noise frame, determining that the frame category corresponding to the voice frame is a noise frame.
  • the determining the voice start endpoint and the voice end endpoint of the input voice based on the frame category corresponding to each voice frame includes:
  • in a preset detection window, determining whether the frame types corresponding to the voice frames in the detection window meet a preset voice endpoint determination condition
  • the voice endpoint determination condition includes: if the ratio of valid voice frames in the current detection window exceeds the preset first ratio, determining that the voice start endpoint of the input voice exists in the current detection window; if the ratio of valid voice frames in the current detection window is lower than the preset second ratio, determining that the voice end endpoint of the input voice exists in the current detection window.
  • the present application also provides a voice endpoint detection device, the voice endpoint detection device includes:
  • the acquisition module is used to acquire the input voice to be detected and the preset voice frame detection model
  • the framing module is used to perform framing processing on the input voice to obtain multiple voice frames with time sequence;
  • the first detection module is configured to sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;
  • the second detection module is configured to perform harmonic energy detection on each voice frame of the input voice in sequence to obtain a second detection result corresponding to each voice frame;
  • a frame type determining module configured to determine a frame type corresponding to each speech frame based on the first detection result and the second detection result, the frame type including valid speech frames and noise frames;
  • the voice endpoint determination module is used to determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
  • the present application also provides a voice endpoint detection device, which includes a memory, a processor, and a voice endpoint detection program stored in the memory and executable on the processor; when the voice endpoint detection program is executed by the processor, the steps of the voice endpoint detection method described in any one of the above are implemented.
  • the present application also provides a computer-readable storage medium having a voice endpoint detection program stored thereon; when the voice endpoint detection program is executed by a processor, the steps of the voice endpoint detection method described in any one of the above are implemented.
  • This application uses a preset speech frame detection model and a harmonic energy detection method to detect each speech frame of the input speech, and then combines the two detection results to determine whether each speech frame is a valid speech frame or a noise frame; finally, based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.
  • This application integrates a variety of detection algorithms, which can improve the accuracy of voice endpoint detection to a certain extent.
  • the voice endpoint is determined according to the frame category corresponding to each voice frame, so the method can adapt to various voice recognition scenarios and improve the speech recognition accuracy rate.
  • FIG. 1 is a schematic structural diagram of an operating environment of a voice endpoint detection device involved in a solution of an embodiment of the application;
  • FIG. 2 is a schematic flowchart of an embodiment of a voice endpoint detection method according to this application.
  • FIG. 3 is a detailed flowchart of an embodiment of step S30 in FIG. 2;
  • FIG. 4 is a detailed flowchart of an embodiment of step S40 in FIG. 2;
  • FIG. 5 is a detailed flowchart of an embodiment of step S60 in FIG. 2;
  • FIG. 6 is a schematic diagram of functional modules of an embodiment of voice endpoint detection according to this application.
  • This application provides a voice endpoint detection device.
  • FIG. 1 is a schematic structural diagram of an operating environment of a voice endpoint detection device involved in a solution in an embodiment of this application.
  • the voice endpoint detection device includes: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the foregoing processor 1001.
  • the hardware structure of the voice endpoint detection device shown in FIG. 1 does not constitute a limitation on the voice endpoint detection device, which may include more or fewer components than shown in the figure, a combination of certain components, or a different arrangement of components.
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a voice endpoint detection program.
  • the operating system is a program that manages and controls the voice endpoint detection equipment and software resources, and supports the operation of the voice endpoint detection program and other software and/or programs.
  • the network interface 1004 is mainly used to access the network; the user interface 1003 is mainly used to detect and confirm instructions and edit instructions.
  • the processor 1001 may be used to call the voice endpoint detection program stored in the memory 1005, and execute the operations of the following embodiments of the voice endpoint detection method.
  • Fig. 2 is a schematic flowchart of an embodiment of a voice endpoint detection method according to the present application.
  • the voice endpoint detection method includes the following steps:
  • Step S10 acquiring the input voice to be detected and a preset voice frame detection model
  • the input voice is not limited, and it may be voice in a quiet environment or voice in various noisy environments.
  • this embodiment pre-trains a voice frame detection model, and detects the input voice through the voice frame detection model.
  • Step S20 Perform framing processing on the input voice to obtain multiple voice frames with time series;
  • a voice signal is generally non-stationary at the macro level but stationary at the micro level, that is, it has short-term stationarity (the voice signal can be considered approximately unchanged within 10-30 ms). Therefore, to reduce the influence of the overall non-stationarity and time variation of the voice signal, the signal needs to be framed during processing. That is, the voice signal is divided into short segments for processing, and each short segment is called a frame.
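The framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `frame_signal`, the 16 kHz sample rate, and the 25 ms frame length with a 10 ms shift are all assumptions chosen to fall within the 10-30 ms range mentioned in the text.

```python
def frame_signal(samples, frame_len, frame_shift):
    """Split a sequence of samples into time-ordered, possibly overlapping frames.

    frame_len and frame_shift are given in samples. Trailing samples that do
    not fill a whole frame are dropped (an assumption; padding is also common).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += frame_shift
    return frames


# Hypothetical input: one second of silence sampled at 16 kHz.
signal = [0.0] * 16000
# 25 ms frames (400 samples) with a 10 ms shift (160 samples).
frames = frame_signal(signal, frame_len=400, frame_shift=160)
```

Because the frames are appended in order of their start position, the list index preserves the time sequence that the later steps rely on.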
  • Step S30 sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;
  • existing voice endpoint detection has difficulty accurately distinguishing normal voice from noise in complex scenarios, mainly for two reasons. On the one hand, existing voice endpoint detection algorithms apply only to relatively simple scenarios: for example, detection works well for relatively stable noise (such as white noise or sirens) but poorly in noisy environments (such as public places where many people are speaking). On the other hand, existing voice endpoint detection algorithms usually detect along a single dimension only, which is prone to misjudgment.
  • this embodiment preferably adopts multiple methods to perform endpoint detection on the input voice. Since multiple detection methods are used to detect multiple dimensions, the advantages of multiple detection algorithms can be combined to make the detection result more accurate.
  • a pre-trained voice frame detection model is used to detect each voice frame of the input voice, and the first detection result corresponding to each voice frame is output, for example, the probability that a given voice frame is a valid voice frame (that is, a person's speaking voice) or the probability that it is a noise frame.
  • Step S40 performing harmonic energy detection on each voice frame of the input voice in turn, to obtain a second detection result corresponding to each voice frame;
  • each voice frame of the input speech is also detected based on the harmonic energy dimension.
  • the voice signal is a harmonic signal with energy characteristics, and the harmonic energy can be measured by the magnitude of the harmonic amplitude. If the harmonic energy is high, the harmonic amplitude is large, and if the harmonic energy is low, the harmonic amplitude is small.
  • the harmonic energy of each voice frame is detected to distinguish between valid voice frames and noise frames. Harmonic energy detection can quickly distinguish voice and noise in a quiet environment, but for a noisy environment, the accuracy of detection is reduced due to noise interference.
  • Step S50 Determine a frame category corresponding to each voice frame based on the first detection result and the second detection result, where the frame category includes valid voice frames and noise frames;
  • the currently detected voice frame is a valid voice frame or a noise frame.
  • the different detection results of the same speech frame may be all the same, or all may be different, or may be partly the same and partly different.
  • a comprehensive analysis is performed in combination with the first detection result of the model dimension and the second detection result of the harmonic energy dimension, and then the frame type corresponding to each voice frame is determined.
  • the speech frame in this embodiment has not only speech features but also speech energy features, so the comprehensive judgment result obtained from multi-dimensional detection is credible.
  • the following rules are specifically used to determine the frame category corresponding to the voice frame:
  • if the first detection result and the second detection result are both that the voice frame is a valid voice frame, the frame category corresponding to the voice frame is determined to be a valid voice frame;
  • otherwise (the first detection result is a valid voice frame and the second is a noise frame; the first is a noise frame and the second is a valid voice frame; or both are noise frames), the frame category corresponding to the voice frame is determined to be a noise frame.
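The combination rule above amounts to a logical AND of the two detection results. A minimal sketch, assuming each detection result is represented by one of the two hypothetical string labels `VALID` and `NOISE` (the text does not prescribe a representation):

```python
VALID = "valid"   # hypothetical label for a valid voice frame
NOISE = "noise"   # hypothetical label for a noise frame


def combine_results(first, second):
    """Return VALID only when both the model-based result (first) and the
    harmonic-energy result (second) are VALID; otherwise return NOISE."""
    return VALID if first == VALID and second == VALID else NOISE


def frame_categories(first_results, second_results):
    """Apply the rule frame by frame to two equally long result lists."""
    if len(first_results) != len(second_results):
        raise ValueError("both detectors must label the same number of frames")
    return [combine_results(f, s) for f, s in zip(first_results, second_results)]
```

Requiring agreement favours rejecting doubtful frames, which matches the stated aim of reducing misjudgment in noisy environments, at the cost of possibly discarding some low-energy speech.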
  • Step S60 Determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
  • the voice start endpoint corresponds to a valid voice frame
  • the voice end endpoint corresponds to a noise frame (or silence).
  • in a noisy environment, existing methods cannot be used to determine voice endpoints because of interference from external environmental noise. This embodiment specifically determines the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame. For example, if multiple consecutive voice frames are valid voice frames, it is determined that a voice start endpoint currently exists, and if multiple consecutive voice frames are noise frames, it is determined that a voice end endpoint currently exists.
  • a preset voice frame detection model and a harmonic energy detection method are used to detect each voice frame of the input voice, and then the two detection results are combined to determine whether each voice frame is a valid voice frame or a noise frame; finally, based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.
  • This embodiment integrates multiple detection algorithms, which can improve the accuracy of voice endpoint detection to a certain extent.
  • the voice endpoint is determined according to the frame category corresponding to each voice frame, so the method can adapt to various voice recognition scenarios and improve the accuracy of speech recognition.
  • using multiple voice frame detection models to perform model-dimensional voice frame detection specifically includes:
  • a voice model is constructed. Specifically, normal speech data is used as a training sample, and a preset first machine learning algorithm is used for training to construct a speech model for detecting valid speech frames.
  • a preset machine learning algorithm is used for training to construct the voice model. For example, a deep learning algorithm, a long short-term memory network model or another machine learning algorithm is used to build a model; the voice features of the normal voice data are extracted and input into the model for training, thereby constructing a voice model that can detect valid voice frames.
  • a noise model is constructed before performing voice endpoint detection. Specifically, real environmental noise is used as a training sample, and a preset second machine learning algorithm is used for training to construct a noise model for detecting noise frames.
  • a preset machine learning algorithm is used for training to construct the noise model. For example, a deep learning algorithm, a long short-term memory network model or another machine learning algorithm is used to build a model; the voice features of the noise data are extracted and input into the model for training, thereby constructing a noise model that can detect noise frames.
  • FIG. 3 is a detailed flowchart of an embodiment of step S30 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S30 further includes:
  • Step S301 sequentially input each voice frame of the input voice into the voice model for detection, and output a first probability value that each voice frame is a valid voice frame;
  • each voice frame is sequentially input into the trained voice model for detection, and the probability value that each voice frame is a valid voice frame is output.
  • Step S302 sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value for each voice frame as a noise frame;
  • each voice frame is sequentially input into the trained noise model for detection, and the probability value of each voice frame being a noise frame is output.
  • Step S303, based on the first probability value and the second probability value, output a first detection result corresponding to each speech frame, wherein if the first probability value of the speech frame being a valid speech frame is greater than the second probability value of it being a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.
  • the same speech frame is input into two different models for speech frame recognition, thereby obtaining the probability value of the speech frame being a valid speech frame and the probability value of it being a noise frame. If the probability value of being a valid speech frame is greater than the probability value of being a noise frame, the speech frame is determined to be a valid speech frame; if the probability value of being a noise frame is greater than the probability value of being a valid speech frame, the speech frame is determined to be a noise frame.
  • for example, for speech frames a, b and c, the probability values output by the speech model are 70%, 50% and 80% in sequence, and the probability values output by the noise model are 45%, 80% and 25% in sequence; it is therefore determined that speech frame a is a valid speech frame, speech frame b is a noise frame, and speech frame c is a valid speech frame.
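The comparison in step S303 can be reproduced on the example probabilities given above. This sketch assumes the two models' outputs are already available as plain numbers; the function name `first_detection` and the string labels are hypothetical, and the models themselves are not shown.

```python
def first_detection(p_speech, p_noise):
    """Label a frame "valid" when the speech model's probability is strictly
    greater than the noise model's probability, otherwise "noise"."""
    return "valid" if p_speech > p_noise else "noise"


# Probabilities for frames a, b and c, taken from the example in the text.
speech_probs = [0.70, 0.50, 0.80]
noise_probs = [0.45, 0.80, 0.25]

results = [first_detection(s, n) for s, n in zip(speech_probs, noise_probs)]
print(results)  # ['valid', 'noise', 'valid']
```

Note that the two probabilities come from independently trained models, so they need not sum to one. A tie falls to "noise" here, following the "otherwise it is a noise frame" wording of step S303.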
  • FIG. 4 is a detailed flowchart of an embodiment of step S40 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S40 further includes:
  • Step S401 extracting the short-term speech energy in the time domain of the i-th speech frame of the input speech in sequence
  • Step S402 judging whether the short-term speech energy corresponding to the i-th speech frame is greater than the preset short-term speech energy
  • Step S403 if yes, determine that the i-th speech frame is a valid speech frame, otherwise it is a noise frame.
  • Short-term speech energy refers to the speech energy of audio signals in a relatively short time.
  • the short time usually refers to one frame of speech, that is, the speech energy within one frame is called short-term energy.
  • the energy of the speech frame is usually much higher than that of the noise. Therefore, the short-term speech energy can be used to distinguish between effective speech frames and noise frames.
  • the calculation method for calculating the short-term speech energy is not limited.
  • the calculation formula of the short-term speech energy is as follows:
  • M(i) represents the short-term speech energy of the i-th speech frame
  • x(n) represents the time domain signal of the speech waveform
  • w(n) represents the window function
  • yi(n) represents the i-th frame of the speech signal obtained after framing with w(n)
  • b represents the frame shift length
  • n = 1, 2, ..., L
  • i = 1, 2, ..., fn
  • L represents the frame length
  • fn represents the total number of frames after framing.
  • after the short-term speech energy of a speech frame is calculated, it is first determined whether the short-term speech energy of the speech frame exceeds the preset short-term speech energy threshold; if so, the speech frame is determined to be a valid speech frame, otherwise it is judged to be a noise frame.
  • This embodiment detects the input speech signal from the perspective of the short-term speech energy of the speech frame, thereby determining the frame category corresponding to each frame of the input speech. Because the short-term speech energy detection method is convenient and has a high recognition accuracy rate, the efficiency of voice endpoint detection for the input voice can be greatly improved.
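Steps S401 to S403 can be sketched as below. The formula itself is not reproduced in the text, so this sketch assumes a commonly used definition that is consistent with the listed symbols: yi(n) = w(n) · x((i − 1)·b + n) and M(i) = Σ yi(n)² for n = 1..L. The sum of squares, the Hamming window and the threshold value are all assumptions, not details confirmed by the source.

```python
import math


def hamming(length):
    """Hamming window w(n) of the given length (an assumed window choice)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def short_term_energy(x, frame_len, frame_shift, window=None):
    """Return M(i) for every full frame, using 0-based frame indices.

    Frame i starts at sample i * frame_shift, which corresponds to
    (i - 1) * b in the 1-based notation of the text.
    """
    w = window if window is not None else hamming(frame_len)
    if len(x) < frame_len:
        return []
    n_frames = (len(x) - frame_len) // frame_shift + 1  # fn in the text
    energies = []
    for i in range(n_frames):
        start = i * frame_shift
        y = [w[n] * x[start + n] for n in range(frame_len)]  # yi(n)
        energies.append(sum(v * v for v in y))               # assumed M(i)
    return energies


def second_detection(energies, threshold):
    """Label each frame by comparing its energy with a preset threshold."""
    return ["valid" if e > threshold else "noise" for e in energies]


# Toy signal: four silent samples followed by four samples of amplitude 1,
# framed without overlap and with a rectangular window for easy checking.
toy = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
toy_energy = short_term_energy(toy, frame_len=4, frame_shift=4, window=[1.0] * 4)
toy_labels = second_detection(toy_energy, threshold=1.0)
```

In practice the threshold would have to be tuned to the recording level and noise floor; the text only states that it is preset.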
  • FIG. 5 is a detailed flowchart of an embodiment of step S60 in FIG. 2. Based on the foregoing embodiment, in this embodiment, the foregoing step S60 further includes:
  • Step S601 In a preset detection window, determine whether the frame type corresponding to each voice frame in the detection window meets a preset voice endpoint determination condition;
  • step S602 if it is satisfied, it is determined that the voice start endpoint or the voice end endpoint of the input voice is located in the current detection window.
  • the detection window and the proportion are combined to perform voice endpoint judgment.
  • the detection window specifically includes: a voice start endpoint detection window and a voice end endpoint detection window.
  • the size of the detection window used for the voice start endpoint judgment is different from that used for the voice end endpoint judgment.
  • the detection window used by the voice start endpoint is smaller than the detection window used by the voice end endpoint.
  • the voice endpoint determination conditions specifically include:
  • A. Voice start endpoint determination condition: if the ratio of valid voice frames in the current detection window exceeds the preset first ratio, it is determined that the voice start endpoint of the input voice exists in the current detection window;
  • B. Voice end endpoint determination condition: if the ratio of valid voice frames in the current detection window is lower than the preset second ratio, it is determined that the voice end endpoint of the input voice exists in the current detection window.
  • for example, a voice start endpoint detection window is set in advance with a size of, say, 20 frames; the number of valid voice frames in the detection window is then counted, and finally it is judged whether the ratio of valid voice frames to the total number of frames in the window exceeds the preset ratio value (such as 60%). If so, it is determined that a voice start endpoint exists in the current detection window.
  • similarly, a voice end endpoint detection window is set in advance with a size of, say, 50 frames; the number of noise frames in the detection window is then counted, and finally it is judged whether the ratio of valid voice frames to the total number of frames in the window is lower than a preset ratio (for example, 10%). If so, it is determined that a voice end endpoint exists in the current detection window.
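The two window conditions can be sketched with the example values from the text (a 20-frame start window at 60% and a 50-frame end window at 10%). Several details are assumptions because the text leaves them open: the window slides one frame at a time, the search for the end endpoint begins after the start window, and the function reports the index of the first frame of the window in which each condition is first met rather than an exact endpoint frame.

```python
def valid_ratio(window):
    """Fraction of frames in the window labelled "valid"."""
    return sum(1 for label in window if label == "valid") / len(window)


def find_endpoints(labels, start_win=20, end_win=50,
                   start_ratio=0.6, end_ratio=0.1):
    """Return (start_window_index, end_window_index); None where not found."""
    start = None
    for i in range(len(labels) - start_win + 1):
        # Condition A: the valid-frame ratio exceeds the first ratio.
        if valid_ratio(labels[i:i + start_win]) > start_ratio:
            start = i
            break
    if start is None:
        return None, None

    end = None
    for j in range(start + start_win, len(labels) - end_win + 1):
        # Condition B: the valid-frame ratio falls below the second ratio.
        if valid_ratio(labels[j:j + end_win]) < end_ratio:
            end = j
            break
    return start, end


# Hypothetical frame categories: 30 noise frames, 40 valid frames, 80 noise frames.
labels = ["noise"] * 30 + ["valid"] * 40 + ["noise"] * 80
start_idx, end_idx = find_endpoints(labels)
```

Using a smaller window for the start than for the end, as the text describes, makes the detector quick to open on speech and slow to close, so short pauses inside an utterance are less likely to be mistaken for its end.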
  • This application also provides a voice endpoint detection device.
  • FIG. 6 is a schematic diagram of functional modules of an embodiment of voice endpoint detection in this application.
  • the voice endpoint detection device includes:
  • the obtaining module 10 is used to obtain the input voice to be detected and the preset voice frame detection model
  • the framing module 20 is configured to perform framing processing on the input voice to obtain multiple voice frames with time series;
  • the first detection module 30 is configured to sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;
  • the second detection module 40 is configured to perform harmonic energy detection on each voice frame of the input voice in sequence to obtain a second detection result corresponding to each voice frame;
  • the frame type determining module 50 is configured to determine the frame type corresponding to each speech frame based on the first detection result and the second detection result, where the frame type includes valid speech frames and noise frames;
  • the voice endpoint determining module 60 is configured to determine the voice start endpoint and voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
  • a preset voice frame detection model and a harmonic energy detection method are used to detect each voice frame of the input voice, and then the two detection results are combined to determine whether each voice frame is a valid voice frame or a noise frame; finally, based on the frame category corresponding to each voice frame, the voice start endpoint and voice end endpoint of the input voice are determined.
  • This embodiment integrates multiple detection algorithms, which can improve the accuracy of voice endpoint detection to a certain extent.
  • the voice endpoint is determined according to the frame category corresponding to each voice frame, so the method can adapt to various voice recognition scenarios and improve the accuracy of speech recognition.
  • the present application also provides a computer-readable storage medium, where the computer-readable storage medium may be volatile or non-volatile, which is not specifically limited by the present application.
  • a voice endpoint detection program is stored on the computer-readable storage medium, and the voice endpoint detection program is executed by the processor to implement the steps of the voice endpoint detection method described in any of the above embodiments.
  • the method implemented when the voice endpoint detection program is executed by the processor can refer to the various embodiments of the voice endpoint detection method of the present application, so it will not be repeated.

Abstract

A speech endpoint detection method, comprising the following steps: obtaining input speech to be detected and a preset speech frame detection model (S10); performing framing processing on the input speech to obtain multiple speech frames having time sequences (S20); sequentially inputting speech frames of the input speech into the speech frame detection model for detection, and outputting first detection results corresponding to the speech frames (S30); sequentially performing harmonic energy detection on the speech frames of the input speech to obtain second detection results corresponding to the speech frames (S40); determining frame types corresponding to the speech frames on the basis of the first detection results and the second detection results (S50); and determining a speech starting endpoint and a speech ending endpoint of the input speech on the basis of the frame types corresponding to the speech frames (S60). The method improves the accuracy of speech endpoint detection.

Description

Voice endpoint detection method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 201910521084.6, filed with the Chinese Patent Office on June 17, 2019 and entitled "Voice Endpoint Detection Method, Apparatus, Device, and Storage Medium", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a voice endpoint detection method, apparatus, device, and storage medium.
Background
Existing speech recognition technology often requires voice endpoint detection, that is, detecting where the voice starts and where it ends. Current voice endpoint detection algorithms are usually suited only to speech recognition in relatively quiet scenes. They work well against relatively stable noise (such as white noise or sirens) but poorly in noisy environments (such as public places where many people are talking), because the noise in such scenes itself has the characteristics of speech. Noise and speech are therefore hard to separate accurately, which lowers the speech recognition rate.
Summary of the Invention
The main purpose of this application is to provide a voice endpoint detection method, apparatus, device, and storage medium, aiming to solve the technical problem that poor voice endpoint detection leads to low speech recognition accuracy.
To achieve the above purpose, this application provides a voice endpoint detection method, which includes the following steps:
obtaining the input voice to be detected and a preset voice frame detection model;
performing framing processing on the input voice to obtain multiple time-ordered voice frames;
sequentially inputting each voice frame of the input voice into the voice frame detection model for detection, and outputting a first detection result corresponding to each voice frame;
sequentially performing harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;
determining a frame category corresponding to each voice frame based on the first detection result and the second detection result, the frame categories including valid voice frame and noise frame;
determining a voice start endpoint and a voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
Optionally, the voice frame detection model includes a speech model and a noise model; before the step of obtaining the input voice to be detected and the preset voice frame detection model, the method further includes:
using normal speech data as training samples and training with a preset first machine learning algorithm to construct a speech model for detecting valid voice frames;
using real environmental noise as training samples and training with a preset second machine learning algorithm to construct a noise model for detecting noise frames.
Optionally, sequentially inputting each voice frame of the input voice into the voice frame detection model for detection and outputting the first detection result corresponding to each voice frame includes:
sequentially inputting each voice frame of the input voice into the speech model for detection, and outputting a first probability value that each voice frame is a valid voice frame;
sequentially inputting each voice frame of the input voice into the noise model for detection, and outputting a second probability value that each voice frame is a noise frame;
outputting the first detection result corresponding to each voice frame based on the first probability value and the second probability value, wherein a voice frame is determined to be a valid voice frame if its first probability value of being a valid voice frame is greater than its second probability value of being a noise frame, and is otherwise determined to be a noise frame.
Optionally, sequentially performing harmonic energy detection on each voice frame of the input voice to obtain the second detection result corresponding to each voice frame includes:
sequentially extracting the short-time speech energy, in the time domain, of the i-th voice frame of the input voice;
determining whether the short-time speech energy of the i-th voice frame is greater than a preset short-time speech energy;
if so, determining that the i-th voice frame is a valid voice frame, and otherwise that it is a noise frame.
Optionally, the short-time speech energy is calculated by the following formula:
Figure PCTCN2019118699-appb-000001
where M(i) denotes the short-time speech energy of the i-th voice frame; x(n) denotes the time-domain speech waveform signal; w(n) denotes the window function; y_i(n) denotes the i-th frame of the speech signal obtained after framing with w(n); b denotes the frame shift; n = 1, 2, …, L; i = 1, 2, …, f_n; L denotes the frame length; and f_n denotes the total number of frames after framing.
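The formula itself survives here only as an image reference, so the following is a hedged reconstruction from the symbol definitions, not the confirmed equation of this application. The framing relation follows directly from the roles given to x(n), w(n), b, L, and f_n. The energy expression is an assumption: it uses the common sum-of-squares definition of short-time energy. A magnitude-sum variant, which replaces the square with |y_i(n)|, is equally compatible with the definitions.

```latex
% Framing: frame i is a windowed slice of x, advanced by the frame shift b.
y_i(n) = w(n)\, x\big((i-1)\,b + n\big), \qquad 1 \le n \le L,\quad 1 \le i \le f_n

% Assumed energy form (sum of squared samples); not confirmed by the source.
M(i) = \sum_{n=1}^{L} y_i(n)^2
```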
Optionally, determining the frame category corresponding to each voice frame based on the first detection result and the second detection result includes:
if the first detection result is that the voice frame is a valid voice frame and the second detection result is that the voice frame is a valid voice frame, determining that the frame category of the voice frame is valid voice frame;
if the first detection result is that the voice frame is a valid voice frame and the second detection result is that the voice frame is a noise frame, determining that the frame category of the voice frame is noise frame;
if the first detection result is that the voice frame is a noise frame and the second detection result is that the voice frame is a valid voice frame, determining that the frame category of the voice frame is noise frame;
if the first detection result is that the voice frame is a noise frame and the second detection result is that the voice frame is a noise frame, determining that the frame category of the voice frame is noise frame.
Optionally, determining the voice start endpoint and the voice end endpoint of the input voice based on the frame category corresponding to each voice frame includes:
within a preset detection window, determining whether the frame categories of the voice frames in the detection window satisfy a preset voice endpoint determination condition;
if so, determining that the voice start endpoint or the voice end endpoint of the input voice is located within the current detection window;
wherein the voice endpoint determination condition includes: if the proportion of valid voice frames in the current detection window exceeds a preset first ratio, determining that the voice start endpoint of the input voice lies in the current detection window; and if the proportion of valid voice frames in the current detection window is lower than a preset second ratio, determining that the voice end endpoint of the input voice lies in the current detection window.
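The window-based condition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the function name `find_endpoints`, the one-frame window step, the window length, and the two ratio values are all hypothetical, and the sketch reports each endpoint as the index of the first frame of the window in which it was detected.

```python
from typing import List, Optional, Tuple


def find_endpoints(
    is_valid: List[bool],
    window: int = 10,          # assumed window length in frames
    first_ratio: float = 0.8,  # assumed "preset first ratio"
    second_ratio: float = 0.2, # assumed "preset second ratio"
) -> Tuple[Optional[int], Optional[int]]:
    """Slide a detection window over per-frame categories.

    Returns the first-frame index of the window containing the start
    endpoint and of the window containing the end endpoint (None if
    the corresponding condition is never met).
    """
    start: Optional[int] = None
    end: Optional[int] = None
    for first in range(0, len(is_valid) - window + 1):
        ratio = sum(is_valid[first:first + window]) / window
        if start is None:
            # Start endpoint: valid-frame proportion exceeds the first ratio.
            if ratio > first_ratio:
                start = first
        elif ratio < second_ratio:
            # End endpoint: valid-frame proportion drops below the second ratio.
            end = first
            break
    return start, end


# 10 noise frames, 20 valid voice frames, 15 noise frames.
labels = [False] * 10 + [True] * 20 + [False] * 15
print(find_endpoints(labels))
```

Using two different ratios gives the decision some hysteresis, so a few misclassified frames inside an utterance do not end it prematurely.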
Further, to achieve the above purpose, this application also provides a voice endpoint detection apparatus, which includes:
an acquisition module, configured to obtain the input voice to be detected and a preset voice frame detection model;
a framing module, configured to perform framing processing on the input voice to obtain multiple time-ordered voice frames;
a first detection module, configured to sequentially input each voice frame of the input voice into the voice frame detection model for detection and output a first detection result corresponding to each voice frame;
a second detection module, configured to sequentially perform harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;
a frame category determination module, configured to determine a frame category corresponding to each voice frame based on the first detection result and the second detection result, the frame categories including valid voice frame and noise frame;
a voice endpoint determination module, configured to determine a voice start endpoint and a voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
Further, to achieve the above purpose, this application also provides a voice endpoint detection device, which includes a memory, a processor, and a voice endpoint detection program that is stored in the memory and can run on the processor; when executed by the processor, the voice endpoint detection program implements the steps of the voice endpoint detection method described in any of the above.
Further, to achieve the above purpose, this application also provides a computer-readable storage medium on which a voice endpoint detection program is stored; when executed by a processor, the voice endpoint detection program implements the steps of the voice endpoint detection method described in any of the above.
This application detects each voice frame of the input voice using both a preset voice frame detection model and harmonic energy detection, then combines the two detection results to determine whether each voice frame is a valid voice frame or a noise frame. Finally, it determines the voice start endpoint and voice end endpoint of the input voice based on the frame category of each voice frame. Because it combines multiple detection algorithms, this application can improve the accuracy of voice endpoint detection to a certain extent; and because it determines the voice endpoints from the frame category of each voice frame, it can adapt to a variety of speech recognition scenarios and improve speech recognition accuracy.
Description of the Drawings
FIG. 1 is a schematic structural diagram of the operating environment of the voice endpoint detection device involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of an embodiment of the voice endpoint detection method of this application;
FIG. 3 is a detailed flowchart of an embodiment of step S30 in FIG. 2;
FIG. 4 is a detailed flowchart of an embodiment of step S40 in FIG. 2;
FIG. 5 is a detailed flowchart of an embodiment of step S60 in FIG. 2;
FIG. 6 is a schematic diagram of the functional modules of an embodiment of the voice endpoint detection apparatus of this application.
The realization of the purpose, the functional characteristics, and the advantages of this application are further described below in conjunction with the embodiments and with reference to the drawings.
Detailed Description
It should be understood that the specific embodiments described here are intended only to explain this application, not to limit it.
This application provides a voice endpoint detection device.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the operating environment of the voice endpoint detection device involved in the embodiments of this application.
As shown in FIG. 1, the voice endpoint detection device includes a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 implements connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be high-speed RAM or non-volatile memory, such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the processor 1001.
Those skilled in the art will understand that the hardware structure shown in FIG. 1 does not limit the voice endpoint detection device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a voice endpoint detection program. The operating system is a program that manages and controls the voice endpoint detection device and its software resources, and supports the running of the voice endpoint detection program and other software and/or programs.
In the hardware structure of the voice endpoint detection device shown in FIG. 1, the network interface 1004 is mainly used to access the network, and the user interface 1003 is mainly used to detect confirmation instructions, editing instructions, and the like. The processor 1001 may be used to call the voice endpoint detection program stored in the memory 1005 and perform the operations of the following embodiments of the voice endpoint detection method.
Based on the above hardware structure of the voice endpoint detection device, the embodiments of the voice endpoint detection method of this application are proposed.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the voice endpoint detection method of this application. In this embodiment, the voice endpoint detection method includes the following steps:
Step S10: obtain the input voice to be detected and a preset voice frame detection model;
This embodiment places no restriction on the input voice, which may be voice recorded in a quiet environment or in any of various noisy environments. To improve the accuracy of voice endpoint detection, this embodiment trains a voice frame detection model in advance and uses it to detect the input voice.
Step S20: perform framing processing on the input voice to obtain multiple time-ordered voice frames;
A speech signal is generally non-stationary at the macro level but stationary at the micro level: it has short-time stationarity (within 10-30 ms the signal can be regarded as approximately unchanged). To reduce the effect of the signal's overall non-stationary, time-varying nature during processing, the speech signal is therefore divided into frames. That is, the signal is split into short segments for processing, and each short segment is called a frame.
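As an illustration of framing, the sketch below splits a sample sequence into overlapping frames. It is a minimal example under stated assumptions rather than the framing procedure of this application: the function name `split_into_frames`, the 16 kHz sampling rate, the 25 ms frame length, and the 10 ms frame shift are hypothetical choices, and trailing samples that do not fill a whole frame are simply dropped.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_len: int, hop: int
) -> List[List[float]]:
    """Split samples into time-ordered frames of frame_len samples,
    advancing by hop samples between consecutive frames."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(list(samples[start:start + frame_len]))
    return frames


# Assumed setup: 1 s of audio at 16 kHz, 25 ms frames (400 samples),
# 10 ms frame shift (160 samples).
one_second = [0.0] * 16000
frames = split_into_frames(one_second, frame_len=400, hop=160)
print(len(frames))
```

With these values each frame falls inside the 10-30 ms range in which the signal is treated as approximately stationary.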
Step S30: sequentially input each voice frame of the input voice into the voice frame detection model for detection, and output a first detection result corresponding to each voice frame;
In this embodiment, it is recognized that existing voice endpoint detection struggles to distinguish normal speech from noise accurately in complex scenes, for two main reasons. First, existing voice endpoint detection algorithms apply to a narrow range of scenes: they detect well against relatively stable noise (such as white noise or sirens) but poorly in noisy environments (such as public places where many people are talking). Second, existing voice endpoint detection algorithms usually detect along a single dimension only, which makes them prone to misjudgment.
Therefore, this embodiment preferably performs endpoint detection on the input voice in multiple ways. Because multiple detection methods examine the signal along multiple dimensions, the strengths of the different detection algorithms can be combined to make the detection result more accurate.
This embodiment uses the pre-trained voice frame detection model to detect each voice frame of the input voice separately and outputs the first detection result corresponding to each voice frame, for example the probability that a given voice frame is a valid voice frame (that is, human speech) and the probability that it is a noise frame.
Step S40: sequentially perform harmonic energy detection on each voice frame of the input voice to obtain a second detection result corresponding to each voice frame;
In addition to detection along the model dimension, this embodiment also detects each voice frame of the input voice along the harmonic energy dimension. A speech signal is a harmonic signal with energy characteristics, and harmonic energy can be measured by the harmonic amplitude: high harmonic energy corresponds to a large harmonic amplitude, and low harmonic energy to a small one.
Therefore, this embodiment distinguishes valid voice frames from noise frames by detecting the harmonic energy of each voice frame. Harmonic energy detection can quickly separate speech from noise in a quiet environment; in a noisy environment, however, noise interference reduces its accuracy.
Step S50: determine a frame category corresponding to each voice frame based on the first detection result and the second detection result, the frame categories including valid voice frame and noise frame;
In this embodiment, because multiple detection algorithms are used, each voice frame of the input voice has multiple detection results, each indicating, for example, that the frame is a valid voice frame or a noise frame. The detection results for the same voice frame may all agree, may all differ, or may partly agree and partly differ. This embodiment analyzes the first detection result from the model dimension together with the second detection result from the harmonic energy dimension, and from them determines the final frame category of each voice frame. A frame classified this way has both speech features and speech energy features, so the combined judgment obtained from multi-dimensional detection is reliable.
Optionally, in an embodiment, the following rules are used to determine the frame category corresponding to a voice frame:
A. If the first detection result is that the voice frame is a valid voice frame and the second detection result is that the voice frame is a valid voice frame, the frame category of the voice frame is determined to be valid voice frame;
B. If the first detection result is that the voice frame is a valid voice frame and the second detection result is that the voice frame is a noise frame, the frame category of the voice frame is determined to be noise frame;
C. If the first detection result is that the voice frame is a noise frame and the second detection result is that the voice frame is a valid voice frame, the frame category of the voice frame is determined to be noise frame;
D. If the first detection result is that the voice frame is a noise frame and the second detection result is that the voice frame is a noise frame, the frame category of the voice frame is determined to be noise frame.
In this optional embodiment, when multiple detection models and detection algorithms are used for voice frame detection, the frame category of a voice frame is determined to be valid voice frame if and only if all detection results agree that it is a valid voice frame; otherwise the frame category is determined to be noise frame.
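Rules A to D amount to a logical AND over the two detection results. The sketch below shows this fusion; the names `FrameType` and `fuse` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class FrameType(Enum):
    VALID_VOICE = "valid_voice"
    NOISE = "noise"


def fuse(first: FrameType, second: FrameType) -> FrameType:
    """Combine the model-based result and the energy-based result.

    A frame is a valid voice frame only when both detectors say so
    (rule A); every other combination is a noise frame (rules B, C, D).
    """
    if first is FrameType.VALID_VOICE and second is FrameType.VALID_VOICE:
        return FrameType.VALID_VOICE
    return FrameType.NOISE


# Rule B: the model says valid voice, the energy check says noise.
print(fuse(FrameType.VALID_VOICE, FrameType.NOISE))
```

Requiring agreement trades some missed speech frames for fewer noise frames labelled as speech, which matches the goal of rejecting speech-like background noise.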
Step S60: determine the voice start endpoint and the voice end endpoint of the input voice based on the frame category corresponding to each voice frame.
In a relatively quiet environment, the voice start endpoint usually corresponds to a valid voice frame and the voice end endpoint to a noise frame (or silence). In a noisy environment, however, interference from external noise means that the existing approach cannot be used to determine the voice endpoints. This embodiment instead determines the voice start endpoint and voice end endpoint of the input voice from the frame category of each voice frame. For example, if multiple consecutive voice frames are valid voice frames, a voice start endpoint is determined to be present; if multiple consecutive voice frames are noise frames, a voice end endpoint is determined to be present.
This embodiment detects each voice frame of the input voice using both a preset voice frame detection model and harmonic energy detection, then combines the two detection results to determine whether each voice frame is a valid voice frame or a noise frame. Finally, it determines the voice start endpoint and voice end endpoint of the input voice based on the frame category of each voice frame. Because it combines multiple detection algorithms, this embodiment can improve the accuracy of voice endpoint detection to a certain extent; and because it determines the voice endpoints from the frame category of each voice frame, it can adapt to a variety of speech recognition scenarios and improve speech recognition accuracy.
Further, in an embodiment of the voice endpoint detection method of this application, multiple voice frame detection models are used for voice frame detection along the model dimension, specifically:
(1) Speech model
In this embodiment, a speech model is constructed before voice endpoint detection is performed. Specifically, normal speech data is used as training samples and a preset first machine learning algorithm is used for training, to construct a speech model for detecting valid voice frames.
In this embodiment, the speech model is trained on pre-collected normal speech data using a preset machine learning algorithm, for example a deep learning algorithm or a long short-term memory (LSTM) network. Speech features are extracted from the normal speech data and fed into the model for training, producing a speech model that can detect valid voice frames.
(2) Noise model
In this embodiment, a noise model is constructed before voice endpoint detection is performed. Specifically, real environmental noise is used as training samples and a preset second machine learning algorithm is used for training, to construct a noise model for detecting noise frames.
In this embodiment, the noise model is trained on pre-collected stable and unstable noise data using a preset machine learning algorithm, for example a deep learning algorithm or a long short-term memory (LSTM) network. Acoustic features are extracted from the noise data and fed into the model for training, producing a noise model that can detect noise frames.
Referring to FIG. 3, FIG. 3 is a detailed flowchart of an embodiment of step S30 in FIG. 2. Based on the above embodiment, in this embodiment step S30 further includes:
Step S301: sequentially input each voice frame of the input voice into the speech model for detection, and output a first probability value that each voice frame is a valid voice frame;
In this embodiment, the voice frames are input into the trained speech model one by one, in their time order within the input voice, and the model outputs the probability that each voice frame is a valid voice frame.
Step S302: sequentially input each voice frame of the input voice into the noise model for detection, and output a second probability value that each voice frame is a noise frame;
In this embodiment, the voice frames are input into the trained noise model one by one, in their time order within the input voice, and the model outputs the probability that each voice frame is a noise frame.
Step S303: output the first detection result corresponding to each voice frame based on the first probability value and the second probability value, wherein a voice frame is determined to be a valid voice frame if its first probability value of being a valid voice frame is greater than its second probability value of being a noise frame, and is otherwise determined to be a noise frame.
In this embodiment, the same voice frame is input into two different models for recognition, yielding the probability that the frame is a valid voice frame and the probability that it is a noise frame. If the probability of being a valid voice frame is greater than the probability of being a noise frame, the frame is determined to be a valid voice frame; if the probability of being a noise frame is greater than the probability of being a valid voice frame, the frame is determined to be a noise frame.
For example, suppose the input voice contains three voice frames a, b, and c, each of which is input into the speech model and the noise model. If the speech model outputs probability values of 70%, 50%, and 80% in turn, and the noise model outputs 45%, 80%, and 25% in turn, then frame a is determined to be a valid voice frame, frame b a noise frame, and frame c a valid voice frame.
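The comparison in step S303 can be reproduced with the numbers from the example above. In this sketch the function name `first_detection` is hypothetical, and the probability lists stand in for the outputs of the trained speech and noise models, which are not shown.

```python
from typing import List, Sequence


def first_detection(
    p_valid: Sequence[float], p_noise: Sequence[float]
) -> List[str]:
    """Label each frame by comparing the two model outputs.

    A frame is "valid" only when the speech-model probability is
    strictly greater than the noise-model probability; ties and all
    other cases are labelled "noise", as in step S303.
    """
    return [
        "valid" if pv > pn else "noise"
        for pv, pn in zip(p_valid, p_noise)
    ]


# Frames a, b, c from the worked example.
speech_model_out = [0.70, 0.50, 0.80]
noise_model_out = [0.45, 0.80, 0.25]
print(first_detection(speech_model_out, noise_model_out))
# -> ['valid', 'noise', 'valid']
```

Note that the two probabilities come from separate models and need not sum to one, which is why they are compared with each other rather than against a fixed threshold.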
Referring to FIG. 4, FIG. 4 is a detailed flowchart of an embodiment of step S40 in FIG. 2. Based on the foregoing embodiment, in this embodiment step S40 further includes:
Step S401, sequentially extract the short-term speech energy, in the time domain, of the i-th speech frame of the input speech;
Step S402, determine whether the short-term speech energy of the i-th speech frame is greater than a preset short-term speech energy;
Step S403, if so, determine that the i-th speech frame is a valid speech frame; otherwise it is a noise frame.
Short-term speech energy is the speech energy of an audio signal over a short period of time. The short period here usually means one speech frame; that is, the speech energy within one frame is called the short-term energy. Within the same utterance, the energy of speech frames is usually much higher than that of noise, so short-term speech energy can be used to distinguish valid speech frames from noise frames. This embodiment does not limit the way in which the short-term speech energy is calculated.
Optionally, in an embodiment, the short-term speech energy is calculated as follows:
Figure PCTCN2019118699-appb-000002
where M(i) denotes the short-term speech energy of the i-th speech frame; x(n) denotes the time-domain signal of the speech waveform; w(n) denotes the window function; y_i(n) denotes the i-th frame of the speech signal obtained after framing with w(n); b denotes the frame shift; n = 1, 2, …, L; i = 1, 2, …, f_n; L denotes the frame length; and f_n denotes the total number of frames after framing.
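The formula itself is published only as an image and is not reproduced in this text. For orientation, one common textbook definition of framing and short-term energy that is consistent with the symbols defined above is given below. The sum-of-squares form is an assumption; the formula in the original figure may differ (for example, it may use a sum of magnitudes).

```latex
y_i(n) = w(n)\, x\bigl((i-1)\,b + n\bigr), \qquad 1 \le n \le L,\quad 1 \le i \le f_n
\\[4pt]
M(i) = \sum_{n=1}^{L} y_i(n)^{2}
```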
In this embodiment, after the short-term speech energy of a speech frame has been calculated, it is determined whether that energy exceeds the preset short-term speech energy threshold; if so, the frame is determined to be a valid speech frame; otherwise it is determined to be a noise frame.
This embodiment examines the input speech signal in terms of the short-term speech energy of its frames, thereby determining the frame category of each speech frame of the input speech. Because short-term speech energy detection is simple and its recognition accuracy is relatively high, it can greatly improve the efficiency of speech endpoint detection on the input speech.
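Steps S401 to S403 can be sketched as follows. This is a minimal illustration under stated assumptions: the Hamming window, the sum-of-squares energy, the function names `short_time_energy` and `second_detection`, and the threshold value are all illustrative choices, since the embodiment does not limit how the energy is calculated.

```python
import math
from typing import List, Sequence


def short_time_energy(x: Sequence[float], frame_len: int, frame_shift: int) -> List[float]:
    """Return the sum-of-squares energy of each Hamming-windowed frame of x."""
    if frame_len < 2 or frame_shift < 1:
        raise ValueError("frame_len must be >= 2 and frame_shift >= 1")
    # Hamming window w(n); the choice of window is an assumption of this sketch.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    energies = []
    # Step S401: slide over the signal by the frame shift b, keeping full frames only.
    for start in range(0, len(x) - frame_len + 1, frame_shift):
        frame = x[start:start + frame_len]
        energies.append(sum((w * s) ** 2 for w, s in zip(window, frame)))
    return energies


def second_detection(energies: Sequence[float], threshold: float) -> List[str]:
    """Steps S402 and S403: a frame is valid speech if its energy exceeds the threshold."""
    return ["speech" if e > threshold else "noise" for e in energies]


# 160 silent samples followed by 160 full-scale alternating samples.
signal = [0.0] * 160 + [1.0, -1.0] * 80
labels = second_detection(short_time_energy(signal, frame_len=80, frame_shift=80), threshold=1.0)
print(labels)
```

In practice the threshold would be tuned to the recording conditions; the value used here is only for demonstration.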
Referring to FIG. 5, FIG. 5 is a detailed flowchart of an embodiment of step S60 in FIG. 2. Based on the foregoing embodiment, in this embodiment step S60 further includes:
Step S601, within a preset detection window, determine whether the frame categories of the speech frames in the detection window satisfy a preset speech endpoint determination condition;
Step S602, if so, determine that the speech start endpoint or the speech end endpoint of the input speech lies within the current detection window.
Judging a speech endpoint solely by whether an individual speech frame is a valid speech frame or a noise frame is prone to misjudgment. Therefore, this embodiment determines speech endpoints by combining a detection window with a proportion.
In this embodiment, the detection window specifically includes a speech start endpoint detection window and a speech end endpoint detection window. The window size used for determining the speech start endpoint differs from that used for determining the speech end endpoint; usually the window used for the speech start endpoint is smaller than the window used for the speech end endpoint. The sizes can be set and adjusted according to actual needs.
In this embodiment, the speech endpoint determination conditions specifically include:
A. Speech start endpoint determination condition: if the proportion of valid speech frames in the current detection window exceeds a preset first proportion, it is determined that the speech start endpoint of the input speech lies within the current detection window;
B. Speech end endpoint determination condition: if the proportion of valid speech frames in the current detection window is lower than a preset second proportion, it is determined that the speech end endpoint of the input speech lies within the current detection window.
For example, when detecting the speech start endpoint, a speech start endpoint detection window is set in advance, for example with a size of 20 frames. The number of frames in the detection window whose frame category is valid speech frame is then counted, and finally it is determined whether the ratio of valid speech frames to the total number of frames in the detection window exceeds a preset proportion (for example 60%). If so, it is determined that the speech start endpoint lies within the current detection window.
When detecting the speech end endpoint, a speech end endpoint detection window is set in advance, for example with a size of 50 frames. The number of frames in the detection window whose frame category is noise frame is then counted, and finally it is determined whether the ratio of noise frames to the total number of frames in the detection window is lower than a preset proportion (for example 10%). If so, it is determined that the speech end endpoint lies within the current detection window.
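The window-and-proportion test can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `find_endpoint_window` is hypothetical, the window is assumed to slide one frame at a time, and the first frame index of the matching window is returned because the text only states that the endpoint lies within that window. The sketch follows conditions A and B as stated, that is, it uses the proportion of valid speech frames for both endpoints.

```python
from typing import List, Optional


def find_endpoint_window(is_speech: List[bool], window: int, ratio: float,
                         mode: str, begin: int = 0) -> Optional[int]:
    """Return the first frame index of the first window that meets the condition.

    mode="start": proportion of valid speech frames must exceed `ratio` (condition A).
    mode="end":   proportion of valid speech frames must be below `ratio` (condition B).
    """
    if mode not in ("start", "end"):
        raise ValueError("mode must be 'start' or 'end'")
    for first in range(begin, len(is_speech) - window + 1):
        share = sum(is_speech[first:first + window]) / window
        if mode == "start" and share > ratio:
            return first
        if mode == "end" and share < ratio:
            return first
    return None


# 30 noise frames, 100 valid speech frames, then 80 noise frames.
frames = [False] * 30 + [True] * 100 + [False] * 80
start_win = find_endpoint_window(frames, window=20, ratio=0.6, mode="start")
end_win = find_endpoint_window(frames, window=50, ratio=0.1, mode="end", begin=start_win)
print(start_win, end_win)  # 23 126
```

Searching for the end endpoint only from the start window onward avoids matching the leading silence; this ordering is a design choice of the sketch rather than something the text prescribes.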
The present application also provides a speech endpoint detection apparatus.
Referring to FIG. 6, FIG. 6 is a schematic diagram of the functional modules of an embodiment of the speech endpoint detection apparatus of the present application. In this embodiment, the speech endpoint detection apparatus includes:
an acquisition module 10, configured to acquire input speech to be detected and a preset speech frame detection model;
a framing module 20, configured to frame the input speech to obtain a plurality of time-ordered speech frames;
a first detection module 30, configured to sequentially input the speech frames of the input speech into the speech frame detection model for detection, and output a first detection result for each speech frame;
a second detection module 40, configured to sequentially perform harmonic energy detection on the speech frames of the input speech to obtain a second detection result for each speech frame;
a frame category determination module 50, configured to determine a frame category for each speech frame based on the first detection result and the second detection result, the frame categories including valid speech frames and noise frames;
a speech endpoint determination module 60, configured to determine the speech start endpoint and the speech end endpoint of the input speech based on the frame categories of the speech frames.
Because the apparatus rests on the same description as the embodiments of the speech endpoint detection method of the present application above, the embodiments of the speech endpoint detection apparatus are not described further here.
In this embodiment, a preset speech frame detection model and harmonic energy detection are each used to examine every speech frame of the input speech, and the two detection results are then combined to determine whether each speech frame is a valid speech frame or a noise frame. Finally, the speech start endpoint and the speech end endpoint of the input speech are determined based on the frame categories of the speech frames. Because this embodiment combines multiple detection algorithms, it can improve the accuracy of speech endpoint detection to a certain extent; and because the present application determines the speech endpoints from the frame categories of the speech frames, it can adapt to various speech recognition scenarios and improve speech recognition accuracy.
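The combination of the two detection results can be sketched as follows. This is a minimal illustration that assumes the rule set out in claim 6, under which a frame is a valid speech frame only when both detections agree that it is; the function name `fuse_results` and the string labels are hypothetical.

```python
from typing import List


def fuse_results(first: List[str], second: List[str]) -> List[str]:
    """Combine per-frame results: valid speech only if both detections say so."""
    if len(first) != len(second):
        raise ValueError("both detections must cover the same frames")
    return ["speech" if a == "speech" and b == "speech" else "noise"
            for a, b in zip(first, second)]


# Model-based result and energy-based result for four frames.
print(fuse_results(["speech", "speech", "noise", "noise"],
                   ["speech", "noise", "speech", "noise"]))
# ['speech', 'noise', 'noise', 'noise']
```

Requiring agreement makes the combined decision conservative: either detector alone can veto a frame, which reduces false speech detections at the cost of possibly missing weak speech.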
The present application also provides a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile, which is not limited in the present application.
In this embodiment, a speech endpoint detection program is stored on the computer-readable storage medium, and when executed by a processor the speech endpoint detection program implements the steps of the speech endpoint detection method described in any of the above embodiments. For the method implemented when the speech endpoint detection program is executed by the processor, reference may be made to the embodiments of the speech endpoint detection method of the present application, so it is not described again here.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific embodiments described, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art can devise many further forms without departing from the purpose of the present application and the scope protected by the claims. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present application, or any direct or indirect application in other related technical fields, falls within the protection of the present application.

Claims (20)

  1. A speech endpoint detection method, comprising the following steps:
    acquiring input speech to be detected and a preset speech frame detection model;
    framing the input speech to obtain a plurality of time-ordered speech frames;
    sequentially inputting the speech frames of the input speech into the speech frame detection model for detection, and outputting a first detection result for each speech frame;
    sequentially performing harmonic energy detection on the speech frames of the input speech to obtain a second detection result for each speech frame;
    determining a frame category for each speech frame based on the first detection result and the second detection result, the frame categories including valid speech frames and noise frames;
    determining a speech start endpoint and a speech end endpoint of the input speech based on the frame categories of the speech frames.
  2. The speech endpoint detection method according to claim 1, wherein the speech frame detection model comprises a speech model and a noise model, and before the step of acquiring the input speech to be detected and the preset speech frame detection model, the method further comprises:
    training with normal speech data as training samples, using a preset first machine learning algorithm, to construct the speech model for detecting valid speech frames;
    training with real environmental noise as training samples, using a preset second machine learning algorithm, to construct the noise model for detecting noise frames.
  3. The speech endpoint detection method according to claim 2, wherein sequentially inputting the speech frames of the input speech into the speech frame detection model for detection and outputting a first detection result for each speech frame comprises:
    sequentially inputting the speech frames of the input speech into the speech model for detection, and outputting a first probability value that each speech frame is a valid speech frame;
    sequentially inputting the speech frames of the input speech into the noise model for detection, and outputting a second probability value that each speech frame is a noise frame;
    outputting the first detection result for each speech frame based on the first probability value and the second probability value, wherein if the first probability value that the speech frame is a valid speech frame is greater than the second probability value that it is a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.
  4. The speech endpoint detection method according to claim 1, wherein sequentially performing harmonic energy detection on the speech frames of the input speech to obtain a second detection result for each speech frame comprises:
    sequentially extracting the short-term speech energy, in the time domain, of the i-th speech frame of the input speech;
    determining whether the short-term speech energy of the i-th speech frame is greater than a preset short-term speech energy;
    if so, determining that the i-th speech frame is a valid speech frame; otherwise it is a noise frame.
  5. The speech endpoint detection method according to claim 4, wherein the short-term speech energy is calculated as follows:
    Figure PCTCN2019118699-appb-100001
    where M(i) denotes the short-term speech energy of the i-th speech frame; x(n) denotes the time-domain signal of the speech waveform; w(n) denotes the window function; y_i(n) denotes the i-th frame of the speech signal obtained after framing with w(n); b denotes the frame shift; n = 1, 2, …, L; i = 1, 2, …, f_n; L denotes the frame length; and f_n denotes the total number of frames after framing.
  6. The speech endpoint detection method according to claim 3, wherein determining a frame category for each speech frame based on the first detection result and the second detection result comprises:
    if the first detection result is that the speech frame is a valid speech frame and the second detection result is that the speech frame is a valid speech frame, determining that the frame category of the speech frame is valid speech frame;
    if the first detection result is that the speech frame is a valid speech frame and the second detection result is that the speech frame is a noise frame, determining that the frame category of the speech frame is noise frame;
    if the first detection result is that the speech frame is a noise frame and the second detection result is that the speech frame is a valid speech frame, determining that the frame category of the speech frame is noise frame;
    if the first detection result is that the speech frame is a noise frame and the second detection result is that the speech frame is a noise frame, determining that the frame category of the speech frame is noise frame.
  7. The speech endpoint detection method according to claim 1, wherein determining the speech start endpoint and the speech end endpoint of the input speech based on the frame categories of the speech frames comprises:
    within a preset detection window, determining whether the frame categories of the speech frames in the detection window satisfy a preset speech endpoint determination condition;
    if so, determining that the speech start endpoint or the speech end endpoint of the input speech lies within the current detection window;
    wherein the speech endpoint determination condition comprises: if the proportion of valid speech frames in the current detection window exceeds a preset first proportion, determining that the speech start endpoint of the input speech lies within the current detection window; and if the proportion of valid speech frames in the current detection window is lower than a preset second proportion, determining that the speech end endpoint of the input speech lies within the current detection window.
  8. A speech endpoint detection apparatus, comprising:
    an acquisition module, configured to acquire input speech to be detected and a preset speech frame detection model;
    a framing module, configured to frame the input speech to obtain a plurality of time-ordered speech frames;
    a first detection module, configured to sequentially input the speech frames of the input speech into the speech frame detection model for detection, and output a first detection result for each speech frame;
    a second detection module, configured to sequentially perform harmonic energy detection on the speech frames of the input speech to obtain a second detection result for each speech frame;
    a frame category determination module, configured to determine a frame category for each speech frame based on the first detection result and the second detection result, the frame categories including valid speech frames and noise frames;
    a speech endpoint determination module, configured to determine a speech start endpoint and a speech end endpoint of the input speech based on the frame categories of the speech frames.
  9. The speech endpoint detection apparatus according to claim 8, wherein the speech frame detection model comprises a speech model and a noise model, and the speech endpoint detection apparatus further comprises:
    a speech model training module, configured to train with normal speech data as training samples, using a preset first machine learning algorithm, to construct the speech model for detecting valid speech frames;
    a noise model training module, configured to train with real environmental noise as training samples, using a preset second machine learning algorithm, to construct the noise model for detecting noise frames.
  10. The speech endpoint detection apparatus according to claim 9, wherein the first detection module is specifically configured to:
    sequentially input the speech frames of the input speech into the speech model for detection, and output a first probability value that each speech frame is a valid speech frame;
    sequentially input the speech frames of the input speech into the noise model for detection, and output a second probability value that each speech frame is a noise frame;
    output the first detection result for each speech frame based on the first probability value and the second probability value, wherein if the first probability value that the speech frame is a valid speech frame is greater than the second probability value that it is a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.
  11. The speech endpoint detection apparatus according to claim 8, wherein the second detection module is specifically configured to:
    sequentially extract the short-term speech energy, in the time domain, of the i-th speech frame of the input speech;
    determine whether the short-term speech energy of the i-th speech frame is greater than a preset short-term speech energy;
    if so, determine that the i-th speech frame is a valid speech frame; otherwise it is a noise frame.
  12. The speech endpoint detection apparatus according to claim 11, wherein the short-term speech energy is calculated as follows:
    Figure PCTCN2019118699-appb-100002
    where M(i) denotes the short-term speech energy of the i-th speech frame; x(n) denotes the time-domain signal of the speech waveform; w(n) denotes the window function; y_i(n) denotes the i-th frame of the speech signal obtained after framing with w(n); b denotes the frame shift; n = 1, 2, …, L; i = 1, 2, …, f_n; L denotes the frame length; and f_n denotes the total number of frames after framing.
  13. The speech endpoint detection apparatus according to claim 10, wherein the frame category determination module is specifically configured to:
    if the first detection result is that the speech frame is a valid speech frame and the second detection result is that the speech frame is a valid speech frame, determine that the frame category of the speech frame is valid speech frame;
    if the first detection result is that the speech frame is a valid speech frame and the second detection result is that the speech frame is a noise frame, determine that the frame category of the speech frame is noise frame;
    if the first detection result is that the speech frame is a noise frame and the second detection result is that the speech frame is a valid speech frame, determine that the frame category of the speech frame is noise frame;
    if the first detection result is that the speech frame is a noise frame and the second detection result is that the speech frame is a noise frame, determine that the frame category of the speech frame is noise frame.
  14. The speech endpoint detection apparatus according to claim 8, wherein the speech endpoint determination module is specifically configured to:
    within a preset detection window, determine whether the frame categories of the speech frames in the detection window satisfy a preset speech endpoint determination condition;
    if so, determine that the speech start endpoint or the speech end endpoint of the input speech lies within the current detection window;
    wherein the speech endpoint determination condition comprises: if the proportion of valid speech frames in the current detection window exceeds a preset first proportion, determining that the speech start endpoint of the input speech lies within the current detection window; and if the proportion of valid speech frames in the current detection window is lower than a preset second proportion, determining that the speech end endpoint of the input speech lies within the current detection window.
  15. A speech endpoint detection device, comprising a memory, a processor, and a speech endpoint detection program stored on the memory and executable on the processor, wherein the speech endpoint detection program, when executed by the processor, implements the following steps of a speech endpoint detection method:
    acquiring input speech to be detected and a preset speech frame detection model;
    framing the input speech to obtain a plurality of time-ordered speech frames;
    sequentially inputting the speech frames of the input speech into the speech frame detection model for detection, and outputting a first detection result for each speech frame;
    sequentially performing harmonic energy detection on the speech frames of the input speech to obtain a second detection result for each speech frame;
    determining a frame category for each speech frame based on the first detection result and the second detection result, the frame categories including valid speech frames and noise frames;
    determining a speech start endpoint and a speech end endpoint of the input speech based on the frame categories of the speech frames.
  16. The speech endpoint detection device according to claim 15, wherein the speech frame detection model comprises a speech model and a noise model, and the speech endpoint detection program, when executed by the processor, further implements the following steps of the speech endpoint detection method:
    training with normal speech data as training samples, using a preset first machine learning algorithm, to construct the speech model for detecting valid speech frames;
    training with real environmental noise as training samples, using a preset second machine learning algorithm, to construct the noise model for detecting noise frames.
  17. The speech endpoint detection device according to claim 16, wherein when the speech endpoint detection program is executed by the processor to implement the step of sequentially inputting the speech frames of the input speech into the speech frame detection model for detection and outputting a first detection result for each speech frame, the following steps are further included:
    sequentially inputting the speech frames of the input speech into the speech model for detection, and outputting a first probability value that each speech frame is a valid speech frame;
    sequentially inputting the speech frames of the input speech into the noise model for detection, and outputting a second probability value that each speech frame is a noise frame;
    outputting the first detection result for each speech frame based on the first probability value and the second probability value, wherein if the first probability value that the speech frame is a valid speech frame is greater than the second probability value that it is a noise frame, the speech frame is determined to be a valid speech frame; otherwise it is a noise frame.
  18. A computer-readable storage medium having a speech endpoint detection program stored thereon, wherein the speech endpoint detection program, when executed by a processor, implements the following steps of a speech endpoint detection method:
    acquiring input speech to be detected and a preset speech frame detection model;
    framing the input speech to obtain a plurality of time-ordered speech frames;
    sequentially inputting the speech frames of the input speech into the speech frame detection model for detection, and outputting a first detection result for each speech frame;
    sequentially performing harmonic energy detection on the speech frames of the input speech to obtain a second detection result for each speech frame;
    determining a frame category for each speech frame based on the first detection result and the second detection result, the frame categories including valid speech frames and noise frames;
    determining a speech start endpoint and a speech end endpoint of the input speech based on the frame categories of the speech frames.
  19. The computer-readable storage medium of claim 18, wherein the speech frame detection model comprises a speech model and a noise model, and the speech endpoint detection program, when executed by the processor, further implements the following steps of the speech endpoint detection method:
    training with a preset first machine learning algorithm, using normal speech data as training samples, to construct the speech model for detecting valid speech frames;
    training with a preset second machine learning algorithm, using real environmental noise as training samples, to construct the noise model for detecting noise frames.
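As a rough illustration of building two separate models from two kinds of training data, the sketch below fits one single-variable Gaussian per class. This is a stand-in only: the claim does not name the first or second machine learning algorithm, and the scalar feature (for instance a per-frame log energy), the class `GaussianModel`, and the function `fit_gaussian` are hypothetical. The returned scores are likelihood densities rather than calibrated probabilities.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GaussianModel:
    """One-dimensional Gaussian over a scalar frame feature (hypothetical stand-in)."""
    mean: float
    var: float

    def likelihood(self, x: float) -> float:
        # Gaussian probability density evaluated at x.
        norm = math.sqrt(2.0 * math.pi * self.var)
        return math.exp(-((x - self.mean) ** 2) / (2.0 * self.var)) / norm


def fit_gaussian(features: Sequence[float], var_floor: float = 1e-6) -> GaussianModel:
    """Estimate mean and variance by maximum likelihood from training features."""
    if not features:
        raise ValueError("at least one training sample is required")
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    # Floor the variance so a constant training set does not divide by zero.
    return GaussianModel(mean=mean, var=max(var, var_floor))


if __name__ == "__main__":
    speech_model = fit_gaussian([4.0, 5.0, 6.0])  # features from normal speech data
    noise_model = fit_gaussian([0.0, 1.0, 2.0])   # features from real environmental noise
    x = 4.5
    print(speech_model.likelihood(x) > noise_model.likelihood(x))
```

Training the two models independently lets each one be fitted on data that is easy to collect for its class; the per-frame scores can then be compared directly.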
  20. The computer-readable storage medium of claim 19, wherein, when the speech endpoint detection program is executed by the processor to implement the step of inputting each speech frame of the input speech into the speech frame detection model in sequence for detection and outputting the first detection result corresponding to each speech frame, the following steps are further included:
    inputting each speech frame of the input speech into the speech model in sequence for detection, and outputting a first probability value that each speech frame is a valid speech frame;
    inputting each speech frame of the input speech into the noise model in sequence for detection, and outputting a second probability value that each speech frame is a noise frame;
    outputting, based on the first probability value and the second probability value, the first detection result corresponding to each speech frame, wherein if the first probability value that a speech frame is a valid speech frame is greater than the second probability value that it is a noise frame, the speech frame is determined to be a valid speech frame, and otherwise it is determined to be a noise frame.
PCT/CN2019/118699 2019-06-17 2019-11-15 Speech endpoint detection method, apparatus and device, and storage medium WO2020253073A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910521084.6A CN110335593A (en) 2019-06-17 2019-06-17 Sound end detecting method, device, equipment and storage medium
CN201910521084.6 2019-06-17

Publications (1)

Publication Number Publication Date
WO2020253073A1 true WO2020253073A1 (en) 2020-12-24

Family

ID=68141111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118699 WO2020253073A1 (en) 2019-06-17 2019-11-15 Speech endpoint detection method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN110335593A (en)
WO (1) WO2020253073A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335593A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Sound end detecting method, device, equipment and storage medium
CN110600010B (en) * 2019-09-20 2022-05-17 度小满科技(北京)有限公司 Corpus extraction method and apparatus
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment
CN110970051A (en) * 2019-12-06 2020-04-07 广州国音智能科技有限公司 Voice data acquisition method, terminal and readable storage medium
CN110967685B (en) * 2019-12-09 2022-03-22 Oppo广东移动通信有限公司 Method and system for evaluating interference signal, electronic device and storage medium
CN111862951B (en) * 2020-07-23 2024-01-26 海尔优家智能科技(北京)有限公司 Voice endpoint detection method and device, storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308653A (en) * 2008-07-17 2008-11-19 安徽科大讯飞信息科技股份有限公司 End-point detecting method applied to speech identification system
CN105513614A (en) * 2015-12-03 2016-04-20 广东顺德中山大学卡内基梅隆大学国际联合研究院 Voice activation detection method based on noise power spectrum density Gamma distribution statistical model
US20160261749A1 (en) * 2015-03-05 2016-09-08 Raytheon Company Methods and apparatus for reducing audio conference noise using voice quality measures
CN106356076A (en) * 2016-09-09 2017-01-25 北京百度网讯科技有限公司 Method and device for detecting voice activity on basis of artificial intelligence
CN108346425A (en) * 2017-01-25 2018-07-31 北京搜狗科技发展有限公司 A kind of method and apparatus of voice activity detection, the method and apparatus of speech recognition
CN110335593A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Sound end detecting method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859749A (en) * 2017-11-30 2019-06-07 阿里巴巴集团控股有限公司 A kind of voice signal recognition methods and device
CN108877776B (en) * 2018-06-06 2023-05-16 平安科技(深圳)有限公司 Voice endpoint detection method, device, computer equipment and storage medium
CN109036471B (en) * 2018-08-20 2020-06-30 百度在线网络技术(北京)有限公司 Voice endpoint detection method and device
CN109801646B (en) * 2019-01-31 2021-11-16 嘉楠明芯(北京)科技有限公司 Voice endpoint detection method and device based on fusion features

Also Published As

Publication number Publication date
CN110335593A (en) 2019-10-15

Similar Documents

Publication Publication Date Title
WO2020253073A1 (en) Speech endpoint detection method, apparatus and device, and storage medium
WO2019101123A1 (en) Voice activity detection method, related device, and apparatus
CN103811003B (en) A kind of audio recognition method and electronic equipment
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
WO2019232884A1 (en) Voice endpoint detection method and apparatus, computer device and storage medium
CN108172242B (en) Improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method
WO2018107874A1 (en) Method and apparatus for automatically controlling gain of audio data
WO2016008311A1 (en) Method and device for detecting audio signal according to frequency domain energy
KR20160024858A (en) Voice data recognition method, device and server for distinguishing regional accent
WO2021093380A1 (en) Noise processing method and apparatus, and system
WO2022105570A1 (en) Speech endpoint detection method, apparatus and device, and computer readable storage medium
CN105118522A (en) Noise detection method and device
CN103996399B (en) Speech detection method and system
CN112951259A (en) Audio noise reduction method and device, electronic equipment and computer readable storage medium
WO2021051566A1 (en) Machine-synthesized speech recognition method, apparatus, electronic device, and storage medium
WO2022218254A1 (en) Voice signal enhancement method and apparatus, and electronic device
CN106571138B (en) Signal endpoint detection method, detection device and detection equipment
CN109994129A (en) Speech processing system, method and apparatus
CN109389993A (en) A kind of data under voice method, apparatus, equipment and storage medium
WO2017128910A1 (en) Method, apparatus and electronic device for determining speech presence probability
WO2023193573A1 (en) Audio processing method and apparatus, storage medium, and electronic device
CN110895930A (en) Voice recognition method and device
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
WO2022199461A1 (en) Method for testing speech interaction system, audio recognition method, and related devices
TW200811833A (en) Detection method for voice activity endpoint

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application Ref document number: 19933968; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE
122 EP: PCT application non-entry in European phase Ref document number: 19933968; Country of ref document: EP; Kind code of ref document: A1