CN112652296A - Streaming voice endpoint detection method, device and equipment - Google Patents

Streaming voice endpoint detection method, device and equipment

Info

Publication number
CN112652296A
Authority
CN
China
Prior art keywords
voice
detected
point
state
streaming
Prior art date
Legal status
Granted
Application number
CN202011543429.7A
Other languages
Chinese (zh)
Other versions
CN112652296B (en)
Inventor
李锴
丛继晔
沈来信
Current Assignee
Beijing Thunisoft Information Technology Co ltd
Original Assignee
Beijing Thunisoft Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Thunisoft Information Technology Co ltd
Priority to CN202011543429.7A
Publication of CN112652296A
Application granted
Publication of CN112652296B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a streaming voice endpoint detection method, apparatus, and device. The method comprises the following steps: judging, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition; and when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint.

Description

Streaming voice endpoint detection method, device and equipment
Technical Field
The present application relates to the field of speech recognition technology, and in particular to a streaming voice endpoint detection method, apparatus, and device.
Background
In the prior art, voice endpoint detection is a technique for judging whether real speech is present in given audio data, based on thresholds and statistical probability models. It is commonly used in speech recognition, speech coding/decoding, noise reduction, and gain-control algorithms to separate speech from non-speech data. In recent years, deep learning techniques have also been applied to voice endpoint detection algorithms.
In the course of implementing the prior art, the inventors found that:
Conventional voice endpoint detection methods, whether based on traditional probability models or on deep learning models, have reached a mature stage in terms of accuracy. However, in the presence of transient vocal noise or background noise, these methods may still misjudge endpoints and produce jumps; and when the start or tail point of an utterance is pronounced unclearly, the model may misjudge and lose words.
Therefore, the invention provides a streaming voice endpoint detection method, apparatus, and device that can reduce jumps caused by noise and word loss caused by unclear enunciation.
Disclosure of Invention
The embodiments of the present application provide a streaming voice endpoint detection method, apparatus, and device capable of reducing jumps caused by noise and word loss caused by unclear enunciation, aiming to solve the problems that voice endpoint detection methods misjudge endpoints and produce jumps, and that models misjudge and lose words when the start and tail points of an utterance are pronounced unclearly.
A streaming voice endpoint detection method, comprising:
judging, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint.
Further, when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint specifically comprises:
when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to a preset parameter, the point to be detected that is still in the silent state is a voice endpoint;
when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is a voice endpoint.
Further, when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to the preset parameter, the point to be detected that is still in the silent state is the starting endpoint of the voice segment.
Further, when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is the ending endpoint of the voice segment.
Further, the preset parameter is set according to the length of each frame in the streaming voice, specifically:
the length of a voice frame is set to 20 ms to 50 ms, and the preset parameter is set to 5 to 7 according to the frame length.
Further, the number of segment frames of the streaming voice is equal to the number of frames specified by the voice endpoint detection model.
Further, the voice endpoint detection model is trained using one of a feedforward sequential memory network (FSMN) and a ConvLSTM network, the latter combining a convolutional neural network (CNN) with a recurrent LSTM network.
Further, when the voice endpoint detection model is trained with the ConvLSTM network combining a CNN and an LSTM, the network consists, in order, of: an input layer, a one-dimensional convolution layer, an LSTM layer, a fully connected layer, a batch normalization layer, an activation layer, another fully connected layer, and an output layer.
A streaming voice endpoint detection apparatus, comprising:
an input module, configured to judge, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and an output module, configured to confirm that the point to be detected is a voice endpoint when the voice state of the point to be detected of the streaming voice satisfies the preset condition.
A streaming voice endpoint detection device, the voice endpoint detection device comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the streaming voice endpoint detection device to perform the voice endpoint detection method of claims 1-8.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a streaming voice endpoint detection method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a network structure of a streaming voice endpoint detection model according to an embodiment of the present application;
fig. 3 is a diagram illustrating judgment results of streaming voice endpoint detection according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a streaming voice endpoint detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, the streaming voice endpoint detection method provided in the present application includes:
s100: and judging whether the voice state of the point to be detected of the streaming voice meets a preset condition or not by using a voice endpoint detection model.
The streaming voice is either a randomly acquired segment of voice, or a segment produced through several steps and specifically prepared for endpoint detection.
Further, in a preferred embodiment provided by the present application, the streaming voice may be a randomly acquired segment of voice; or it may be produced through several steps and specifically prepared for endpoint detection. The latter is produced as follows: several spaced blank segments are randomly inserted into existing clean voice data to serve as silence segments; after the silence segments and the non-silent clean voice data are assembled into audio, noise data is randomly mixed into the audio to obtain the streaming voice. The length of each silence segment is set according to the duration of the clean voice, here randomly between 1 s and 5 s; when each silence segment is inserted into the clean voice data, a corresponding label is generated for subsequent use; the frame length of the silence segments is consistent with that of the original clean voice; and the noise intensity of the noise data may be set randomly, here between -5 dB and -25 dB. It is understood that the specific settings of the clean voice data, the silence segments, and the noise intensity do not limit the protection scope of the present application.
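As a minimal sketch of this data-construction step, the silence insertion and noise mixing could look as follows (the function name, the number of chunks, and the use of numpy are illustrative assumptions; only the 1 s to 5 s silence lengths and the -5 dB to -25 dB noise levels come from the description above):

import numpy as np

def make_streaming_voice(clean, noise, sr=16000, rng=None):
    # Insert random silence segments into clean speech, then mix in noise.
    # Returns the mixed audio and per-sample labels (1 = speech, 0 = silence).
    rng = rng or np.random.default_rng()
    pieces, labels = [], []
    # Split the clean audio into a few chunks and insert silence between them.
    for chunk in np.array_split(clean, rng.integers(2, 5)):
        pieces.append(chunk)
        labels.append(np.ones(len(chunk), dtype=np.int8))
        sil_len = int(rng.uniform(1.0, 5.0) * sr)    # silence length: 1 s to 5 s
        pieces.append(np.zeros(sil_len, dtype=clean.dtype))
        labels.append(np.zeros(sil_len, dtype=np.int8))
    audio = np.concatenate(pieces)
    label = np.concatenate(labels)
    # Mix noise at a random level 5 dB to 25 dB below the speech power,
    # i.e. the -5 dB to -25 dB noise intensity described above.
    snr_db = rng.uniform(5.0, 25.0)
    noise = np.resize(noise, audio.shape)
    scale = np.sqrt(np.mean(audio ** 2) /
                    (np.mean(noise ** 2) * 10 ** (snr_db / 10) + 1e-12))
    return audio + scale * noise, label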
The number of segment frames of the streaming voice is equal to the number of frames specified by the voice endpoint detection model.
Further, in a preferred embodiment provided by the present application, the number of segment frames of the streaming voice should be set so that the audio can be processed by the voice endpoint detection model; therefore, the number of segment frames of the streaming voice should be equal to the number of frames specified by the voice endpoint detection model.
The voice endpoint detection model is obtained by repeatedly training on streaming voice that is produced through the above steps and specifically prepared for endpoint detection. The streaming voice is divided into a training set and a validation set according to a given ratio, and mel-frequency cepstral coefficient (MFCC) features are extracted from both sets; the extracted MFCC features are then used to train either a feedforward sequential memory network (FSMN) or a ConvLSTM network combining a convolutional neural network (CNN) and a recurrent LSTM network.
Further, in a preferred embodiment provided herein, the streaming voice is divided into a training set and a validation set at a ratio of 7:3, and MFCC features are then extracted from both. The MFCC extraction process is: pre-emphasis, framing and windowing, fast Fourier transform, triangular band-pass filters, calculation of the logarithmic energy output by each filter bank, and extraction of dynamic difference parameters. The MFCC input has dimension 36 x 10: the MFCC features are 36-dimensional, and 10 is the context length in frames. With each frame set to 30 ms, combining the frames means that 300 ms of audio is used each time to predict whether the current frame is non-silent. It is to be understood that the specific numerical values used here for processing the voice data do not limit the protection scope of the present application.
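A minimal feature-extraction sketch under these assumptions (36-dimensional MFCCs, non-overlapping 30 ms frames, a 10-frame context window; the use of librosa is an assumption made here for convenience and is not named in the patent):

import numpy as np
import librosa

def mfcc_context_features(audio, sr=16000, n_mfcc=36, context=10):
    # Extract 36-dim MFCCs on 30 ms frames and stack a 10-frame context,
    # giving the 36 x 10 input (300 ms of audio per prediction) described above.
    frame_len = int(0.030 * sr)                      # 30 ms frames
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_len)
    # Slide a 10-frame context window along the frame axis.
    windows = [mfcc[:, i:i + context]
               for i in range(mfcc.shape[1] - context + 1)]
    return np.stack(windows)                         # (num_windows, 36, 10)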
Referring to fig. 2, the neural network used in the present application is a ConvLSTM network combining a convolutional neural network (CNN) and a recurrent LSTM network, trained on the extracted MFCC features. The network consists, in order, of: an input layer, a one-dimensional convolution layer, an LSTM layer, a fully connected layer, a batch normalization layer, an activation layer, another fully connected layer, and an output layer. In the embodiment of the present application, the ConvLSTM network is trained with the focal loss, whose specific formula is:
FL(p_t) = -(1 - p_t)^γ · log(p_t),
where γ is greater than or equal to zero and is set to 2 here, and (1 - p_t)^γ is a balance factor of the loss function that tilts the training toward minority-class and hard examples.
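A minimal sketch of this network and loss, assuming a TensorFlow/Keras implementation (the framework choice and the layer widths are assumptions; only the layer order and γ = 2 come from the description above):

import tensorflow as tf

def focal_loss(gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t), for binary speech/silence labels.
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
        return -tf.reduce_mean((1.0 - p_t) ** gamma * tf.math.log(p_t))
    return loss

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 36)),           # 10 context frames x 36 MFCCs
    tf.keras.layers.Conv1D(64, 3, padding="same"),   # one-dimensional convolution layer
    tf.keras.layers.LSTM(64),                        # LSTM layer
    tf.keras.layers.Dense(32),                       # fully connected layer
    tf.keras.layers.BatchNormalization(),            # batch normalization layer
    tf.keras.layers.Activation("relu"),              # activation layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # fully connected + output layer
])
model.compile(optimizer="adam", loss=focal_loss(gamma=2.0))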
Training in this way yields the voice endpoint detection model. It should be understood that the specific numerical values used here for processing the voice data do not limit the protection scope of the present application.
The voice endpoint detection model then judges whether the voice state of each point to be detected of the streaming voice satisfies the preset condition.
Further, in a preferred embodiment provided in the present application, judging the voice state of a point to be detected of the streaming voice by using the voice endpoint detection model specifically includes:
Referring to fig. 3, in the embodiment of the present application, the non-silence and silence results obtained by judging the streaming voice are represented by Arabic numerals:
if the voice endpoint detection model detects that the voice changes from silent to non-silent, or is non-silent, the output is 1;
if the voice endpoint detection model detects that the voice changes from non-silent to silent, or is silent, the output is 0.
The voice state is judged by the voice endpoint detection model, yielding a sequence of 0s and 1s as shown in fig. 3. From this sequence it can be seen that some points to be detected of the streaming voice satisfy the preset condition while others do not. The cases in which the preset condition is not satisfied fall into two categories. First, when the point to be detected does not transition from silent to non-silent or from non-silent to silent, the silent or non-silent state simply continues and no judgment is needed. Second, when the voice state of the point to be detected transitions from silent to non-silent but the number of consecutive silent and non-silent detection points on the two sides of the point is smaller than the preset parameter, or when it transitions from non-silent to silent but the number of consecutive non-silent and silent detection points on the two sides of the point is smaller than the preset parameter, the point to be detected is determined not to be a voice endpoint, but rather a jump produced by misjudgment of the voice endpoint detection method, or a misjudgment of the model, caused by unclear pronunciation, that would lose words. When the preset condition is satisfied, the voice endpoint can be confirmed. It is to be understood that the representation used for the output here does not limit the protection scope of the present application.
S200: and when the voice state of the point to be detected of the streaming voice meets the preset condition, confirming that the point to be detected is a voice endpoint. When the voice endpoint meets the preset condition, namely when the voice state of the point to be detected is changed from mute to non-mute, and the number of the detection points with continuous non-mute states of the point to be detected is greater than or equal to the preset parameter, the mute state of the point to be detected is the voice endpoint; when the voice state of the point to be detected is changed from non-mute state to mute state, and the number of the detection points with the mute state continuous of the point to be detected is greater than or equal to the preset parameter, the detection points with the number of the detection points with the mute state continuous of the point to be detected equal to the preset parameter are the voice end points. When the voice endpoint is greater than or equal to the preset parameter, the maxbuff function is also used to determine the specific frame of the voice endpoint, and in this embodiment, the preset parameter is the same as the maxbuff value.
Further, in a preferred embodiment provided by the present application, when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to the preset parameter, the first frame in maxbuff, i.e., the frame at which the point to be detected was still silent, is the starting endpoint of the voice segment.
Further, in a preferred embodiment provided by the present application, when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the last frame in maxbuff, i.e., the frame at which the count of consecutive silent detection points equals the preset parameter, is the ending endpoint of the voice segment.
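A minimal sketch of this buffered decision logic, assuming per-frame 0/1 labels as output by the model (the class and method names are illustrative; the patent only names the maxbuff buffer and the preset parameter):

from collections import deque

class EndpointDetector:
    # Streaming endpoint detector over per-frame 0/1 (silence/speech) labels.
    def __init__(self, preset_param=5):
        self.k = preset_param                    # preset parameter == maxbuff size
        self.maxbuff = deque(maxlen=preset_param)
        self.in_speech = False

    def push(self, frame_idx, label):
        # Feed one frame label; return ('start'|'end', frame index) or None.
        self.maxbuff.append((frame_idx, label))
        if len(self.maxbuff) < self.k:
            return None
        labels = [l for _, l in self.maxbuff]
        if not self.in_speech and all(l == 1 for l in labels):
            self.in_speech = True
            # Starting endpoint: taken here as the first frame in maxbuff.
            return ("start", self.maxbuff[0][0])
        if self.in_speech and all(l == 0 for l in labels):
            self.in_speech = False
            # Ending endpoint: the frame where the consecutive-silence count reaches k.
            return ("end", self.maxbuff[-1][0])
        return None

Because an isolated 0 or 1 never fills the buffer with a uniform run, short jumps caused by noise are suppressed rather than reported as endpoints.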
The setting of the preset parameter is decided according to the length of each frame in the streaming voice.
Further, in a preferred embodiment provided by the present application, the preset parameter is set according to the length of each frame in the streaming voice: the frame length is set to 20 ms to 50 ms, and the preset parameter is set to 5 to 7 accordingly. In this embodiment, the frame length of the streaming voice is set to 30 ms and the preset parameter is set to 5; that is, any 5 consecutive points to be detected in the streaming voice form one unit, and when the voice endpoint detection model outputs 5 or more consecutive 0s or 1s, part of the preset condition is satisfied. It will be understood that the specific values of the preset parameters do not limit the protection scope of the present application.
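Feeding the sketch above a label sequence like the one in fig. 3 illustrates, under these assumed values (preset parameter 5), both endpoint reporting and jump suppression:

# Hypothetical label stream: silence, speech, a brief noise-induced jump, silence.
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

det = EndpointDetector(preset_param=5)
for i, label in enumerate(labels):
    event = det.push(i, label)
    if event:
        print(event)   # prints ('start', 3), then ('end', 16);
                       # the isolated 0/1 jitter at frames 10-11 triggers nothing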
Referring to fig. 4, a streaming voice endpoint detection apparatus 100 includes:
the input module 10 determines whether the voice state of the point to be detected of the streaming voice satisfies a preset condition by using the voice endpoint detection model.
Further, in an embodiment provided in the present application, the input module 10 is configured to determine whether a voice state of a point to be detected of streaming voice satisfies a preset condition by using a voice endpoint detection model.
And the output module 11 is used for confirming that the point to be detected is a voice endpoint when the voice state of the point to be detected of the streaming voice meets the preset condition.
Further, in an embodiment provided by the present application, the output module 11 is configured to confirm that a point to be detected of the streaming voice is a voice endpoint when a voice state of the point meets a preset condition.
The voice endpoint detection apparatus here can be understood, in one specific application, as a virtual apparatus, e.g., a software product similar to a browser. The input module 10 and the output module 11 can be understood as functions that can be packaged independently.
A streaming voice endpoint detection device, the voice endpoint detection device comprising: a memory having instructions stored therein, and at least one processor.
The memory stores a voice endpoint detection program which, when executed by the processor, implements the steps of the voice endpoint detection method described in any of the above embodiments. For the method implemented when the processor executes the voice endpoint detection program, reference may be made to the embodiments of the voice endpoint detection method of the present invention, and the description is not repeated here.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A streaming voice endpoint detection method, comprising:
judging, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint.
2. The method according to claim 1, wherein, when the voice state of the point to be detected of the streaming voice satisfies the preset condition, confirming that the point to be detected is a voice endpoint specifically comprises:
when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to a preset parameter, the point to be detected that is still in the silent state is a voice endpoint;
when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is a voice endpoint.
3. The method according to claim 2, wherein, when the voice state of the point to be detected changes from silent to non-silent, and the number of consecutive non-silent detection points is greater than or equal to the preset parameter, the point to be detected that is still in the silent state is the starting endpoint of the voice segment.
4. The method according to claim 2, wherein, when the voice state of the point to be detected changes from non-silent to silent, and the number of consecutive silent detection points is greater than or equal to the preset parameter, the detection point at which the count of consecutive silent detection points equals the preset parameter is the ending endpoint of the voice segment.
5. The method according to claim 2, wherein the preset parameter is set according to the length of each frame in the streaming voice, specifically:
the length of a voice frame is set to 20 ms to 50 ms, and the preset parameter is set to 5 to 7 according to the frame length.
6. The streaming voice endpoint detection method of claim 1, wherein the number of segment frames of the streaming voice is equal to the number of frames specified by the voice endpoint detection model.
7. The streaming voice endpoint detection method of claim 1, wherein the voice endpoint detection model is trained using one of a feedforward sequential memory network (FSMN) and a ConvLSTM network, the latter combining a convolutional neural network (CNN) with a recurrent LSTM network.
8. The streaming voice endpoint detection method of claim 7, wherein, when the voice endpoint detection model is trained with the ConvLSTM network combining a CNN and an LSTM, the network consists, in order, of: an input layer, a one-dimensional convolution layer, an LSTM layer, a fully connected layer, a batch normalization layer, an activation layer, another fully connected layer, and an output layer.
9. A streaming voice endpoint detection apparatus, comprising:
an input module, configured to judge, by using a voice endpoint detection model, whether the voice state of a point to be detected of the streaming voice satisfies a preset condition;
and an output module, configured to confirm that the point to be detected is a voice endpoint when the voice state of the point to be detected of the streaming voice satisfies the preset condition.
10. A streaming voice endpoint detection device, the voice endpoint detection device comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the streaming voice endpoint detection device to perform the voice endpoint detection method of any one of claims 1 to 8.
CN202011543429.7A 2020-12-23 2020-12-23 Method, device and equipment for detecting streaming voice endpoint Active CN112652296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011543429.7A CN112652296B (en) 2020-12-23 2020-12-23 Method, device and equipment for detecting streaming voice endpoint


Publications (2)

Publication Number Publication Date
CN112652296A 2021-04-13
CN112652296B 2023-07-04

Family

ID=75360028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011543429.7A Active CN112652296B (en) 2020-12-23 2020-12-23 Method, device and equipment for detecting streaming voice endpoint

Country Status (1)

Country Link
CN (1) CN112652296B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001086633A1 (en) * 2000-05-10 2001-11-15 Multimedia Technologies Institute - Mti S.R.L. Voice activity detection and end-point detection
CN107919116A (en) * 2016-10-11 2018-04-17 芋头科技(杭州)有限公司 Voice activation detection method and device
CN108109633A (en) * 2017-12-20 2018-06-01 北京声智科技有限公司 System and method for unattended cloud speech-corpus collection and intelligent product testing
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 Voice endpoint detection method and speech recognition method
CN109119070A (en) * 2018-10-19 2019-01-01 科大讯飞股份有限公司 Voice endpoint detection method, apparatus, device and storage medium
CN109767759A (en) * 2019-02-14 2019-05-17 重庆邮电大学 End-to-end speech recognition method based on a modified CLDNN structure
US20200104709A1 (en) * 2018-09-27 2020-04-02 Deepmind Technologies Limited Stacked convolutional long short-term memory for model-free reinforcement learning
WO2020199990A1 (en) * 2019-03-29 2020-10-08 Goodix Technology (Hk) Company Limited Speech processing system and method therefor
CN111816216A (en) * 2020-08-25 2020-10-23 苏州思必驰信息科技有限公司 Voice activity detection method and device
CN111860386A (en) * 2020-07-27 2020-10-30 山东大学 Video semantic segmentation method based on ConvLSTM convolutional neural network


Also Published As

Publication number Publication date
CN112652296B (en) 2023-07-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant