CN111583912A - Voice endpoint detection method and device and electronic equipment - Google Patents
Voice endpoint detection method and device and electronic equipment
- Publication number: CN111583912A (application CN202010458648.9A)
- Authority: CN (China)
- Prior art keywords: voice, detected, end point, text data, speech
- Prior art date: 2020-05-26
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/1815 — Speech classification or search using natural language modelling; semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L15/26 — Speech to text systems
- G10L25/87 — Detection of discrete points within a voice signal
Abstract
The application provides a voice endpoint detection method and device and electronic equipment, relates to the technical field of voice recognition, and addresses the technical problem that a user must wait a long time for a result after finishing speaking. The method comprises the following steps: acquiring a voice to be detected; determining a voice time delay based on the tail end point of the voice to be detected; and if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for detecting a speech endpoint, and an electronic device.
Background
With the development of AI technology, and in particular the adoption of voice recognition, AI applications can offer users a natural interactive mode, i.e., the intelligent voice robot. Intelligent voice robots have the advantages of low cost, easy scaling, and a uniform service experience, and are widely used across industries; the insurance industry in particular has many scenarios of communicating with customers by telephone, network voice, and similar channels.
In voice recognition and detection, detecting the starting point of valid speech is called start-point detection, and detecting the end point of valid speech is called tail-point detection. At present, tail-point detection mostly relies on silence duration, and its sensitivity is tuned by adjusting that duration. This mechanism can tolerate pauses in the user's speech, but it easily makes a normal user wait a long time for the robot to return a result after finishing speaking, causing a poor user experience.
Disclosure of Invention
The application aims to provide a voice endpoint detection method, a voice endpoint detection device and electronic equipment, so as to alleviate the technical problem of a long wait for a result after a user finishes speaking.
In a first aspect, an embodiment of the present application provides a method for detecting a voice endpoint, where the method includes:
acquiring a voice to be detected;
determining voice time delay based on the tail end point of the voice to be detected;
and if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
In one possible implementation, the step of obtaining the speech to be detected includes:
and acquiring the voice to be detected once every preset detection period.
In a possible implementation, the step of determining that the tail end point of the voice to be detected is a voice end point if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete includes:
judging whether the voice time delay exceeds the preset time threshold;
if the voice time delay exceeds the preset time threshold, determining that the tail end point of the voice to be detected is a voice end point;
if the voice time delay does not exceed the preset time threshold, judging whether the text data corresponding to the voice to be detected is semantically complete;
and if the text data is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
In one possible implementation, the step of converting the speech to be detected into the text data includes:
and recognizing the voice to be detected by using Automatic Speech Recognition (ASR), and converting the voice to be detected into ASR text data according to the recognition result.
In a possible implementation, the step of determining whether the text data corresponding to the speech to be detected is semantically complete includes:
and judging, by using a trained Natural Language Processing (NLP) neural network model, whether the text data corresponding to the voice to be detected is semantically complete.
In one possible implementation, after the step of determining whether the text data is semantically complete, the method further includes:
and if the voice time delay does not exceed the preset time threshold and the text data is not semantically complete, continuing to acquire the voice to be detected in the next preset detection period.
In a second aspect, an apparatus for detecting a voice endpoint is provided, including:
the acquisition unit is used for acquiring the voice to be detected;
the first determining unit is used for determining voice time delay based on the tail end point of the voice to be detected;
and the second determining unit is used for determining that the tail end point of the voice to be detected is a voice end point if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete.
In one possible implementation, the voice endpoint detection device is disposed at an NLP robot end or an ASR (Automatic Speech Recognition) end of a service module.
In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the method of the first aspect.
The embodiments of the application bring the following beneficial effects:
According to the voice endpoint detection method and device and the electronic equipment, the voice time delay can be determined based on the tail end point of the acquired voice to be detected; if the voice time delay exceeds the preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete, the tail end point of the voice to be detected is determined to be the voice end point.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the detailed description are briefly introduced below. The drawings described below show some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a voice endpoint detection method according to an embodiment of the present application;
fig. 2 is another schematic flow chart of a voice endpoint detection method according to an embodiment of the present application;
fig. 3 is another schematic flow chart of a voice endpoint detection method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an NLP model provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a voice endpoint detection apparatus according to an embodiment of the present application;
fig. 6 is a schematic diagram of an implementation of an NLP robot end service module provided in the embodiment of the present application;
fig. 7 is a schematic diagram of an implementation of tail-point detection at the NLP robot end according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an implementation of an ASR end service module according to an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an implementation of tail-point detection at the ASR end according to an embodiment of the present application;
fig. 10 is a schematic structural diagram illustrating an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprising" and "having," and any variations thereof, as referred to in the embodiments of the present application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
At present, the introduction of intelligent voice robots gives enterprises full-service enablement and intelligent service flows. The conversation between an intelligent voice robot and a user alternates: robot, user, robot, user. In this mode, a speech recognition engine (ASR) converts the customer's speech into text; the ASR usually uses a voice endpoint detection (VAD) module to determine whether a sentence has ended and, if so, returns the recognition result. Once the ASR returns a result, the robot considers the user's answer finished and performs semantic understanding on the recognition result to accurately understand the customer's service requirement. The dialogue tracking module then performs dialogue management, the dialogue policy module outputs the dialogue content, the natural language generation module converts that content into natural language, and finally TTS converts it into speech delivered to the user.
It should be noted that Automatic Speech Recognition (ASR) is a technology for converting speech from a telephone or internet terminal into text. Voice endpoint detection, also known as Voice Activity Detection (VAD), is a basic front-end processing step in many speech processing applications, widely used in technical scenarios such as speech coding, speaker recognition, keyword detection, and automatic speech recognition. Text To Speech (TTS) is a technique that converts text into speech output.
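The dialogue loop described above can be summarized in a short sketch. The component interfaces below (recognize, understand, update, respond, synthesize) are illustrative placeholders, not an API defined by this application:

```python
def dialogue_turn(audio_stream, asr, nlu, tracker, policy, tts):
    """One robot-user turn of the loop described above (a sketch;
    all component names are hypothetical)."""
    text = asr.recognize(audio_stream)   # ASR + VAD: decide the sentence ended, return text
    intent = nlu.understand(text)        # semantic understanding of the service requirement
    state = tracker.update(intent)       # dialogue tracking / dialogue management
    reply = policy.respond(state)        # dialogue policy + natural language generation
    return tts.synthesize(reply)         # TTS: convert the reply text to speech
```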
In the intelligent voice robot, the user's conversational voice must go through voice endpoint detection (VAD), i.e., a process of analyzing the input audio stream to determine the starting point and the ending point of the customer's speech. Common methods for voice endpoint detection fall roughly into three categories: threshold-based VAD, classifier-based VAD, and model-based VAD.
First, threshold-based VAD: speech is distinguished from non-speech by extracting time-domain features (short-time energy, short-time zero-crossing rate, etc.) or frequency-domain features (Mel-frequency cepstral coefficients, spectral entropy, etc.) and setting a reasonable threshold. This is the traditional VAD method; a sketch follows this list.
Second, classifier-based VAD: speech detection can be regarded as a binary classification problem of speech versus non-speech, and a classifier is trained by machine learning to perform the detection.
Third, acoustic-model-based VAD: on the basis of decoding with a complete acoustic model (the granularity of the modeling unit can be very coarse), speech segments and non-speech segments are distinguished using global information.
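As a concrete illustration of the first category, the sketch below flags frames as speech from short-time energy and short-time zero-crossing rate. Both threshold values and the combination rule are assumptions made for illustration, not parameters of this application:

```python
import numpy as np

def threshold_vad(frames, energy_thresh=1e-3, zcr_thresh=0.3):
    """Per-frame speech/non-speech decision from two time-domain features.

    frames: float array of shape (num_frames, frame_len), samples in [-1, 1].
    The thresholds are illustrative assumptions.
    """
    energy = np.mean(frames ** 2, axis=1)  # short-time energy per frame
    # Short-time zero-crossing rate: fraction of adjacent samples whose sign changes.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # High energy marks voiced speech; moderate energy with a high
    # zero-crossing rate is typical of unvoiced consonants.
    return (energy > energy_thresh) | ((energy > 0.1 * energy_thresh) & (zcr > zcr_thresh))
```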
The voice endpoint detection technology (VAD) treats detecting the starting point of valid speech as start-point detection and detecting the end point of valid speech as tail-point detection. Most current VAD algorithms judge the tail point by silence duration, and the sensitivity of tail-point detection is tuned by adjusting that duration. In the conversation between the intelligent voice robot and the user, the user sometimes pauses while speaking, for example between two clauses of a single request. If the VAD tail-point silence duration is set too short, the VAD treats the pause as the end of the utterance and returns only its first half. The usual accommodation is therefore to lengthen the VAD tail-point silence duration, for example to 1 second. The longer duration is compatible with mid-speech pauses, but it also makes a normal user wait longer for the robot to return a result after finishing speaking, giving a poor user experience.
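The conventional silence-duration rule just described can be sketched as follows; the frame length and silence duration are illustrative assumptions:

```python
def silence_tail_point(frame_flags, frame_ms=10, silence_ms=1000):
    """Tail point under a pure silence-duration rule.

    frame_flags: per-frame booleans (True = speech), e.g. from a VAD.
    Returns the frame index where the closing silence began, or None
    if the utterance has not yet ended.
    """
    needed = silence_ms // frame_ms              # consecutive silent frames required
    run = 0
    for i, is_speech in enumerate(frame_flags):
        run = 0 if is_speech else run + 1        # count the current silence run
        if run >= needed:
            return i - needed + 1                # tail point: start of that silence
    return None
```

Raising `silence_ms` tolerates mid-speech pauses, but it delays every normal tail-point decision by the same amount, which is exactly the tradeoff described above.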
Based on this, the embodiments of the application provide a voice endpoint detection method, a voice endpoint detection device and an electronic device; using the method can alleviate the technical problem of a long wait for the result after the user finishes speaking.
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a voice endpoint detection method according to an embodiment of the present application. As shown in fig. 1, the method includes:
step S110, acquiring the voice to be detected.
And step S120, determining the voice time delay based on the tail end point of the voice to be detected.
The voice time delay refers to the delay since the last character of the voice to be detected was output. For example, a timer (Timer) may record the delay after each character in the voice stream to be detected is output, the timer being reset every time a character is output.
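A minimal sketch of such a timer, assuming a recognizer that invokes a callback once per output character (the class and method names are hypothetical):

```python
import time

class WordTimer:
    """Tracks the delay since the recognizer last output a character."""

    def __init__(self):
        self.last_output = time.monotonic()

    def on_character(self, char):
        # Reset on every character the recognizer emits.
        self.last_output = time.monotonic()

    def delay_ms(self):
        # The voice time delay: elapsed time since the last output.
        return (time.monotonic() - self.last_output) * 1000.0
```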
Step S130, if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
In the embodiment of the application, two types of data are mainly input: the text data corresponding to the voice to be detected, and the timer information of the voice to be detected, i.e., time-related data. This step can also include other processing, such as format conversion of the text data and word-segmentation analysis of the text.
Performing tail-end-point detection by combining the voice time delay of the voice to be detected with the semantic completeness of its text data allows the voice end-point decision to begin while the customer is still speaking. This shortens the tail-point detection time, reduces the customer's waiting delay, improves the real-time performance of the system, and thereby improves the user experience. The method provided by the embodiment of the application can serve as a way to optimize VAD tail-point detection delay: the voice tail point is judged intelligently, so the system's waiting time is reduced and the user experience is improved.
The above steps are described in detail below.
In some embodiments, the step S110 may include the following steps: step a), acquiring the voice to be detected every other preset detection period.
For example, the decider may output one decision, voice end point or not, in each decision period.
In some embodiments, the step S130 may include the following steps:
step b), judging whether the voice time delay exceeds the preset time threshold;
step c), if the voice time delay exceeds the preset time threshold, determining that the tail end point of the voice to be detected is a voice end point;
step d), if the voice time delay does not exceed the preset time threshold, judging whether the text data corresponding to the voice to be detected is semantically complete;
and step e), if the text data is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
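A minimal sketch of the decision logic in steps b) to e); the fall-through of step h) below is left to the caller, which simply re-checks in the next detection period. The 800 ms default mirrors the worked example later in the description, and `semantics_complete` stands in for the trained NLP model introduced below — both are assumptions, not values fixed by this application:

```python
def is_voice_end_point(delay_ms, text, semantics_complete, threshold_ms=800):
    """Return True if the tail end point should be taken as a voice end point.

    semantics_complete: callable(text) -> bool, e.g. a trained NLP model.
    """
    if delay_ms > threshold_ms:        # steps b), c): the silence lasted long enough
        return True
    return semantics_complete(text)    # steps d), e): the sentence is already complete
```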
In some embodiments, the process of converting the speech to be detected into text data in step d) may include the following steps:
and f), recognizing the speech to be detected by using ASR, and converting the speech to be detected into ASR text data according to the recognition result.
In some embodiments, the step d) of determining whether the text data is semantically complete may include the following steps:
and g), judging, by using the trained NLP neural network model, whether the text data corresponding to the voice to be detected is semantically complete.
It should be noted that Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence, covering theories and methods for effective communication between people and intelligent devices through natural language.
As shown in fig. 2 and fig. 3, the ASR text and the timer information are taken as input, tail-point detection is then realized by the NLP model algorithm, and finally the model's calculation result is output. For example, as shown in fig. 4, the input text information first goes through embedding, converting characters into word vectors; the word vectors and the time delay are then trained through a neural network model to determine whether the current point is a voice end point. The Output in fig. 4 is the decision, produced once per decision period, of whether the current point is an end point.
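A sketch in the spirit of fig. 4 is given below. The layer types and sizes are assumptions for illustration; the application does not fix a specific architecture:

```python
import torch
import torch.nn as nn

class EndPointModel(nn.Module):
    """Embed the ASR text, encode it, and combine the encoding with the
    voice time delay to score whether the current point is an end point."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim + 1, 1)   # +1 for the delay feature

    def forward(self, token_ids, delay):
        # token_ids: (batch, seq_len) character indices; delay: (batch, 1),
        # the voice time delay, assumed normalized upstream.
        emb = self.embedding(token_ids)
        _, h = self.encoder(emb)                   # h: (1, batch, hidden_dim)
        feats = torch.cat([h[-1], delay], dim=1)
        return torch.sigmoid(self.head(feats))     # P(voice end point) per period
```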
In addition, in practical applications the NLP model is custom-trained for the application scenario. A trained NLP neural network model yields more accurate tail-point detection timing, and the method is simple and convenient to implement.
In the embodiment of the application, the intelligent voice endpoint detection method based on natural language processing uses NLP to judge the voice end point intelligently, i.e., an NLP language model performs tail-point detection. More accurate decision processing can thus begin while the customer is speaking, shortening the tail-point detection time, reducing the customer's waiting delay, and improving the real-time performance and accuracy of the system.
In some embodiments, after step d) above, the method may further comprise the steps of:
and h), if the voice time delay does not exceed the preset time threshold and the text data is not semantically complete, continuing to acquire the voice to be detected in the next preset detection period.
After the end-point judgment is performed on the input text, if the result is a voice end point, the voice end-point information is sent to the external application system; otherwise steps S110, S120 and S130 are continued, so that a detection result is output and the end-point information is fed back.
For example, the decider may use a decision model to determine whether the sentence has been finished; if so, the result is output. If not, the decider waits, and if the accumulated waiting time exceeds the preset waiting-time threshold, the result is output directly.
The following takes the voice end-point detection of a single user sentence as an example. The following is stipulated in advance:
(1) the decision device will make an output every 100ms, namely the preset detection period is 100 ms;
(2) the 800ms mute time threshold is preset, that is, the preset time threshold is 800 ms.
If the user has finished speaking at 500 ms, there is no need to wait out the preset 800 ms silence time, and feeding back the voice end point at that moment saves 300 ms, as shown in Table 1 below. If the user has not finished speaking by 800 ms, the waiting time equals the preset 800 ms, so the result is fed back directly without further waiting, as shown in Table 2 below.
Table 1  Tail-point detection example one
Table 2  Tail-point detection example two
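The arithmetic of the two examples can be reproduced with a small sketch; the 100 ms period and the 800 ms threshold come from the stipulations above, while the moment at which the NLP model first judges the sentence complete is the assumed input of each table:

```python
def waiting_time(complete_at_ms, period_ms=100, threshold_ms=800):
    """Silence the decider waits after the last character before feeding
    back the end point, given the elapsed silence at which the sentence
    is first judged semantically complete (None = never)."""
    t = 0
    while True:
        t += period_ms                              # one decision per period
        if t >= threshold_ms:
            return threshold_ms                     # fall back to the silence rule
        if complete_at_ms is not None and t >= complete_at_ms:
            return t                                # semantics complete: end early

print(waiting_time(500))   # 500 -> end point fed back 300 ms sooner (Table 1)
print(waiting_time(None))  # 800 -> wait capped at the preset threshold (Table 2)
```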
Fig. 5 provides a schematic structural diagram of a voice endpoint detection apparatus. As shown in fig. 5, the voice endpoint detection apparatus 500 includes:
an obtaining unit 501, configured to obtain a voice to be detected;
a first determining unit 502, configured to determine a voice delay based on a tail end point of a voice to be detected;
the second determining unit 503 is configured to determine that the tail end point of the to-be-detected speech is a speech end point if the speech delay exceeds the preset time threshold and/or the text data corresponding to the to-be-detected speech has complete semantics.
In some embodiments, the voice endpoint detection device is disposed at the NLP robot end or the ASR end of the service module.
In an actual service module, the VAD optimization module of the embodiment of the present application may be added at either the robot end or the ASR end.
For adding the VAD optimization module at the NLP robot end, this embodiment is implemented as follows:
In this example, the VAD optimization module is added at the NLP robot end of the service module; a diagram of the specific service module implementation is shown in fig. 6.
The robot end feeds the ASR speech-recognition text and the timer (Timer) information into the decider. The timer records the delay after each character in the text stream is output and is reset whenever a character is output. The decider uses a decision model to judge whether the sentence has been finished; if so, the decision is output. If not, the decider waits until the speech finishes, and if the accumulated waiting time exceeds the preset waiting-time threshold, the result is output directly. A specific tail-point detection implementation is shown in fig. 7.
Adding the VAD optimization module at the robot end requires no change to the original VAD-internal endpoint detection, and the NLP model can be custom-trained for the application scenario; the whole customer-service robot system is therefore easy to integrate and the implementation is simple and convenient.
For adding the VAD optimization module at the ASR end, this embodiment is implemented as follows:
In this example, the VAD optimization module is added at the ASR end of the service module; a diagram of the specific service module implementation is shown in fig. 8.
The ASR end feeds the ASR speech-recognition text and the timer (Timer) information into the decider, the timer being reset each time the VAD produces an output. The decider uses a decision model to judge whether the sentence has been finished; if so, the decision is output. If not, the decider waits until the speech finishes, and if the accumulated waiting time exceeds the preset waiting-time threshold, the result is output directly. A specific tail-point detection implementation is shown in fig. 9.
Adding the VAD optimization module at the ASR end requires modifying the ASR itself, and although the NLP model can still be custom-trained for the application scenario, this placement is less convenient for integrating the whole customer-service robot system. Its advantage is that tail-point detection timing at the ASR end is more accurate than at the NLP robot end.
In some embodiments, the obtaining unit 501 is specifically configured to:
and acquiring the voice to be detected once every preset detection period.
In some embodiments, the second determining unit 503 is specifically configured to:
judging whether the voice time delay exceeds the preset time threshold;
if the voice time delay exceeds the preset time threshold, determining that the tail end point of the voice to be detected is a voice end point;
if the voice time delay does not exceed the preset time threshold, judging whether the text data corresponding to the voice to be detected is semantically complete;
and if the text data is semantically complete, determining that the tail end point of the voice to be detected is the voice end point.
In some embodiments, the second determining unit 503 is further configured to:
and recognizing the speech to be detected by using ASR, and converting the speech to be detected into ASR text data according to the recognition result.
In some embodiments, the second determining unit 503 is further configured to:
and judging, by using the trained NLP neural network model, whether the text data corresponding to the voice to be detected is semantically complete.
In some embodiments, the second determining unit 503 is further configured to:
and if the voice time delay does not exceed the preset time threshold and the text data is not semantically complete, continuing to acquire the voice to be detected in the next preset detection period.
The voice endpoint detection apparatus provided in the embodiment of the present application has the same technical features as the voice endpoint detection method provided in the above embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved.
As shown in fig. 10, an electronic device 1000 provided in an embodiment of the present application includes: a processor 1001, a memory 1002 and a bus, wherein the memory 1002 stores machine-readable instructions executable by the processor 1001, when the electronic device runs, the processor 1001 and the memory 1002 communicate with each other through the bus, and the processor 1001 executes the machine-readable instructions to execute the steps of the voice endpoint detection method.
Specifically, the memory 1002 and the processor 1001 may be general-purpose memory and processor, and are not specifically limited herein, and the voice endpoint detection method may be executed when the processor 1001 runs a computer program stored in the memory 1002.
The processor 1001 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware or by instructions in the form of software in the processor 1001. The processor 1001 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of the hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EPROM, or a register. The storage medium is located in the memory 1002, and the processor 1001 reads the information in the memory 1002 and completes the steps of the method in combination with its hardware.
Corresponding to the voice endpoint detection method, an embodiment of the present application further provides a computer-readable storage medium, where a machine executable instruction is stored, and when the computer executable instruction is called and executed by a processor, the computer executable instruction causes the processor to execute the steps of the voice endpoint detection method.
The voice endpoint detection apparatus provided in the embodiments of the present application may be specific hardware on the device, or software or firmware installed on the device. The apparatus provided by the embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the apparatus embodiments are silent, reference may be made to the corresponding content of the foregoing method embodiments. Those skilled in the art will appreciate that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
For another example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or a part of the technical solution may be essentially implemented in the form of a software product, which is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the voice endpoint detection method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the scope of the embodiments of the present application. Are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for voice endpoint detection, the method comprising:
acquiring a voice to be detected;
determining voice time delay based on the tail end point of the voice to be detected;
and if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
2. The method of claim 1, wherein the step of obtaining the speech to be detected comprises:
and acquiring the voice to be detected once every preset detection period.
3. The method according to claim 2, wherein the step of determining that the tail end point of the voice to be detected is a voice end point if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete comprises:
judging whether the voice time delay exceeds the preset time threshold;
if the voice time delay exceeds the preset time threshold, determining that the tail end point of the voice to be detected is a voice end point;
if the voice time delay does not exceed the preset time threshold, judging whether the text data corresponding to the voice to be detected is semantically complete;
and if the text data is semantically complete, determining that the tail end point of the voice to be detected is a voice end point.
4. The method according to claim 3, wherein the step of converting the speech to be detected into the text data comprises:
and recognizing the speech to be detected by using ASR, and converting the speech to be detected into ASR text data according to the recognition result.
5. The method according to claim 3, wherein the step of determining whether the text data corresponding to the speech to be detected is semantically complete comprises:
and judging, by using the trained NLP neural network model, whether the text data corresponding to the voice to be detected is semantically complete.
6. The method according to any one of claims 3 to 5, wherein after the step of determining whether the text data is semantically complete, the method further comprises:
and if the voice time delay does not exceed the preset time threshold and the text data is not semantically complete, continuing to acquire the voice to be detected in the next preset detection period.
7. A voice endpoint detection apparatus, comprising:
the acquisition unit is used for acquiring the voice to be detected;
the first determining unit is used for determining voice time delay based on the tail end point of the voice to be detected;
and the second determining unit is used for determining that the tail end point of the voice to be detected is a voice end point if the voice time delay exceeds a preset time threshold and/or the text data corresponding to the voice to be detected is semantically complete.
8. The apparatus according to claim 7, wherein the speech endpoint detection apparatus is disposed at the NLP robot end or the ASR end of the service module.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor implements the steps of the method of any of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to execute the method of any of claims 1 to 6.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010458648.9A | 2020-05-26 | 2020-05-26 | Voice endpoint detection method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111583912A (en) | 2020-08-25 |
Family
ID=72112693
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010458648.9A | Voice endpoint detection method and device and electronic equipment | 2020-05-26 | 2020-05-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583912A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012055113A1 (en) * | 2010-10-29 | 2012-05-03 | 安徽科大讯飞信息科技股份有限公司 | Method and system for endpoint automatic detection of audio record |
US9437186B1 (en) * | 2013-06-19 | 2016-09-06 | Amazon Technologies, Inc. | Enhanced endpoint detection for speech recognition |
CN105529028A (en) * | 2015-12-09 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Voice analytical method and apparatus |
CN107919130A (en) * | 2017-11-06 | 2018-04-17 | 百度在线网络技术(北京)有限公司 | Method of speech processing and device based on high in the clouds |
CN110827795A (en) * | 2018-08-07 | 2020-02-21 | 阿里巴巴集团控股有限公司 | Voice input end judgment method, device, equipment, system and storage medium |
CN110689877A (en) * | 2019-09-17 | 2020-01-14 | 华为技术有限公司 | Voice end point detection method and device |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112069796A (en) * | 2020-09-03 | 2020-12-11 | 阳光保险集团股份有限公司 | Voice quality inspection method and device, electronic equipment and storage medium |
CN112002349A (en) * | 2020-09-25 | 2020-11-27 | 北京声智科技有限公司 | Voice endpoint detection method and device |
CN112002349B (en) * | 2020-09-25 | 2022-08-12 | 北京声智科技有限公司 | Voice endpoint detection method and device |
CN115240716A (en) * | 2021-04-23 | 2022-10-25 | 华为技术有限公司 | Voice detection method, device and storage medium |
CN113241071A (en) * | 2021-05-10 | 2021-08-10 | 湖北亿咖通科技有限公司 | Voice processing method, electronic equipment and storage medium |
CN113380275A (en) * | 2021-06-18 | 2021-09-10 | 北京声智科技有限公司 | Voice processing method and device, intelligent device and storage medium |
CN113345473A (en) * | 2021-06-24 | 2021-09-03 | 科大讯飞股份有限公司 | Voice endpoint detection method and device, electronic equipment and storage medium |
CN113345473B (en) * | 2021-06-24 | 2024-02-13 | 中国科学技术大学 | Voice endpoint detection method, device, electronic equipment and storage medium |
CN113744726A (en) * | 2021-08-23 | 2021-12-03 | 阿波罗智联(北京)科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN113838458A (en) * | 2021-09-30 | 2021-12-24 | 联想(北京)有限公司 | Parameter adjusting method and device |
CN114255742A (en) * | 2021-11-19 | 2022-03-29 | 北京声智科技有限公司 | Method, device, equipment and storage medium for voice endpoint detection |
CN115497457A (en) * | 2022-09-29 | 2022-12-20 | 贵州小爱机器人科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200825 |