CN111128166A - Optimization method and device for continuous awakening recognition function - Google Patents

Optimization method and device for continuous awakening recognition function

Info

Publication number
CN111128166A
Authority
CN
China
Prior art keywords
voice
audio
recognition result
voice recognition
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911379635.6A
Other languages
Chinese (zh)
Other versions
CN111128166B (en)
Inventor
李路天
甘津瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN201911379635.6A
Publication of CN111128166A
Application granted
Publication of CN111128166B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5683 Storage of data provided by user terminals, i.e. reverse caching
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The invention discloses a method and a device for optimizing a continuous wake-up recognition function. The method comprises: continuously receiving audio until a wake-up word is detected; performing speech recognition on the first audio, which contains the wake-up word, to form a first speech recognition result, while caching, within a preset time, second audio received after the first audio; determining whether the first speech recognition result contains speech other than the wake-up word; if it does not, determining whether voice activity detection on the second audio has timed out; if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result; and if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result through a callback. The scheme optimizes the existing continuous wake-up recognition function and improves the user experience.

Description

Optimization method and device for continuous awakening recognition function
Technical Field
The invention belongs to the technical field of voice awakening recognition, and particularly relates to a method and a device for optimizing a continuous awakening recognition function.
Background
In the related art, OneShot, colloquially described as "saying it all in one breath", integrates the wake-up word with speech semantic recognition, so that there is no interval, no delay, and a seamless connection between the wake-up word and the voice command. It abandons the traditional question-and-answer mode, greatly reduces the steps a user needs for voice control, and simplifies information feedback. The operation is simple for the user, although the design behind it is not.
OneShot integrates recognition, wake-up, and semantic understanding, which keeps voice interaction unified and continuous from wake-up to completed control. The user can issue an instruction directly instead of first opening the interaction with a question-and-answer exchange, as earlier voice interaction methods required. Because the wake-up word and the command are handled within a single utterance, OneShot is considerably more efficient than traditional voice interaction.
Prior-art technologies similar to OneShot include one vendor's "wake-up recognition" and another vendor's "wake-up recognition in one continuous utterance".
In the course of implementing the present application, the inventors found that these technologies do not disclose a remedy for cases in which OneShot fails. Although they realize the OneShot function in a relatively ideal acoustic environment, only the wake-up word is recognized and the command word is discarded when any of the following occurs:
1) acoustic echo cancellation (AEC) is incomplete;
2) the ambient noise is loud;
3) the user speaks slowly, so that the silence between the wake-up word and the command word is too long.
Disclosure of Invention
An embodiment of the present invention provides an optimization method and apparatus for a continuous wake-up recognition function, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides an optimization method for a continuous wake-up recognition function, including: continuously receiving audio until a wake-up word is detected; performing speech recognition on the first audio, which contains the wake-up word, to form a first speech recognition result, while continuing to cache, within a preset time, second audio received after the first audio; determining whether the first speech recognition result contains speech other than the wake-up word; if the first speech recognition result does not contain speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out; if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result; and if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result through a callback.
In a second aspect, an embodiment of the present invention provides an apparatus for optimizing a continuous wake-up recognition function, including: a wake-up detection module configured to continuously receive audio until a wake-up word is detected; a first recognition module configured to perform speech recognition on the first audio, which contains the wake-up word, to form a first speech recognition result, while continuing to cache, within a preset time, second audio received after the first audio; a recognition determining module configured to determine whether the first speech recognition result contains speech other than the wake-up word; a timeout determining module configured to determine whether voice activity detection on the second audio has timed out if the first speech recognition result does not contain speech other than the wake-up word; a second recognition module configured to perform speech recognition on the second audio to form a second speech recognition result if the voice activity detection has not timed out; and a callback module configured to return the second speech recognition result through a callback if the second speech recognition result contains speech other than the wake-up word.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a continuous wake up identification function of any embodiment of the present invention.
In a fourth aspect, the present invention further provides a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes the steps of the optimization method of the continuous wake recognition function according to any embodiment of the present invention.
According to the scheme provided by the method and the device, while speech recognition is performed on the audio containing the wake-up word, second audio received after the first audio continues to be cached within a preset time. When the first speech recognition result contains no speech other than the wake-up word, speech recognition is performed on the second audio to form a second speech recognition result, which covers the content following the wake-up word. The command word of a user who speaks slowly can therefore still be recognized, and the scheme can serve as a compensating optimization for existing continuous wake-up recognition.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an optimization method for a continuous wake-up recognition function according to an embodiment of the present invention;
fig. 2a, fig. 2b and fig. 2c are flowcharts illustrating an embodiment of a method for optimizing a continuous wake-up recognition function according to an embodiment of the present invention;
fig. 3 is a block diagram of an optimization apparatus for continuous wake-up recognition according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a flowchart of an embodiment of the method for optimizing a continuous wake-up recognition function according to the present application. The method of this embodiment may be applied to voice devices with a continuous wake-up recognition function or a OneShot function, such as smart speakers, story machines, smart voice televisions, smart voice handsets, and other smart voice devices.
As shown in fig. 1, in step 101, audio is continuously received until a wake-up word is detected;
in step 102, speech recognition is performed on the first audio, which contains the wake-up word, to form a first speech recognition result, and second audio received after the first audio is cached within a preset time;
in step 103, it is determined whether the first speech recognition result includes speech other than the wake-up word;
in step 104, if the first speech recognition result does not include speech other than the wake-up word, it is determined whether voice activity detection on the second audio has timed out;
in step 105, if the voice activity detection has not timed out, speech recognition is performed on the second audio to form a second speech recognition result;
in step 106, if the second speech recognition result includes speech other than the wake-up word, the second speech recognition result is returned through a callback.
In this embodiment, for step 101, the optimization device for the continuous wake-up recognition function may run voice activity detection until user speech is detected and then send that speech to the wake-up engine; a successful wake-up indicates that the wake-up word has been detected. For step 102, the device may send the audio containing the wake-up word to the speech recognition engine to obtain the first speech recognition result. The audio containing the wake-up word is the audio buffered from the detection of the wake-up word until voice activity detection finds no further human voice; for example, 50 ms of silence after the wake-up word indicates that the human voice has ended. While this audio is being sent to the speech recognition engine, the device may also cache, within a preset time, the second audio received after the first audio. The preset time may be set by the developer and is not limited here. It should not be too long, however, because the user would then perceive the system as slow to respond, which harms the user experience.
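The boundary between the first audio and the second audio described above can be illustrated with a minimal sketch. It assumes a simple energy-based silence test, a 10 ms frame length, and an arbitrary energy threshold; none of these are specified in the source, which only gives 50 ms of silence as an example end-of-voice condition. The function name `split_first_and_rest` is hypothetical.

```python
from typing import List, Sequence, Tuple

FRAME_MS = 10             # assumed frame length in milliseconds
SILENCE_MS = 50           # example trailing silence from the text
ENERGY_THRESHOLD = 0.01   # assumed threshold separating voice from silence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def split_first_and_rest(
    frames: List[Sequence[float]],
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Split frames into (first_audio, remaining_frames).

    first_audio ends once SILENCE_MS of consecutive silent frames has been
    seen; the remaining frames are what would be cached as second audio.
    """
    needed = SILENCE_MS // FRAME_MS
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) < ENERGY_THRESHOLD:
            silent_run += 1
            if silent_run >= needed:
                return frames[: i + 1], frames[i + 1:]
        else:
            silent_run = 0  # voice resumed, restart the silence count
    return frames, []       # no end point found yet
```

A production system would more likely use a trained voice activity detector than a fixed energy threshold; the sketch only shows where the end point falls.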
Then, in step 103, the device determines whether the first speech recognition result includes speech other than the wake-up word. In step 104, if it does not, voice activity detection may be performed on the second audio, and it is determined whether that detection has timed out. A timeout means that the voice activity detection has found no human voice for a long time, that is, the second audio contains no human voice.
In step 105, if the voice activity detection has not timed out, the second audio contains human voice, and speech recognition is performed on it to form the second speech recognition result. Because the second audio is audio that continued to be cached for a period after the audio containing the wake-up word was sent to the speech recognition engine, the second speech recognition result may contain the user's command word.
Finally, in step 106, if the second speech recognition result includes speech other than the wake-up word, it contains content that the first speech recognition result missed, such as the user's command word, so the second speech recognition result can be returned through a callback. In the OneShot scenario, if the first speech recognition result contains only the wake-up word, it is unclear whether the user has spoken a command word, which is why the second speech recognition is required.
In the method of this embodiment, while the audio containing the wake-up word is sent for speech recognition, the second audio received after the first audio continues to be cached within the preset time. When the first speech recognition result contains no speech other than the wake-up word, speech recognition is performed on the second audio to form the second speech recognition result, which covers the content following the wake-up word. The command word of a user who speaks slowly can therefore still be recognized, and the method can serve as a compensating optimization for existing continuous wake-up recognition.
In some optional embodiments, after performing speech recognition on the second audio to form the second speech recognition result, the method further comprises: if the second speech recognition result does not contain speech other than the wake-up word, throwing an empty recognition result. In that case the user may in fact have spoken only the wake-up word and no command word, so a recognition result of "" is thrown to indicate that continuous wake-up recognition did not succeed. This gives the system a signal on which it can take other measures, for example responding to the user's wake-up word as soon as possible.
In some embodiments, after determining whether the first speech recognition result includes speech other than the wake-up word, the method further includes: if the first speech recognition result contains speech other than the wake-up word, returning the first speech recognition result through a callback, so that enabling the optimization does not affect the existing OneShot scheme.
In a further optional embodiment, after determining whether the voice activity detection on the second audio has timed out when the first speech recognition result does not include speech other than the wake-up word, the method further includes: if the voice activity detection has timed out, throwing a recognition result of "". A timeout means that no human voice has been detected in the second audio for a long time, and the first speech recognition result already contains no speech other than the wake-up word. The flow can therefore end early at the timeout, without performing the second speech recognition on the second audio, which saves processing steps, shortens the system's processing time, and improves the user experience.
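Steps 103 to 106, together with the optional branches above, amount to one decision procedure. The following is a minimal sketch and not the patented implementation: `vad_timed_out` and `recognize` are hypothetical callables standing in for the voice activity detector and the speech recognition engine, and the sketch assumes that recognition results arrive with the wake-up word already removed, so an empty string means that only the wake-up word was heard.

```python
from typing import Callable

EMPTY_RESULT = ""  # the empty recognition result thrown by the optional embodiments


def continuous_wake_recognition(
    first_result: str,
    second_audio: bytes,
    vad_timed_out: Callable[[bytes], bool],
    recognize: Callable[[bytes], str],
) -> str:
    """Return the text that would be handed to the callback.

    Assumption: first_result and the value returned by recognize() have the
    wake-up word filtered out already.
    """
    if first_result:
        # Step 103: the first pass already captured a command.
        return first_result
    if vad_timed_out(second_audio):
        # Step 104: no human voice in the cached audio, end early.
        return EMPTY_RESULT
    # Step 105: second recognition on the cached audio.
    second_result = recognize(second_audio)
    # Step 106: return the command if present, otherwise the empty result.
    return second_result if second_result else EMPTY_RESULT
```

Because the first branch returns immediately, the second recognition is attempted only when the first pass yields nothing beyond the wake-up word, which matches the statement that the existing flow is unaffected.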
In further optional embodiments, caching the second audio received after the first audio within the preset time includes: after the wake-up word is detected and the first audio is sent for the first recognition, caching the second audio received after the first audio until the first speech recognition result is returned. The caching takes place while the device is waiting for the first result, so it takes up no additional time and does not lengthen the system's processing time. The original continuous wake-up recognition function is therefore only optimized, with no negative effect or side effect, and the user experience is better.
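Caching until the first result returns can be sketched with Python's `asyncio`. This is an illustration under assumptions, not the source's implementation: `mic_frames` is a hypothetical asynchronous stream of audio frames, `recognize` is a hypothetical coroutine wrapping the speech recognition request, and the function name `recognize_and_buffer` is invented for the example.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Tuple


async def recognize_and_buffer(
    first_audio: bytes,
    mic_frames: AsyncIterator[bytes],
    recognize: Callable[[bytes], Awaitable[str]],
) -> Tuple[str, List[bytes]]:
    """Run the first recognition while caching frames that arrive meanwhile."""
    buffered: List[bytes] = []

    async def buffer_frames() -> None:
        # Keep appending incoming frames until the task is cancelled.
        async for frame in mic_frames:
            buffered.append(frame)

    buffer_task = asyncio.create_task(buffer_frames())
    try:
        first_result = await recognize(first_audio)
    finally:
        # Stop caching as soon as the first result is back (or on error).
        buffer_task.cancel()
        try:
            await buffer_task
        except asyncio.CancelledError:
            pass
    return first_result, buffered
```

The buffering task runs only during the wait for the first result, so the sketch adds no waiting time of its own, in line with the point made above.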
The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.
In the course of implementing the present application, the inventors found that the defects of the prior art mainly arise as follows. OneShot recognizes the wake-up word plus the command word, and the OneShot function relies on VAD to detect human voice: the audio in which voice is detected is sent to the wake-up engine, and after wake-up the same audio is sent for recognition. If VAD detects the end of speech right after the wake-up word, the subsequent command word is lost.
Because the technologies offered to the public are mostly general-purpose and limited in their range of use, those skilled in the art usually try to avoid these defects by improving audio quality.
The OneShot optimization scheme gives the customer an option that is not needed in most cases. When the above defects do occur, however, the present application provides a solution, and that solution has no side effects on the original OneShot function.
In the scheme provided by the present application, once the user has enabled the OneShot optimization function, ASR is first performed on the audio sent after wake-up while the audio following the wake-up word is cached. If the recognition result contains only the wake-up word, the cached audio is sent to VAD; if human voice is present, ASR is performed again, and the recognized result is fed back to the user.
In the embodiments of the present application, the audio sent for wake-up need not be processed by VAD first; it may be sent directly to the wake-up engine. After the wake-up engine detects a wake-up, the same segment of audio and the subsequent audio are sent to VAD for detection. If the user leaves a slightly long interval between the wake-up word and the command word, VAD may detect the end of speech right after the wake-up word, and the subsequent command word may be lost. Whether the command word has been lost has to be determined by the subsequent second recognition, so the second recognition compensates for a command word lost during the first recognition.
Please refer to fig. 2a, which shows a flowchart of a OneShot optimization scheme according to an embodiment of the present application. As shown in fig. 2a:
Step 1: Speech is fed to the wake-up engine until a wake-up word is detected.
Step 2: The wake-up audio is sent to the ASR engine for VAD detection.
Step 3: If the wake-up audio is speech, ASR recognition is performed, and the audio following the wake-up audio is cached.
Step 4: After the ASR server returns the recognition result, caching of the audio stops.
Step 5: If the recognition result is empty (the ASR service filters out the wake-up word), the wake-up audio contains only the wake-up word and the flow continues to the next step; if it is not empty, the recognition result is returned through a callback and the flow ends.
Step 6: A second recognition is performed on the cached audio and the audio that follows it: the local VAD performs voice detection, and ASR recognition is performed when speech is present.
Step 7: The recognition result is returned through a callback, and the flow ends.
With continued reference to fig. 2b and 2c, there is shown a flow diagram explaining a part of the highlighted steps of fig. 2 a.
OneShot refers to the wake-up word plus the command word. In the following, the example utterance is "hello relax, turn on the light", where the wake-up word is "hello relax" and the command word is "turn on the light".
The OneShot optimization procedure covers the following three cases:
1. The user says "hello relax, turn on the light". The first recognition returns "turn on the light", which is returned through a callback, and the flow ends. This case works already and needs no optimization. (The wake-up word is filtered out by the server; the same applies below.)
2. The user says "hello relax, turn on the light", but the first recognition returns "", indicating that only the wake-up word was captured, so the second recognition starts. The second recognition returns "turn on the light", which is returned through a callback, and the flow ends. This is the case the optimization targets.
3. The user says only "hello relax". The first recognition returns "", indicating that only the wake-up word was captured, so the second recognition starts. The second recognition returns "", or the second VAD times out, indicating that the user spoke only the wake-up word; "" is returned through a callback, and the flow ends. The optimization does not affect the case where only the wake-up word is spoken.
The OneShot optimization mainly improves case 2 without affecting cases 1 and 3.
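All three cases rely on the server removing the wake-up word, so that an empty string signals "wake-up word only". The source does not say how the server does this; the sketch below assumes a simple case-insensitive prefix match, and `filter_wake_word` is a hypothetical name.

```python
def filter_wake_word(transcript: str, wake_word: str) -> str:
    """Remove a leading wake-up word and return the remaining command text.

    Assumption: the wake-up word, if present, appears at the start of the
    transcript and is matched case-insensitively.
    """
    text = transcript.strip()
    if text.lower().startswith(wake_word.lower()):
        text = text[len(wake_word):]
    # Drop separators left over around the command.
    return text.strip(" ,.!?")


# Case 1: wake-up word and command in one pass.
case_1 = filter_wake_word("hello relax, turn on the light", "hello relax")
# Cases 2 and 3 (first pass): only the wake-up word was captured.
wake_only = filter_wake_word("hello relax", "hello relax")
# Case 2 (second pass): the cached audio holds just the command.
case_2_second = filter_wake_word("turn on the light", "hello relax")
```

With this convention the client only has to test whether the returned string is empty to decide if a second recognition is needed.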
The audio sent to ASR for recognition is further described as follows:
The audio sent to ASR the first time includes the wake-up audio and the audio following it; the exact end point is determined by VAD. (VAD performs human-voice detection, and the absence of voice determines the end point of the ASR audio; the same applies below.)
The audio sent to ASR the second time includes the audio cached during the first recognition and the audio following it; the exact end point is determined by VAD.
With the scheme provided by the embodiments of the present application, the existing OneShot function can be optimized without being adversely affected.
Referring to fig. 3, a block diagram of an apparatus for optimizing a continuous wake up identification function according to an embodiment of the present invention is shown.
As shown in fig. 3, the apparatus 300 for optimizing the continuous wake-up recognition function includes a wake-up detection module 310, a first recognition module 320, a recognition determination module 330, a timeout determination module 340, a second recognition module 350, and a callback module 360.
Wherein the wake-up detection module 310 is configured to continuously receive the audio until a wake-up word is detected; the first recognition module 320 is configured to perform voice recognition on the audio including the wakeup word to form a first voice recognition result, and continue to cache a second audio received after the first audio within a preset time; a recognition judging module 330 configured to judge whether the first speech recognition result includes speech other than the wakeup word; a timeout determining module 340 configured to determine whether the voice activity detection for the second audio is timeout if the first voice recognition result does not include a voice other than the wakeup word; a second recognition module 350 configured to perform speech recognition on the second audio to form a second speech recognition result if the voice activity detection is not over time; and a callback module 360 configured to callback the second speech recognition result if the second speech recognition result includes speech other than the wakeup word.
In some optional embodiments, the apparatus further comprises a result throwing module (not shown in the figure) configured to throw an empty recognition result if the second speech recognition result does not include speech other than the wake-up word.
In some other optional embodiments, the callback module is further configured to return the first speech recognition result through a callback if the first speech recognition result contains speech other than the wake-up word.
It should be understood that the modules depicted in fig. 3 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 3, and are not described again here.
It should be noted that the names of the modules in the embodiments of the present application do not limit the scheme of the present application; for example, the recognition determining module may equally be described as a module for determining whether the first speech recognition result includes speech other than the wake-up word. In addition, the functional modules may be implemented by a hardware processor; for example, the recognition determining module may be implemented by a processor, which is not described again here.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may perform the optimization method for the continuous wake-up recognition function in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
continuously receiving audio until a wake-up word is detected;
performing speech recognition on first audio containing the wake-up word to form a first speech recognition result, and continuing to cache, within a preset time, second audio received after the first audio;
determining whether the first speech recognition result contains speech other than the wake-up word;
if the first speech recognition result does not contain speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out;
if the voice activity detection has not timed out, performing speech recognition on the second audio to form a second speech recognition result;
and if the second speech recognition result contains speech other than the wake-up word, returning the second speech recognition result through a callback.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the optimizing means of the continuous wake up identifying function, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory remotely located from the processor, and these remote memories may be connected over a network to the optimizing device for continuous wake recognition functionality. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product, which includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes any one of the above optimization methods for continuous wake up recognition function.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device includes one or more processors 410 and a memory 420; one processor 410 is taken as an example in Fig. 4. The device for the above method may further include an input device 430 and an output device 440. The processor 410, the memory 420, the input device 430, and the output device 440 may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 4. The memory 420 is the non-volatile computer-readable storage medium described above. By running the non-volatile software programs, instructions, and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the server, that is, it implements the optimization method for the continuous wake-up recognition function of the above method embodiment. The input device 430 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the optimization device for the continuous wake-up recognition function. The output device 440 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to an optimization apparatus for a continuous wake-up recognition function, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
continuously receiving audio until a wake-up word is detected;
performing voice recognition on a first audio containing the wake-up word to form a first voice recognition result, and continuing to cache, within a preset time, a second audio received after the first audio;
determining whether the first voice recognition result contains speech other than the wake-up word;
if the first voice recognition result does not contain speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out;
if the voice activity detection has not timed out, performing voice recognition on the second audio to form a second voice recognition result;
and if the second voice recognition result contains speech other than the wake-up word, calling back the second voice recognition result.
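The caching of the second audio described in the steps above might look like the following minimal sketch. It assumes audio arrives as byte frames, that timestamps are supplied by the caller, and that caching ends either when the preset time has elapsed or when the first voice recognition result arrives (as in claim 5). The class name `SecondAudioCache` and its methods are hypothetical and do not come from the source.

```python
from typing import List, Optional


class SecondAudioCache:
    """Caches audio frames received after the wake-up word (hypothetical helper)."""

    def __init__(self, preset_seconds: float) -> None:
        self.preset_seconds = preset_seconds  # assumed length of the preset time
        self._frames: List[bytes] = []
        self._start: Optional[float] = None
        self._stopped = False

    def start(self, now: float) -> None:
        """Begin caching when the wake-up word has been detected."""
        self._frames = []
        self._start = now
        self._stopped = False

    def push(self, frame: bytes, now: float) -> bool:
        """Cache one frame; return False once caching has ended."""
        if self._start is None or self._stopped:
            return False
        if now - self._start > self.preset_seconds:
            self._stopped = True  # preset time elapsed
            return False
        self._frames.append(frame)
        return True

    def stop(self) -> bytes:
        """Stop caching (e.g. first result received) and return the second audio."""
        self._stopped = True
        return b"".join(self._frames)
```

Passing `now` explicitly instead of reading a clock inside the class is a design choice made here only to keep the sketch deterministic and easy to test.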
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily intended to provide voice and data communication. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, among others.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: a server has an architecture similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An optimization method for a continuous wake-up recognition function, comprising:
continuously receiving audio until a wake-up word is detected;
performing voice recognition on a first audio containing the wake-up word to form a first voice recognition result, and continuing to cache, within a preset time, a second audio received after the first audio;
determining whether the first voice recognition result contains speech other than the wake-up word;
if the first voice recognition result does not contain speech other than the wake-up word, determining whether voice activity detection on the second audio has timed out;
if the voice activity detection has not timed out, performing voice recognition on the second audio to form a second voice recognition result;
and if the second voice recognition result contains speech other than the wake-up word, calling back the second voice recognition result.
2. The method of claim 1, wherein after performing voice recognition on the second audio to form the second voice recognition result, the method further comprises:
if the second voice recognition result does not contain speech other than the wake-up word, throwing a result indicating that the recognition is empty.
3. The method of claim 1, wherein after determining whether the first voice recognition result contains speech other than the wake-up word, the method further comprises:
if the first voice recognition result contains speech other than the wake-up word, calling back the first voice recognition result.
4. The method of any one of claims 1-3, wherein after determining, if the first voice recognition result does not contain speech other than the wake-up word, whether the voice activity detection on the second audio has timed out, the method further comprises:
if the voice activity detection has timed out, throwing a result indicating that the recognition is empty.
5. The method of claim 4, wherein continuing to cache, within the preset time, the second audio received after the first audio comprises:
after the wake-up word is detected, continuing to cache the second audio received after the first audio, and stopping the caching once the first voice recognition result is received.
6. An apparatus for optimizing a continuous wake-up recognition function, comprising:
a wake-up detection module configured to continuously receive audio until a wake-up word is detected;
a first recognition module configured to perform voice recognition on a first audio containing the wake-up word to form a first voice recognition result, and to continue caching, within a preset time, a second audio received after the first audio;
a recognition judging module configured to determine whether the first voice recognition result contains speech other than the wake-up word;
a timeout judging module configured to determine, if the first voice recognition result does not contain speech other than the wake-up word, whether voice activity detection on the second audio has timed out;
a second recognition module configured to perform voice recognition on the second audio to form a second voice recognition result if the voice activity detection has not timed out;
and a callback module configured to call back the second voice recognition result if the second voice recognition result contains speech other than the wake-up word.
7. The apparatus of claim 6, wherein the apparatus further comprises:
an error throwing module configured to throw a result indicating that the recognition is empty if the second voice recognition result does not contain speech other than the wake-up word.
8. The apparatus of claim 6, wherein the callback module is further configured to:
call back the first voice recognition result if the first voice recognition result contains speech other than the wake-up word.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.
10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 6.
CN201911379635.6A 2019-12-27 2019-12-27 Optimization method and device for continuous awakening recognition function Active CN111128166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911379635.6A CN111128166B (en) 2019-12-27 2019-12-27 Optimization method and device for continuous awakening recognition function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911379635.6A CN111128166B (en) 2019-12-27 2019-12-27 Optimization method and device for continuous awakening recognition function

Publications (2)

Publication Number Publication Date
CN111128166A true CN111128166A (en) 2020-05-08
CN111128166B CN111128166B (en) 2022-11-25

Family

ID=70504254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911379635.6A Active CN111128166B (en) 2019-12-27 2019-12-27 Optimization method and device for continuous awakening recognition function

Country Status (1)

Country Link
CN (1) CN111128166B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109994106A (en) * 2017-12-29 2019-07-09 阿里巴巴集团控股有限公司 A kind of method of speech processing and equipment
CN108962262A (en) * 2018-08-14 2018-12-07 苏州思必驰信息科技有限公司 Voice data processing method and device
CN109378000A (en) * 2018-12-19 2019-02-22 科大讯飞股份有限公司 Voice awakening method, device, system, equipment, server and storage medium
CN110473539A (en) * 2019-08-28 2019-11-19 苏州思必驰信息科技有限公司 Promote the method and apparatus that voice wakes up performance

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897584A (en) * 2020-08-14 2020-11-06 苏州思必驰信息科技有限公司 Wake-up method and device for voice equipment
CN111897584B (en) * 2020-08-14 2022-07-08 思必驰科技股份有限公司 Wake-up method and device for voice equipment

Also Published As

Publication number Publication date
CN111128166B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
AU2019246868B2 (en) Method and system for voice activation
US20210287671A1 (en) Speech recognition method, speech wakeup apparatus, speech recognition apparatus, and terminal
CN108962262B (en) Voice data processing method and device
CN109147779A (en) Voice data processing method and device
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
KR20160005050A (en) Adaptive audio frame processing for keyword detection
WO2017151406A1 (en) Conversational software agent
WO2017151415A1 (en) Speech recognition
CN113362828B (en) Method and apparatus for recognizing speech
WO2021082133A1 (en) Method for switching between man-machine dialogue modes
JP2020527739A (en) Speaker dialization
CN110968353A (en) Central processing unit awakening method and device, voice processor and user equipment
US20230317096A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN109686372B (en) Resource playing control method and device
CN111128166B (en) Optimization method and device for continuous awakening recognition function
CN109697981B (en) Voice interaction method, device, equipment and storage medium
US10726829B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN112634911B (en) Man-machine conversation method, electronic device and computer readable storage medium
JP6817386B2 (en) Voice recognition methods, voice wakeup devices, voice recognition devices, and terminals
CN112700767B (en) Man-machine conversation interruption method and device
CN112447177B (en) Full duplex voice conversation method and system
CN108922523B (en) Position prompting method and device, storage medium and electronic equipment
CN113362845B (en) Method, apparatus, device, storage medium and program product for noise reduction of sound data
CN115083396A (en) Voice processing method and device for audio tail end detection, electronic equipment and medium
CN112786031B (en) Man-machine conversation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant