CN112634897B - Equipment awakening method and device, storage medium and electronic device - Google Patents

Equipment awakening method and device, storage medium and electronic device

Info

Publication number
CN112634897B
Authority
CN
China
Prior art keywords
target
audio signal
awakening
model
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011635662.8A
Other languages
Chinese (zh)
Other versions
CN112634897A (en)
Inventor
赵培
苏腾荣
张卓博
朱文博
葛路奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd
Priority to CN202011635662.8A
Publication of CN112634897A
Application granted
Publication of CN112634897B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a device wake-up method and apparatus, a storage medium, and an electronic apparatus. The method comprises: acquiring a target audio signal, where the target audio signal is an audio signal collected by a target device in a target time interval, and the usage frequency of the target device in the target time interval is less than or equal to a target threshold; determining the target audio signal as a candidate audio signal when the target audio signal carries a wake-up keyword, where the wake-up keyword is used to start the target device entering a voice interaction state; sending the candidate audio signal to a server so that the server recognizes and verifies the candidate wake-up keyword in the candidate audio signal; and, when the recognition and verification result returned by the server indicates that the wake-up keyword has passed verification, determining to wake up the target device and controlling the target device to enter the voice interaction state. The invention addresses the technical problem of low device wake-up accuracy.

Description

Equipment awakening method and device, storage medium and electronic device
Technical Field
The present invention relates to the field of computers, and in particular, to a device wake-up method and apparatus, a storage medium, and an electronic apparatus.
Background
As intelligent voice technology matures, more and more household devices adopt it, and the demand for and use scenarios of voice interaction keep growing. At home in particular, people are increasingly used to giving instructions and obtaining information by voice. Because speech is itself the way humans communicate, voice interaction frees both hands and extends the interaction distance, making the interaction between people and smart home appliances more natural.
However, in daily use an intelligent terminal cannot judge the user's instructions with complete accuracy. Take waking up the intelligent terminal as an example: when the user utters the wake-up word "small superior, small superior", the intelligent terminal switches to the wake-up mode and responds to the user. This process is subject to misjudgment: other speech of the user or background sound may be treated as the wake-up word, or the actual wake-up word may get no response. Both degrade the user experience. At night in particular, a false wake-up of a smart device often disturbs the user's rest or even frightens the user. That is, the prior art has the technical problem of low device wake-up accuracy.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a device awakening method and device, a storage medium and an electronic device, and at least solves the technical problem of low device awakening accuracy.
According to an aspect of an embodiment of the present invention, there is provided a device wake-up method, including: acquiring a target audio signal, wherein the target audio signal is an audio signal acquired by target equipment in a target time interval, and the use frequency of the target equipment in the target time interval is less than or equal to a target threshold value; determining the target audio signal as a candidate audio signal under the condition that the target audio signal carries an awakening keyword, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state; sending the candidate audio signal to a server so that the server identifies and verifies the candidate awakening words in the candidate audio signal; and under the condition that the identification and verification result returned by the server indicates that the awakening keyword passes verification, determining to awaken the target equipment and controlling the target equipment to enter the voice interaction state.
According to another aspect of the embodiments of the present invention, there is also provided a device wake-up apparatus, including: a first acquisition unit, configured to acquire a target audio signal, where the target audio signal is an audio signal collected by a target device in a target time interval, and the usage frequency of the target device in the target time interval is less than or equal to a target threshold; a first determining unit, configured to determine the target audio signal as a candidate audio signal when the target audio signal carries a wake-up keyword, where the wake-up keyword is used to start the target device entering a voice interaction state; a sending unit, configured to send the candidate audio signal to a server, so that the server recognizes and verifies the candidate wake-up keyword in the candidate audio signal; and a second determining unit, configured to determine to wake up the target device and control the target device to enter the voice interaction state when the recognition and verification result returned by the server indicates that the wake-up keyword has passed verification.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the above device wake-up method when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the device wake-up method through the computer program.
In the embodiments of the present invention, a target audio signal is acquired, where the target audio signal is an audio signal collected by a target device in a target time interval, and the usage frequency of the target device in the target time interval is less than or equal to a target threshold; the target audio signal is determined as a candidate audio signal when it carries a wake-up keyword, where the wake-up keyword is used to start the target device entering a voice interaction state; the candidate audio signal is sent to a server so that the server recognizes and verifies the candidate wake-up keyword in the candidate audio signal; and, when the recognition and verification result returned by the server indicates that the wake-up keyword has passed verification, it is determined to wake up the target device, and the target device is controlled to enter the voice interaction state. On top of the original first audio check for the wake-up word, the server performs a second audio check on whether the wake-up word is included. This achieves the technical purpose of waking up the device accurately, thereby improving device wake-up accuracy and solving the technical problem of low device wake-up accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
FIG. 1 is a schematic diagram of an application environment of an alternative device wake-up method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a flow chart of an alternative device wake-up method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an alternative device wake-up method according to an embodiment of the invention;
FIG. 4 is a schematic diagram of an alternative device wake-up method according to an embodiment of the invention;
FIG. 5 is a schematic diagram of an alternative device wake-up method according to an embodiment of the invention;
FIG. 6 is a schematic diagram of an alternative device wake-up method according to an embodiment of the invention;
FIG. 7 is a schematic diagram of an alternative device wake-up apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a device wake-up method is provided. As an optional implementation, the device wake-up method may be applied, but is not limited, to the environment shown in FIG. 1, which may include, but is not limited to, the user device 102, the network 110, and the server 112. The user device 102 may include, but is not limited to, the display 108, the processor 106, and the memory 104. Optionally, the user device 102 may be, but is not limited to, a target device 1024, and the target device 1024 may be, but is not limited to, a smart home device in the Internet of Things. The target device 1024 may acquire audio signals within a target range, for example a target audio signal emitted by a target object 1022 located within the target range of the target device 1024. Based on the target audio signal, the target device 1024 is instructed to generate a response audio and a target instruction corresponding to the target audio signal, where the target instruction is used to instruct the target device 1024 to execute a target event.
The specific process comprises the following steps:
Step S102, the user device 102 acquires a target audio signal triggered by the target object 1022, and the target device 1024 performs a first recognition check to determine whether the target audio signal includes an audio signal corresponding to the wake-up keyword, where the target object 1022 is located within the audio acquisition range of the target device 1024;
Steps S104-S106, the user device 102 sends the target audio signal to the server 112 via the network 110;
Step S108, the server 112 performs a second recognition check through the processing engine 116, generates a recognition check result, and stores the recognition check result in the database 114;
Steps S110-S112, the server 112 sends the recognition check result to the user device 102 through the network 110; the processor 106 in the user device 102 instructs the target device 1024 to respond and/or execute the corresponding instruction event according to the recognition check result, displays the execution result of the instruction event on the display 108, and stores the recognition check result and the execution result in the memory 104.
Optionally, as an optional implementation manner, as shown in fig. 2, the device wake-up method includes:
S202, acquiring a target audio signal, where the target audio signal is an audio signal collected by a target device in a target time interval, and the usage frequency of the target device in the target time interval is less than or equal to a target threshold;
S204, determining the target audio signal as a candidate audio signal when the target audio signal carries a wake-up keyword, where the wake-up keyword is used to start the target device entering a voice interaction state;
S206, sending the candidate audio signal to a server so that the server recognizes and verifies the candidate wake-up keyword in the candidate audio signal;
S208, when the recognition and verification result returned by the server indicates that the wake-up keyword has passed verification, determining to wake up the target device and controlling the target device to enter the voice interaction state.
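Steps S204 to S208 can be sketched as a two-stage decision. The following Python is only an illustration under stated assumptions: `decide_wake`, `WakeDecision`, `offline_check`, and `server_check` are hypothetical names that do not appear in the patent, and the two checks are passed in as plain callables so that no particular model or network protocol is implied. The caller is assumed to have already established (S202) that the audio was collected in the target time interval.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WakeDecision:
    wake: bool    # True if the device should enter the voice interaction state
    reason: str   # human-readable explanation of the decision


def decide_wake(
    target_audio: bytes,
    offline_check: Callable[[bytes], bool],  # hypothetical on-device first check
    server_check: Callable[[bytes], bool],   # hypothetical server-side second check
) -> WakeDecision:
    """Two-stage wake-up decision for audio collected in the target interval."""
    # S204: the on-device check decides whether the audio becomes a candidate.
    if not offline_check(target_audio):
        return WakeDecision(False, "no wake-up keyword detected on device")
    candidate_audio = target_audio

    # S206: only candidates are sent to the server for verification.
    if not server_check(candidate_audio):
        return WakeDecision(False, "server verification failed")

    # S208: the server confirmed the keyword, so the device is woken up.
    return WakeDecision(True, "wake-up keyword verified by server")
```

In this sketch the server is never contacted when the on-device check rejects the audio, which mirrors the screening role the description gives to the first check.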
Optionally, in this embodiment, the device wake-up method may be applied, but is not limited, to waking up a smart home device that is in a power-saving sleep state in an Internet of Things scenario. A smart home device in the power-saving sleep state may, but need not, keep its voice signal receiving channel active. The channel receives surrounding sound signals in real time and feeds the signal stream to a wake-up model module. The wake-up model applies a threshold decision to the signal to determine whether it is speech. If it is, a series of preprocessing steps is performed, such as voice activity detection (VAD), noise suppression, echo cancellation, and gain control. Finally, depending on the type of wake-up model, it is determined whether the speech is the wake-up word and whether to run the wake-up procedure. Because the hardware configuration of an intelligent terminal product is limited, a larger and finer wake-up model module cannot be burned into it, so the wake-up model module will, with some probability, misjudge a speech signal as the wake-up word. The intelligent terminal device is then woken up against the user's wishes, which is the false wake-up phenomenon. With the device wake-up method described here, the server performs a second recognition check on top of the first recognition check performed by the original wake-up model module. Because the server can host a wake-up model module with higher recognition precision, misjudgments of the wake-up word are compensated for and device wake-up accuracy is improved.
Optionally, in this embodiment, the target time interval may be, but is not limited to, a system mode interval such as a do-not-disturb interval, a night interval, or a working-hours interval, and may also be set flexibly by the user; these are only examples and are not limiting. The target time interval is one in which the requirement on wake-up response time is low, or in which the user can accept a longer wake-up time. For example, in the night interval the user wakes the device infrequently and is tolerant of slower wake-up responses, so a longer response time does not greatly reduce the user experience. The user's tolerance of false wake-ups in the night interval is low, however: if a false wake-up occurs and disturbs the user's rest, the user experience is greatly reduced. In summary, in the target time interval the user's tolerance of wake-up response latency is high, or the user's tolerance of false wake-ups is low.
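As a minimal sketch of how a device might decide that the current moment falls in a target time interval, the Python below combines a clock window with the usage-frequency condition. The function names, the half-open window convention, and the way usage is counted are all assumptions made for illustration; the patent does not specify them.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` lies in [start, end), allowing windows that cross midnight."""
    if start <= end:
        return start <= now < end
    # Window such as 22:00-07:00 wraps around midnight.
    return now >= start or now < end


def is_target_interval(
    now: time,
    start: time,
    end: time,
    uses_in_window: int,     # hypothetical count of device uses observed in this window
    target_threshold: int,   # usage frequency must be <= this value
) -> bool:
    """A window qualifies only if the device is used rarely enough within it."""
    return in_window(now, start, end) and uses_in_window <= target_threshold
```

For a night window from 22:00 to 07:00, `in_window(time(23, 30), time(22, 0), time(7, 0))` is true, while the same call with `time(12, 0)` is false.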
Optionally, in this embodiment, the wake-up keyword may be used, but is not limited, to wake up the device and instruct it to enter a voice interaction state. For example, if the wake-up keyword is "small a", then when the target device detects an audio signal carrying "small a", the voice interaction state of the target device is started, and a corresponding instruction event is then executed based on the detected audio signal. Optionally, the wake-up keyword may also include, but is not limited to, at least one of the following: audio, timbre, tone, and the like. For example, an audio signal triggered by user B is not considered to match the wake-up keyword even if it includes "small a", because its audio, timbre, or tone does not match; only a detected audio signal "small a" triggered by user A is considered to match the wake-up keyword.
Optionally, in this embodiment, the target device may determine the target audio signal as the candidate audio signal through, but not limited to, a first recognition check performed on the target audio signal by an offline wake-up model in the target device. Because the configuration of the target device is limited, a larger and finer offline wake-up model cannot be burned into it, so the offline wake-up model's recognition check of the wake-up keyword still has a relatively high error rate.
Optionally, in this embodiment, the candidate audio signal, which carries this relatively high error rate, is sent to the server so that the server performs the second recognition check on it. Because the server is not subject to the same configuration limits, it can be configured with a finer model for the recognition check, which greatly improves the accuracy of the recognition check of the wake-up keyword.
It should be noted that a target audio signal is obtained, where the target audio signal is an audio signal collected by a target device in a target time interval, and a usage frequency of the target device in the target time interval is less than or equal to a target threshold; determining the target audio signal as a candidate audio signal under the condition that the target audio signal carries an awakening keyword, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state; sending the candidate audio signals to a server so that the server can identify and check the candidate awakening words in the candidate audio signals; and under the condition that the identification and verification result returned by the server indicates that the awakening keyword passes verification, determining to awaken the target equipment and controlling the target equipment to enter a voice interaction state.
For further illustration, an application scenario of the optional device wake-up method is shown in FIG. 3 and includes a target object 302, a target device 304, and a server 306. The target device 304 acquires a continuous or discontinuous initial audio signal of the target object 302 and performs a first recognition check to determine whether the acquired initial audio signal includes the wake-up keyword. If it does, the target device 304 processes the initial audio signal into a candidate audio signal and sends the candidate audio signal to the server 306 over the network. The server 306 then performs a second recognition check to verify whether the candidate audio signal includes the wake-up keyword.
According to the embodiment provided by the application, a target audio signal is acquired, where the target audio signal is an audio signal collected by a target device in a target time interval, and the usage frequency of the target device in the target time interval is less than or equal to a target threshold; the target audio signal is determined as a candidate audio signal when it carries a wake-up keyword, where the wake-up keyword is used to start the target device entering a voice interaction state; the candidate audio signal is sent to a server so that the server recognizes and verifies the candidate wake-up keyword in the candidate audio signal; and, when the recognition and verification result returned by the server indicates that the wake-up keyword has passed verification, it is determined to wake up the target device, and the target device is controlled to enter the voice interaction state. On top of the original first audio check for the wake-up word, the server performs a second audio check on whether the wake-up word is included, which achieves the technical purpose of waking up the device accurately and thereby improves device wake-up accuracy.
As an optional scheme, after acquiring the target audio signal, the method further includes:
S1, inputting the target audio signal into an offline wake-up model, where the offline wake-up model is used to recognize the wake-up keyword;
S2, obtaining a first recognition result output by the offline wake-up model, where the first recognition result indicates whether the target audio signal carries the wake-up keyword;
S3, determining the target audio signal as the candidate audio signal when the first recognition result indicates that the target audio signal carries the wake-up keyword.
Optionally, in this embodiment, before the target audio signal is input into the offline wake-up model, it may first be determined, but is not limited to this, whether the target device is in a non-voice-interaction state (for example, a sleep state, a node state, and the like); the target audio signal is then input into the offline wake-up model of the target device when the target device is in the non-voice-interaction state.
Optionally, in this embodiment, the target device in the non-voice-interaction state may keep the voice signal receiving channel active. The channel receives surrounding sound signals in real time and feeds the signal stream to the offline wake-up model. The offline wake-up model applies a threshold decision to the signals to determine whether they are speech; if they are, a series of preprocessing steps such as voice activity detection (VAD), noise suppression, echo cancellation, and gain control is performed. Finally, the offline wake-up model determines whether the collected sound signals include the wake-up keyword and whether to run the subsequent sending procedure (for example, sending the processed sound signals to the server).
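The threshold decision mentioned above is not specified further in the description. One common and very simple choice, used here purely as an assumption, is to compare the root-mean-square (RMS) energy of a short frame against a fixed threshold. The names `rms` and `is_speech_like` and the default threshold value are hypothetical.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples assumed in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def is_speech_like(frame: Sequence[float], energy_threshold: float = 0.02) -> bool:
    """Assumed gate: frames at or above the energy threshold go on to preprocessing."""
    return rms(frame) >= energy_threshold
```

A frame of silence is rejected by this gate, while a frame with sustained amplitude passes and would continue to VAD, noise suppression, echo cancellation, and gain control. A real device would likely use a more robust detector; this only illustrates where a threshold sits in the pipeline.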
Optionally, in this embodiment, the target audio signal is processed based on the first recognition result to obtain the candidate audio signal. This may be, but is not limited to, packaging (a segment or a part of) the audio signal whose wake-up keyword recognition rate is greater than or equal to a preset threshold into the candidate audio signal. Even though the offline wake-up model's recognition accuracy for audio signals that include the wake-up keyword is low, its recognition accuracy for audio signals that do not include the wake-up keyword is not low, so the offline wake-up model can pre-screen the audio signals and improve the efficiency of the subsequent recognition check.
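The packaging step can be illustrated as follows. This is a sketch under assumptions: the `Detection` structure, its sample-index fields, and the `make_candidate` helper are hypothetical, and the offline model is assumed to report a score together with the position of the detected keyword segment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    start: int    # index of the first sample of the detected keyword segment
    end: int      # index one past the last sample of the segment
    score: float  # recognition score reported by the offline wake-up model


def make_candidate(
    samples: List[int],
    detection: Optional[Detection],
    preset_threshold: float,
) -> Optional[List[int]]:
    """Return the keyword segment as the candidate audio, or None if screened out."""
    # No detection, or a score below the preset threshold: nothing is sent onward.
    if detection is None or detection.score < preset_threshold:
        return None
    # Package only the segment that carries the keyword.
    return samples[detection.start:detection.end]
```

Sending only the detected segment keeps the upload small, which fits the description's point that the first check exists to make the later verification more efficient.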
It should be noted that, the target audio signal is input into the offline wake-up model, where the offline wake-up model is used to identify a wake-up keyword; acquiring a first identification result output by an offline awakening model, wherein the first identification result is used for indicating whether a target audio signal carries an awakening keyword or not; and under the condition that the first identification result indicates that the target audio signal carries the awakening keyword, determining the target audio signal as the candidate audio signal.
According to the embodiment provided by the application, the target audio signal is input into the offline awakening model, wherein the offline awakening model is used for identifying the awakening keyword; acquiring a first identification result output by the offline awakening model, wherein the first identification result is used for indicating whether the target audio signal carries an awakening keyword; and under the condition that the first identification result indicates that the target audio signal carries the awakening keyword, determining the target audio signal as the candidate audio signal, so that the purpose of primarily screening the audio signal is achieved, and the effect of improving the efficiency of overall audio identification and verification is realized.
As an optional scheme, after determining to wake up the target device, the method further includes:
S1, playing the response audio corresponding to the wake-up keyword; or
S1, when the target audio signal also carries an execution keyword, playing the response audio and executing the target event corresponding to the execution keyword.
Optionally, in this embodiment, the response audio may be, but is not limited to, a fixed audio signal, such as "I'm here", or a flexibly configured audio signal. A plurality of wake-up keywords may be set, and a response audio corresponding to each wake-up keyword may further be set, for example the response audio "Small A is here" for the wake-up keyword "small a".
Optionally, in this embodiment, in order to further improve the waking accuracy of the device, in the case that the wake-up keyword is identified, the corresponding response audio may not be played first, but the execution keyword is identified, and in the case that the execution keyword is identified, the corresponding response audio is played.
As a further example, based on the scenario shown in FIG. 3 and continuing with the example shown in FIG. 4, assume that the wake-up keyword includes "small a" and the execution keyword includes "turn on air conditioner". The target device 304 acquires the target audio signal 402 initiated by the target object 302. The target audio signal 402 then passes the recognition checks of the target device 304 and the server 306 in sequence, and it is determined that the target audio signal 402 includes the wake-up keyword "small a". The target device 304 is therefore started into the voice interaction state, but the response audio corresponding to the wake-up keyword is not played, because no execution keyword was recognized and verified in the target audio signal 402.
As a further example, optionally based on the scenario shown in FIG. 4 and continuing with the scenario shown in FIG. 5, assume that the target device 304 is currently in the voice interaction state. The target device 304 acquires the audio signal 502 initiated by the target object 302 at the next moment (that is, the audio signal collected at the moment following the collection time of the target audio signal 402). The target device 304 alone, or the target device 304 together with the server 306, recognizes and verifies that the audio signal 502 includes an execution keyword, and the target device 304 is instructed to execute the target event corresponding to that execution keyword. For example, if the target device 304 is a smart air conditioner, the target device 304 is started into its operating state. Optionally, the collection interval between the target audio signal 402 and the audio signal 502 is less than or equal to a preset duration. The preset duration may also be, but is not limited to, the duration for which the voice interaction state of the target device 304 stays on; in other words, when the voice interaction state of the target device 304 has been on for the preset duration, it is closed.
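The timing rule in this example can be sketched as a small state object. Everything named here is hypothetical (`VoiceInteractionState`, `handle_next_audio`, the event table, and the use of seconds as the time unit). The sketch takes the "less than or equal to a preset duration" wording literally, so audio arriving exactly at the limit is still accepted.

```python
from typing import Dict, Optional


class VoiceInteractionState:
    """Tracks whether the device is still in the voice interaction state."""

    def __init__(self, preset_duration_s: float) -> None:
        self.preset_duration_s = preset_duration_s
        self.started_at_s: Optional[float] = None

    def start(self, now_s: float) -> None:
        # Called when the wake-up keyword has passed verification.
        self.started_at_s = now_s

    def is_active(self, now_s: float) -> bool:
        if self.started_at_s is None:
            return False
        if now_s - self.started_at_s > self.preset_duration_s:
            # Preset duration exceeded: close the voice interaction state.
            self.started_at_s = None
            return False
        return True


def handle_next_audio(
    state: VoiceInteractionState,
    now_s: float,
    execution_keyword: Optional[str],
    events: Dict[str, str],  # hypothetical mapping: execution keyword -> target event
) -> Optional[str]:
    """Return the target event to execute, or None if nothing should run."""
    if execution_keyword is None or not state.is_active(now_s):
        return None
    return events.get(execution_keyword)
```

With a table such as `{"turn on air conditioner": "start_air_conditioner"}`, an execution keyword heard inside the window yields the event name, and the same keyword heard after the window yields `None`.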
It should be noted that the response audio corresponding to the wake-up keyword is played; or, when the target audio signal also carries the execution keyword, the response audio is played and the target event corresponding to the execution keyword is executed.
According to the embodiment provided by the application, the response audio corresponding to the awakening keyword is played; or under the condition that the target audio signal also carries the execution keyword, the response audio is played, and the target event corresponding to the execution keyword is executed, so that the aim of reducing the indication error rate of equipment operation is fulfilled, and the effect of improving the indication accuracy of the equipment operation is realized.
As an optional scheme, before determining to wake up the target device, the method further includes:
S1, inputting a candidate audio signal into a first verification model of a server, wherein the first verification model is a neural network model which is obtained by training a plurality of sample audio data and is used for identifying awakening keywords;
S2, obtaining a second identification result of the first verification model, wherein the second identification result is used for indicating whether the candidate audio signal carries the awakening keyword;
and S3, determining an identification checking result based on the second identification result.
Optionally, in this implementation, the first verification model may be, but is not limited to, an automatic speech recognition (ASR) model invoked on the cloud (server), where the ASR model may include, but is not limited to, a plurality of sub-network models, such as a word network model, a word pronunciation network model, a demiphone network model, a phoneme network model, a tone network model, an audio network model, and the like.
It should be noted that, in current schemes for waking up a voice intelligent terminal, the policy judgment is mainly performed on the terminal itself. That is, after the offline wake-up model judges that a word is a wake-up word, the voice intelligent terminal activates the device according to the judgment result, responds to the user, communicates with the cloud, and prepares for subsequent voice interaction. In actual production, a common voice intelligent device is subject to harsher constraints than in a research and development environment, so a production device solution generally balances recognition effect against scene constraints. As a result, a relatively fine model cannot be burned onto a device with limited configuration (the finer the model, the larger it is, and the larger the model, the higher the configuration it requires), which causes false wake-ups with a certain probability. False wake-ups are particularly troublesome for the user at night.
Optionally, in this implementation, a secondary judgment is performed on the suspected wake-up word by a high-precision first verification model on the server, so as to reduce the false wake-up rate. Because the configuration of the server can be regarded as far less limited, a finer model is deployed on the server as the first verification model; compared with the offline wake-up model, the first verification model has a higher configuration and produces a finer recognition result.
The candidate audio signals are input into a first verification model of the server, wherein the first verification model is a neural network model for identifying the awakening keyword, and the neural network model is obtained by training a plurality of sample audio data; acquiring a second identification result of the first verification model, wherein the second identification result is used for indicating whether the candidate audio signal carries an awakening keyword; and determining an identification check result based on the second identification result.
According to the embodiment provided by the application, the candidate audio signals are input into a first verification model of the server, wherein the first verification model is a neural network model which is obtained by training a plurality of sample audio data and is used for identifying the awakening keyword; acquiring a second identification result of the first verification model, wherein the second identification result is used for indicating whether the candidate audio signal carries an awakening keyword; and determining the identification and verification result based on the second identification result, so that the aim of identifying and verifying the awakening keyword by using a relatively accurate model is fulfilled, and the effect of improving the identification and verification accuracy of the awakening keyword is realized.
As an optional scheme, before acquiring the target audio signal, the method further includes:
S1, obtaining a plurality of sample audio data;
S2, marking the type of the audio data in each sample audio data to obtain a plurality of marked sample audio data, wherein each marked sample audio data comprises a marked hot sound identifier and a marked homophone identifier, the hot sound identifier is used for marking the audio data with the recording times being more than or equal to a hot degree threshold value, and the homophone identifier is used for marking the audio data with the pronunciation similarity being more than or equal to a phoneme threshold value;
and S3, inputting the marked multiple sample audio data into an initial first verification model to train and obtain the first verification model.
Optionally, in this embodiment, the plurality of sample audio data may be obtained by, but is not limited to, converting the target audio signal into an electrical signal through voiceprint recognition, then determining the type of the audio data according to the characteristics of the electrical signal and marking that type. Training to obtain the first verification model may be, but is not limited to being, based on a CTC (connectionist temporal classification) algorithm, which maps one input sequence to one output sequence. In other words, the CTC algorithm is concerned only with whether the predicted output sequence is close to or identical to the true sequence, and not with whether each result in the predicted output sequence is exactly aligned in time with the input sequence.
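The alignment-free property of CTC mentioned above can be illustrated with the standard CTC collapse rule, under which consecutive repeated labels are merged and blank labels are then removed. The following is only a minimal sketch of that rule, not the training procedure of the embodiment; the function name `ctc_collapse` and the choice of 0 as the blank index are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Map a per-frame label sequence to its output sequence.

    Consecutive repeats are merged first, then blanks are dropped, so many
    different frame-level alignments yield the same output sequence.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]
```

For instance, the frame sequences `[1, 1, 0, 0, 2]` and `[1, 0, 2, 2, 2]` both collapse to `[1, 2]`, which is why the exact time alignment of each predicted label does not need to match the input.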
Optionally, in this embodiment, the audio data whose recording times are greater than or equal to the heat threshold may be obtained and marked through a hotword list, and the audio data whose pronunciation similarity is greater than or equal to the phoneme threshold may be obtained and marked through a similar-pronunciation word list. The hotword list may store, but is not limited to storing, a plurality of sets of audio data whose recording times are greater than or equal to the heat threshold, and the similar-pronunciation word list stores a plurality of sets of audio data whose pronunciation similarity is greater than or equal to the phoneme threshold.
It should be noted that, a plurality of sample audio data are obtained; marking the type of the audio data in each sample audio data to obtain a plurality of marked sample audio data, wherein each marked sample audio data comprises a marked hot sound identifier and a marked homophone identifier, the hot sound identifier is used for marking the audio data with the recording times larger than or equal to a heat threshold, and the homophone identifier is used for marking the audio data with the pronunciation similarity larger than or equal to a phoneme threshold; and inputting the marked multiple sample audio data into an initial first verification model to train and obtain the first verification model. Optionally, the first verification model obtained by training performs recognition verification on the acquired audio signal through a two-layer mechanism of a hotword list and phoneme generalization.
According to the embodiment provided by the application, a plurality of sample audio data are obtained; marking the type of the audio data in each sample audio data to obtain a plurality of marked sample audio data, wherein each marked sample audio data comprises a marked hot sound identifier and a marked homophone identifier, the hot sound identifier is used for marking the audio data with the recording times larger than or equal to a heat threshold, and the homophone identifier is used for marking the audio data with the pronunciation similarity larger than or equal to a phoneme threshold; a plurality of marked sample audio data are input into the initial first check model to train and obtain the first check model, the purpose of determining the position of the awakening keyword by identifying different types of audio identifiers is achieved, and the effect of improving the model accuracy is achieved.
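The marking step can be sketched as follows. The sketch assumes that each sample is represented by its transcript, that recording times are available as a count per transcript, and that pronunciation similarity is a score in [0, 1] measured against the wake-up keyword; these representations, and the names `LabeledSample` and `label_samples`, are hypothetical and not specified by the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LabeledSample:
    transcript: str
    hot: bool        # hot sound identifier
    homophone: bool  # homophone identifier


def label_samples(transcripts: Iterable[str],
                  record_counts: Dict[str, int],
                  similarity: Dict[str, float],
                  heat_threshold: int,
                  phoneme_threshold: float) -> List[LabeledSample]:
    """Attach both identifiers to every sample using the two thresholds."""
    labeled = []
    for text in transcripts:
        labeled.append(LabeledSample(
            transcript=text,
            # recording times >= heat threshold -> hot sound identifier
            hot=record_counts.get(text, 0) >= heat_threshold,
            # pronunciation similarity >= phoneme threshold -> homophone identifier
            homophone=similarity.get(text, 0.0) >= phoneme_threshold,
        ))
    return labeled
```

The two dictionaries play the role of the hotword list and the similar-pronunciation word list; samples missing from either are treated as not meeting the corresponding threshold.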
As an alternative, inputting the marked multiple sample audio data into an initial first verification model to train to obtain the first verification model, including:
S1, repeatedly executing the following steps until a first verification model is obtained:
S2, determining current sample audio data from the marked sample audio data, and determining a current first verification model, wherein the current sample audio data comprises a marked current hot tone identification and a marked current homophonic identification;
S3, outputting a current identification result through the current first verification model, wherein the current identification result is used for indicating whether the current sample audio data comprises the awakening keyword or not;
S4, under the condition that the current identification result does not reach the identification convergence condition, acquiring next sample audio data as the current sample audio data;
and S5, determining the current first verification model as the first verification model under the condition that the current recognition result reaches the recognition convergence condition.
It should be noted that, the following steps are repeatedly executed until the first verification model is obtained: determining current sample audio data from the marked sample audio data, and determining a current first verification model, wherein the current sample audio data comprises a marked current hot tone identifier and a marked current homophonic identifier; outputting a current identification result through a current first verification model, wherein the current identification result is used for indicating whether the current sample audio data comprises an awakening keyword or not; under the condition that the current identification result does not reach the identification convergence condition, acquiring next sample audio data as the current sample audio data; and under the condition that the current identification result reaches the identification convergence condition, determining that the current first verification model is the first verification model. Alternatively, the condition that the repeating step is stopped may be, but is not limited to, that the error rate of the first verification model is lower than a preset threshold.
By the embodiment provided by the application, the following steps are repeatedly executed until the first verification model is obtained: determining current sample audio data from the marked sample audio data, and determining a current first verification model, wherein the current sample audio data comprises a marked current hot tone identifier and a marked current homophonic identifier; outputting a current identification result through a current first verification model, wherein the current identification result is used for indicating whether the current sample audio data comprises an awakening keyword or not; under the condition that the current identification result does not reach the identification convergence condition, acquiring next sample audio data as the current sample audio data; under the condition that the current recognition result reaches the recognition convergence condition, the current first verification model is determined to be the first verification model, the aim of efficiently acquiring the trained first verification model is achieved, and the effect of improving the acquisition efficiency of the first verification model is achieved.
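The repeated steps can be sketched as a generic loop. The sketch assumes that the parameter update and the error-rate measurement are supplied as callables, and that the convergence condition is the optional one stated above (error rate lower than a preset threshold); the names `train_until_converged`, `update`, `error_rate`, and `max_error` are hypothetical, and raising an error when the samples run out is a design choice of this sketch only.

```python
from typing import Callable, Iterable, TypeVar

M = TypeVar("M")  # model type
S = TypeVar("S")  # sample type


def train_until_converged(model: M,
                          samples: Iterable[S],
                          update: Callable[[M, S], M],
                          error_rate: Callable[[M], float],
                          max_error: float) -> M:
    """Feed marked samples one by one until the convergence condition holds."""
    for sample in samples:
        # Run the current model on the current sample and adjust it
        # (the update rule itself is assumed, not given in the text).
        model = update(model, sample)
        # Convergence condition: error rate below the preset threshold.
        if error_rate(model) < max_error:
            return model
        # Otherwise the loop takes the next sample as the current sample.
    raise RuntimeError("samples exhausted before the convergence condition was met")
```

With a toy model such as an integer counter, `train_until_converged(0, range(10), lambda m, s: m + 1, lambda m: 1 / (m + 1), 0.3)` stops after the third sample.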
As an alternative, after inputting the marked multiple sample audio data into the initial first verification model to train and obtain the first verification model, the method includes:
and compressing the trained first verification model to obtain a compressed first verification model.
Optionally, in this embodiment, to address the case where the feedback delay of the device wake-up method is long, the first verification model is compressed, taking advantage of the fact that the wording of the wake-up keyword is relatively uniform, so as to improve the real-time rate and thereby reduce the overall feedback time.
Optionally, in this embodiment, after the trained first verification model is obtained, the compressed first verification model may further be used as an initial second verification model, and the marked multiple sample audio data are input into the initial second verification model to obtain the second verification model through training. In this way, compression overcomes the extra feedback time caused by high precision, and the secondary training maintains a certain precision while the real-time rate is improved and the overall feedback time is reduced.
It should be noted that the trained first verification model is compressed to obtain an initial second verification model, and the marked sample audio data are input into the initial second verification model to train and obtain a second verification model.
According to the embodiment provided by the application, the trained first verification model is compressed to obtain an initial second verification model, and the marked sample audio data are input into the initial second verification model to train and obtain the second verification model, thereby achieving the purpose of improving the real-time rate and realizing the effect of reducing the overall feedback time.
As an alternative scheme, for convenience of understanding, an example of an awake scene of the intelligent terminal in the night mode is illustrated in fig. 6, and the specific steps are as follows:
step S602, acquiring a section of initial audio signal;
step S604, the intelligent terminal judges whether it is in a night mode; if not, step S606 is executed, and if so, step S608 is executed;
step S606, the intelligent terminal equipment identifies and verifies the initial audio signal;
step S608, the intelligent terminal device executes first identification and verification on the initial audio signal;
step S610, determining whether the initial audio signal includes a wake-up keyword, if not, performing step S602, and if so, performing step S612;
step S612, the intelligent terminal device processes the initial audio signal into a candidate audio signal based on the first identification and verification, and sends the candidate audio signal to the server;
step S614, the server executes second identification verification on the candidate audio signals;
step S616, determining whether the candidate audio signal includes a wake-up keyword, if not, performing step S602, and if so, performing step S618;
and step S618, waking up the intelligent terminal equipment so that it enters a voice interaction state.
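The flow of fig. 6 can be summarized in a short sketch. It assumes that the device-side check and the server-side check are available as callables returning whether the wake-up keyword was found, and that outside the night mode the device-side result alone decides the outcome (the steps above do not spell out what follows step S606); the names `should_wake`, `device_check`, and `server_check` are hypothetical.

```python
from typing import Callable


def should_wake(audio: bytes,
                night_mode: bool,
                device_check: Callable[[bytes], bool],
                server_check: Callable[[bytes], bool]) -> bool:
    """Decide whether to enter the voice interaction state (step S618)."""
    # S606 / S608 and S610: recognition check on the terminal device.
    if not device_check(audio):
        return False  # back to S602: wait for the next audio signal
    if not night_mode:
        # Assumption: outside the night mode no second check is requested.
        return True
    # S612 to S616: the candidate audio signal is sent to the server,
    # which performs the second recognition check.
    return server_check(audio)
```

In the night mode the server is consulted only for signals that already passed the device-side check, so the additional latency is paid only for suspected wake-up words.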
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the present invention, a device wake-up apparatus for implementing the above device wake-up method is also provided. As shown in fig. 7, the apparatus includes:
a first obtaining unit 702, configured to obtain a target audio signal, where the target audio signal is an audio signal collected by a target device in a target time interval, and a usage frequency of the target device in the target time interval is less than or equal to a target threshold;
a first determining unit 704, configured to determine a target audio signal as a candidate audio signal when the target audio signal carries a wake-up keyword, where the wake-up keyword is used to start a target device to enter a voice interaction state;
a sending unit 706, configured to send the candidate audio signal to a server, so that the server performs recognition and verification on the candidate wake-up word in the candidate audio signal;
a second determining unit 708, configured to determine to wake the target device and control the target device to enter a voice interaction state if the recognition and verification result returned by the server indicates that the wake-up keyword has passed the verification.
Optionally, in this embodiment, the device wake-up apparatus may be, but is not limited to being, applied in an Internet of Things scenario to wake up a smart home device in a power-saving sleep state. The smart home device in the power-saving sleep state may, but is not limited to, keep a voice signal receiving channel active. The voice signal receiving channel receives surrounding sound signals in real time and inputs the signal stream to a wake-up model module. The wake-up model module performs a threshold determination on the signal to decide whether it is speech; if so, it performs a series of preprocessing steps such as voice activity detection (VAD), noise suppression, echo cancellation, and gain control, finally determines whether the speech is a wake-up word according to the wake-up model type, and decides whether to perform the wake-up procedure. Because the configuration of an intelligent terminal product is limited, a larger and finer wake-up model module cannot be burned onto it, so the wake-up model module misjudges a speech signal as a wake-up word with a certain probability, and the intelligent terminal device is eventually woken up against the user's wishes, i.e., a false wake-up occurs. With the device wake-up apparatus, however, the server performs a second recognition check on top of the first recognition check performed by the original wake-up model module. Because the server can host a wake-up model module with higher recognition precision, misjudgments of wake-up words are compensated for and the wake-up accuracy of the device is improved.
Optionally, in this embodiment, the target time interval may be, but is not limited to, a system mode interval such as a do-not-disturb interval, a night interval, or a working interval, and may also be set flexibly by the user; these are only examples and are not limiting. The target time interval places a low requirement on the response time of device wake-up; that is, the user can accept a longer wake-up time in the target time interval. For example, in the night interval the user wakes the device infrequently and has a high tolerance for the response efficiency of wake-up, so even a long wake-up response time does not greatly reduce the user experience. The user's tolerance for false wake-ups in the night interval is low, however, and if a false wake-up does occur and disturbs the user, the user experience is greatly reduced. In summary, in the target time interval the user's tolerance for the response efficiency of device wake-up is high, or the user's tolerance for false wake-ups is low.
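Deciding whether the current time falls in a target time interval such as a night interval can be sketched as below. The sketch assumes the interval is given by a start and an end time of day and may wrap past midnight; the function name `in_target_interval` and the example hours are hypothetical.

```python
from datetime import time


def in_target_interval(now: time, start: time, end: time) -> bool:
    """Return True if `now` lies in [start, end), allowing wrap past midnight."""
    if start <= end:
        # Interval within a single day, e.g. 09:00 to 17:00.
        return start <= now < end
    # Interval that crosses midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end
```

For a night interval from 22:00 to 07:00, both 23:30 and 03:00 fall inside the interval, while 12:00 and 07:00 itself fall outside it.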
Optionally, in this embodiment, the wake-up keyword may be used to, but is not limited to, wake up the device and instruct it to enter a voice interaction state. For example, when the wake-up keyword is "small a" and the target device detects an audio signal carrying "small a", the voice interaction state of the target device is started, and a corresponding instruction event is then executed based on the detected audio signal. Optionally, the wake-up keyword may also include, but is not limited to, at least one of the following: audio, timbre, tone, etc. For example, an audio signal triggered by user B is not considered to match the wake-up keyword even if it includes "small a", because its audio, timbre, or tone does not match; only when the audio signal "small a" triggered by user A is detected is it considered to match the wake-up keyword.
Optionally, in this embodiment, the target device may determine the target audio signal as the candidate audio signal through, but not limited to, a first recognition check performed on the target audio signal by an offline wake-up model in the target device. Because the configuration of the target device is limited, a larger and finer offline wake-up model cannot be burned onto it, and the recognition check of the wake-up keyword by the offline wake-up model therefore still has a relatively large error rate.
Optionally, in this embodiment, the candidate audio signal, whose recognition still has a relatively large error rate, is sent to the server so that the server performs the second recognition check on it. Because the server is not subject to the same configuration limits, a finer model can be configured to perform the recognition check, which greatly improves the accuracy of the recognition check on the wake-up keyword.
It should be noted that a target audio signal is obtained, where the target audio signal is an audio signal acquired by a target device in a target time interval, and a frequency of use of the target device in the target time interval is less than or equal to a target threshold; determining the target audio signal as a candidate audio signal under the condition that the target audio signal carries an awakening keyword, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state; sending the candidate audio signals to a server so that the server can identify and check the candidate awakening words in the candidate audio signals; and under the condition that the identification and verification result returned by the server indicates that the awakening keyword passes verification, determining to awaken the target equipment and controlling the target equipment to enter a voice interaction state.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, and details are not described herein again.
According to the embodiment provided by the application, a target audio signal is obtained, wherein the target audio signal is an audio signal collected by target equipment in a target time interval, and the use frequency of the target equipment in the target time interval is less than or equal to a target threshold; determining the target audio signal as a candidate audio signal under the condition that the target audio signal carries an awakening keyword, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state; sending the candidate audio signals to a server so that the server can identify and check the candidate awakening words in the candidate audio signals; and determining to awaken the target equipment under the condition that the identification verification result returned by the server indicates that the awakening keyword passes the verification, controlling the target equipment to enter a voice interaction state, and performing secondary audio verification on whether the awakening word is included or not by using the server on the basis that whether the original primary audio verification is the awakening word or not so as to achieve the technical purpose of accurately awakening the equipment, thereby realizing the technical effect of improving the awakening accuracy of the equipment.
As an optional solution, the apparatus further includes:
a first input unit, configured to input the target audio signal into an offline wake-up model after the target audio signal is acquired, where the offline wake-up model is used for identifying the wake-up keyword;
a second obtaining unit, configured to obtain, after the target audio signal is acquired, a first recognition result output by the offline wake-up model, where the first recognition result is used for indicating whether the target audio signal carries the wake-up keyword;
and a processing unit, configured to determine, after the target audio signal is acquired, the target audio signal as the candidate audio signal when the first recognition result indicates that the target audio signal carries the wake-up keyword.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, and details in this example are not described herein again.
As an optional scheme, the apparatus further includes:
a first playing unit, configured to play response audio corresponding to the wake-up keyword after it is determined to wake up the target device; or
a second playing unit, configured to, after it is determined to wake up the target device, play the response audio and execute the target event corresponding to the execution keyword when the target audio signal also carries the execution keyword.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, and details in this example are not described herein again.
As an optional scheme, the apparatus further includes:
a second input unit, configured to input the candidate audio signal into a first verification model of the server before it is determined to wake up the target device, where the first verification model is a neural network model which is obtained by training with a plurality of sample audio data and is used for identifying the wake-up keyword;
a third obtaining unit, configured to obtain a second identification result of the first verification model before it is determined to wake up the target device, where the second identification result is used for indicating whether the candidate audio signal carries the wake-up keyword;
and a third determining unit, configured to determine the identification check result based on the second identification result.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, and details in this example are not described herein again.
As an optional scheme, the apparatus further includes:
a fourth obtaining unit, configured to obtain a plurality of sample audio data before the target audio signal is acquired;
a marking unit, configured to mark, before the target audio signal is acquired, the type of audio data in each sample audio data to obtain a plurality of marked sample audio data, where each marked sample audio data includes a marked hot sound identifier and a marked homophone identifier, the hot sound identifier is used for marking audio data whose recording times are greater than or equal to a heat threshold, and the homophone identifier is used for marking audio data whose pronunciation similarity is greater than or equal to a phoneme threshold;
and a third input unit, configured to input the marked sample audio data into an initial first verification model before the target audio signal is acquired, so as to train and obtain the first verification model.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, and details in this example are not described herein again.
As an alternative, the third input unit includes:
a repeating module for repeatedly executing the following steps until a first verification model is obtained:
the determining module is used for determining current sample audio data from the marked multiple sample audio data and determining a current first verification model, wherein the current sample audio data comprises a marked current hot tone identifier and a marked current homophone identifier;
the output module is used for outputting a current identification result through the current first verification model, wherein the current identification result is used for indicating whether the current sample audio data comprises the awakening keyword or not;
the acquisition module is used for acquiring next sample audio data as the current sample audio data under the condition that the current identification result does not reach the identification convergence condition;
and the determining module is used for determining that the current first verification model is the first verification model under the condition that the current identification result reaches the identification convergence condition.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, and details in this example are not described herein again.
As an alternative, the apparatus includes:
a fifth obtaining unit, configured to compress the trained first verification model to obtain the compressed first verification model.
For a specific embodiment, reference may be made to the example shown in the above device wake-up method, which is not described herein again.
According to another aspect of the embodiments of the present invention, there is also provided an electronic apparatus for implementing the device wake-up method, as shown in fig. 8, the electronic apparatus includes a memory 802 and a processor 804, the memory 802 stores a computer program, and the processor 804 is configured to execute the steps in any of the method embodiments by the computer program.
Optionally, in this embodiment, the electronic apparatus may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring a target audio signal, wherein the target audio signal is an audio signal acquired by target equipment in a target time interval, and the use frequency of the target equipment in the target time interval is less than or equal to a target threshold;
S2, under the condition that the target audio signal carries an awakening keyword, determining the target audio signal as a candidate audio signal, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state;
S3, sending the candidate audio signals to a server so that the server can identify and check the candidate awakening words in the candidate audio signals;
and S4, under the condition that the identification and verification result returned by the server indicates that the awakening keyword passes the verification, determining to awaken the target equipment, and controlling the target equipment to enter a voice interaction state.
Optionally, those skilled in the art will understand that the structure shown in Fig. 8 is only illustrative, and the electronic apparatus may also be a terminal device such as a smartphone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 8 does not limit the structure of the electronic apparatus. For example, the electronic apparatus may include more or fewer components (e.g., a network interface) than shown in Fig. 8, or have a configuration different from that shown in Fig. 8.
The memory 802 may be used to store software programs and modules, such as program instructions/modules corresponding to the device wake-up method and apparatus in the embodiments of the present invention, and the processor 804 executes various functional applications and data processing by running the software programs and modules stored in the memory 802, so as to implement the device wake-up method described above. The memory 802 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 802 can further include memory located remotely from the processor 804, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 802 may be specifically, but not limited to, used for storing information such as a target audio signal, a candidate audio signal, and a voice interaction state. As an example, as shown in fig. 8, the memory 802 may include, but is not limited to, a first obtaining unit 702, a first determining unit 704, a sending unit 706, and a second determining unit 708 in the device wake-up apparatus. In addition, the present invention may further include, but is not limited to, other module units in the device wake-up apparatus, which are not described in detail in this example.
Optionally, the transmission device 806 is configured to receive or transmit data via a network. Examples of the network include wired networks and wireless networks. In one example, the transmission device 806 includes a network adapter (Network Interface Controller, NIC), which can be connected to a router via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 806 is a Radio Frequency (RF) module, which communicates with the internet wirelessly.
In addition, the electronic device further includes: a display 808, configured to display information such as the target audio signal, the candidate audio signal, and a voice interaction state; and a connection bus 810 for connecting the respective module parts in the above-described electronic apparatus.
According to a further aspect of an embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring a target audio signal, wherein the target audio signal is an audio signal acquired by a target device in a target time interval, and the use frequency of the target device in the target time interval is less than or equal to a target threshold;
S2, in a case where the target audio signal carries an awakening keyword, determining the target audio signal as a candidate audio signal, wherein the awakening keyword is used for starting the target device to enter a voice interaction state;
S3, sending the candidate audio signal to a server, so that the server identifies and verifies the candidate awakening keyword in the candidate audio signal;
S4, in a case where the identification and verification result returned by the server indicates that the awakening keyword passes verification, determining to wake up the target device, and controlling the target device to enter the voice interaction state.
Optionally, in this embodiment, those skilled in the art will understand that all or part of the steps of the methods in the foregoing embodiments may be implemented by a program instructing hardware related to a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The above serial numbers of the embodiments of the present invention are for description only, and do not represent the relative merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing one or more computer devices (which may be personal computers, servers, or network devices) to execute all or part of the steps of the methods according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A device wake-up method, comprising:
acquiring a target audio signal, wherein the target audio signal is an audio signal acquired by target equipment in a target time interval, and the use frequency of the target equipment in the target time interval is less than or equal to a target threshold value;
determining the target audio signal as a candidate audio signal under the condition that the target audio signal carries an awakening keyword, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state;
sending the candidate audio signals to a server so that the server can identify and check the candidate awakening words in the candidate audio signals;
under the condition that the identification and verification result returned by the server indicates that the awakening keyword passes verification, the target equipment is determined to be awakened, and the target equipment is controlled to enter the voice interaction state;
after the acquiring the target audio signal, further comprising:
inputting the target audio signal into an offline wake-up model, wherein the offline wake-up model is used for identifying the wake-up keyword;
acquiring a first identification result output by the offline awakening model, wherein the first identification result is used for indicating whether the target audio signal carries the awakening keyword or not;
determining the target audio signal as the candidate audio signal under the condition that the first identification result indicates that the target audio signal carries the awakening keyword;
before inputting the target audio signal into the offline wakeup model, whether the target device is in a non-voice interaction state is judged, and then the target audio signal is input into the offline wakeup model of the target device under the condition that the target device is in the non-voice interaction state.
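The ordering in claim 1 (state check first, then the offline wake-up model, then candidate selection) can be sketched as follows. This is an illustrative sketch only: the State enum, select_candidate, and the callable standing in for the offline wake-up model are hypothetical names, not part of the claim.

```python
from enum import Enum, auto
from typing import Callable, Optional


class State(Enum):
    NON_VOICE_INTERACTION = auto()
    VOICE_INTERACTION = auto()


def select_candidate(
    audio: bytes,
    state: State,
    offline_model: Callable[[bytes], bool],
) -> Optional[bytes]:
    """Return the audio as a candidate signal, or None if it does not qualify."""
    # The model is consulted only while the device is NOT already interacting.
    if state is not State.NON_VOICE_INTERACTION:
        return None
    # First identification result: does the audio carry the awakening keyword?
    first_result = offline_model(audio)
    return audio if first_result else None


calls = []


def toy_offline_model(audio: bytes) -> bool:
    # Stand-in for the on-device model; records each invocation for inspection.
    calls.append(audio)
    return audio.startswith(b"hello")


candidate = select_candidate(b"hello device", State.NON_VOICE_INTERACTION, toy_offline_model)
skipped = select_candidate(b"hello device", State.VOICE_INTERACTION, toy_offline_model)
```

Because the state check comes first, the toy model above is invoked once, not twice: audio arriving during an ongoing voice interaction never reaches it.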
2. The method of claim 1, after determining to wake up the target device, further comprising:
playing response audio corresponding to the awakening keyword; or
and under the condition that the target audio signal also carries an execution keyword, playing the response audio and executing a target event corresponding to the execution keyword.
3. The method of claim 1, further comprising, prior to the determining to wake the target device:
inputting the candidate audio signal into a first verification model of the server, wherein the first verification model is a neural network model which is obtained by training a plurality of sample audio data and is used for identifying the awakening keyword;
acquiring a second identification result of the first verification model, wherein the second identification result is used for indicating whether the candidate audio signal carries the awakening keyword;
and determining the identification checking result based on the second identification result.
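A server-side view of claim 3 can be sketched as below. The claim only states that the first verification model is a trained neural network; here it is replaced by a callable that returns a score, and the 0.5 threshold, the VerificationResult type, and the function name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class VerificationResult:
    keyword_verified: bool  # what the server returns to the device


def verify_on_server(
    candidate_audio: bytes,
    first_verification_model: Callable[[bytes], float],
    threshold: float = 0.5,  # assumed decision threshold, not given in the source
) -> VerificationResult:
    # Input the candidate audio signal into the first verification model.
    score = first_verification_model(candidate_audio)
    # Second identification result: whether the candidate carries the keyword.
    second_result = score >= threshold
    # The identification and verification result is derived from that result.
    return VerificationResult(keyword_verified=second_result)


def toy_model(audio: bytes) -> float:
    # Toy scorer standing in for the neural network.
    return 0.9 if b"wake" in audio else 0.1


accepted = verify_on_server(b"wake word", toy_model)
refused = verify_on_server(b"background tv", toy_model)
```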
4. The method of claim 3, further comprising, prior to said obtaining the target audio signal:
obtaining the plurality of sample audio data;
marking the type of the audio data in each sample audio data to obtain the marked multiple sample audio data, wherein each marked sample audio data comprises a marked hot sound identifier and a homophone identifier, the hot sound identifier is used for marking the audio data with the recording times larger than or equal to a hot degree threshold, and the homophone identifier is used for marking the audio data with the pronunciation similarity larger than or equal to a phoneme threshold;
and inputting the marked sample audio data into an initial first verification model to train to obtain the first verification model.
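The marking rule in claim 4 amounts to two threshold comparisons per sample, which the following sketch makes explicit. The SampleAudio fields, the 0.0 to 1.0 similarity scale, and the choice to measure similarity against the awakening keyword are assumptions; the claim itself only names the two identifiers and their thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleAudio:
    name: str
    recording_count: int             # how many times this audio was recorded
    pronunciation_similarity: float  # assumed: similarity to the keyword, 0.0 to 1.0
    hot_sound: bool = False          # hot sound identifier
    homophone: bool = False          # homophone identifier


def mark_samples(
    samples: List[SampleAudio],
    heat_threshold: int,
    phoneme_threshold: float,
) -> List[SampleAudio]:
    for sample in samples:
        # Hot sound: recording count greater than or equal to the heat threshold.
        sample.hot_sound = sample.recording_count >= heat_threshold
        # Homophone: pronunciation similarity greater than or equal to the phoneme threshold.
        sample.homophone = sample.pronunciation_similarity >= phoneme_threshold
    return samples


marked = mark_samples(
    [
        SampleAudio("tv_jingle", recording_count=120, pronunciation_similarity=0.2),
        SampleAudio("near_miss_phrase", recording_count=3, pronunciation_similarity=0.85),
    ],
    heat_threshold=100,
    phoneme_threshold=0.8,
)
```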
5. The method of claim 4, wherein the inputting the marked sample audio data into an initial first verification model to train the first verification model comprises:
repeatedly executing the following steps until the first verification model is obtained:
determining current sample audio data from the marked multiple sample audio data, and determining a current first verification model, wherein the current sample audio data comprises a marked current hot sound identifier and a marked current homophone identifier;
outputting a current identification result through the current first verification model, wherein the current identification result is used for indicating whether the wake-up keyword is included in the current sample audio data;
under the condition that the current identification result does not reach the identification convergence condition, acquiring next sample audio data as the current sample audio data;
and under the condition that the current identification result reaches the identification convergence condition, determining the current first verification model as the first verification model.
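The loop in claim 5 can be illustrated with a toy model. The claim does not specify the model update or the convergence condition, so this sketch assumes a one-feature logistic classifier updated by gradient descent, and it treats a run of consecutive correct results as the identification convergence condition. All names and numbers are hypothetical.

```python
import math
from itertools import cycle
from typing import List, Tuple


def predict(w: float, b: float, x: float) -> float:
    # Probability that the sample contains the awakening keyword.
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train_until_converged(
    samples: List[Tuple[float, int]],  # (feature, label) pairs; label 1 = keyword present
    lr: float = 0.5,
    required_streak: int = 20,         # assumed convergence condition
    max_steps: int = 10_000,
) -> Tuple[float, float]:
    w, b = 0.0, 0.0  # initial first verification model
    streak = 0
    for step, (x, label) in enumerate(cycle(samples), start=1):
        # Output the current identification result with the current model.
        p = predict(w, b, x)
        correct = (p >= 0.5) == bool(label)
        streak = streak + 1 if correct else 0
        # Convergence reached: the current model becomes the first verification model.
        if streak >= required_streak:
            return w, b
        if step >= max_steps:
            raise RuntimeError("convergence condition not reached")
        # Not converged: update the model, then move on to the next sample.
        w -= lr * (p - label) * x
        b -= lr * (p - label)
    raise ValueError("no samples provided")


toy_samples = [(2.0, 1), (-2.0, 0), (1.5, 1), (-1.0, 0)]
w, b = train_until_converged(toy_samples)
```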
6. The method of claim 4, wherein after the inputting the marked sample audio data into an initial first verification model to train the first verification model, the method further comprises:
and compressing the trained first verification model to obtain the compressed first verification model.
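Claim 6 does not say how the trained model is compressed. As one possible illustration, the sketch below applies symmetric 8-bit linear quantization to a list of weights; the technique, the function names, and the sample weights are assumptions and are not taken from the source.

```python
from typing import List, Tuple


def quantize(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] plus one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the compressed form."""
    return [q * scale for q in quantized]


trained_weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize(trained_weights)
restored = dequantize(quantized, scale)
```

Each weight then needs one byte instead of a 32-bit or 64-bit float, at the cost of a rounding error of at most half the scale per weight.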
7. A device wake-up apparatus, comprising:
under the condition that the current time is in a target time interval, acquiring a target audio signal triggered by a target object at the current time, wherein the use frequency of target equipment in the target time interval is less than or equal to a target threshold value;
determining the target audio signal as a candidate audio signal under the condition that the target audio signal carries an awakening keyword, wherein the awakening keyword is used for starting the target equipment to enter a voice interaction state;
sending the candidate audio signals to a server so that the server performs identification and verification on the candidate awakening words in the candidate audio signals;
when the recognition and verification result returned by the server indicates that the awakening keyword passes verification, the target equipment is confirmed to be awakened, and the target equipment is controlled to enter the voice interaction state;
the apparatus further comprises:
a first input unit, configured to input the target audio signal into an offline awakening model after the target audio signal is acquired, wherein the offline awakening model is used for identifying the awakening keyword;
a second obtaining unit, configured to obtain a first recognition result output by the offline awakening model after the target audio signal is acquired, wherein the first recognition result is used for indicating whether the target audio signal carries the awakening keyword;
a processing unit, configured to determine the target audio signal as the candidate audio signal after the target audio signal is acquired, in a case where the first recognition result indicates that the target audio signal carries the awakening keyword;
before inputting the target audio signal into the offline wakeup model, whether the target device is in a non-voice interaction state is judged, and then the target audio signal is input into the offline wakeup model of the target device under the condition that the target device is in the non-voice interaction state.
8. A computer-readable storage medium, comprising a stored program, wherein the program is operable to perform the method of any one of claims 1 to 6.
9. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 6 by means of the computer program.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635662.8A CN112634897B (en) 2020-12-31 2020-12-31 Equipment awakening method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635662.8A CN112634897B (en) 2020-12-31 2020-12-31 Equipment awakening method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN112634897A CN112634897A (en) 2021-04-09
CN112634897B CN112634897B (en) 2022-10-28

Family

ID=75290428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635662.8A Active CN112634897B (en) 2020-12-31 2020-12-31 Equipment awakening method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN112634897B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482814A (en) * 2021-06-16 2022-12-16 青岛海尔洗衣机有限公司 Control method and device of household appliance and equipment
CN113628622A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Voice interaction method and device, electronic equipment and storage medium
CN114743542A (en) * 2022-04-29 2022-07-12 青岛海尔科技有限公司 Voice processing method and device, storage medium and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704275A (en) * 2017-09-04 2018-02-16 百度在线网络技术(北京)有限公司 Smart machine awakening method, device, server and smart machine
CN108564951A (en) * 2018-03-02 2018-09-21 北京云知声信息技术有限公司 The method that intelligence reduces voice control device false wake-up probability
CN111862965A (en) * 2019-04-28 2020-10-30 阿里巴巴集团控股有限公司 Awakening processing method and device, intelligent sound box and electronic equipment

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7725319B2 (en) * 2003-07-07 2010-05-25 Dialogic Corporation Phoneme lattice construction and its application to speech recognition and keyword spotting
CN103943110A (en) * 2013-01-21 2014-07-23 联想(北京)有限公司 Control method, device and electronic equipment
CN116364076A (en) * 2017-07-04 2023-06-30 阿里巴巴集团控股有限公司 Processing method, control method, identification method and device thereof, and electronic equipment
CN108564941B (en) * 2018-03-22 2020-06-02 腾讯科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium
CN108847219B (en) * 2018-05-25 2020-12-25 台州智奥通信设备有限公司 Awakening word preset confidence threshold adjusting method and system
CN109545207A (en) * 2018-11-16 2019-03-29 广东小天才科技有限公司 A kind of voice awakening method and device
CN111261151B (en) * 2018-12-03 2022-12-27 中移(杭州)信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN109618417B (en) * 2018-12-07 2021-08-06 歌尔科技有限公司 Interactive implementation method, system, accessory and storage medium
CN109859774B (en) * 2019-01-02 2021-04-02 珠海格力电器股份有限公司 Voice equipment and method and device for adjusting endpoint detection sensitivity thereof and storage medium
CN110727821A (en) * 2019-10-12 2020-01-24 深圳海翼智新科技有限公司 Method, apparatus, system and computer storage medium for preventing device from being awoken by mistake
CN110992953A (en) * 2019-12-16 2020-04-10 苏州思必驰信息科技有限公司 Voice data processing method, device, system and storage medium
CN111933116B (en) * 2020-06-22 2023-02-14 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111833902A (en) * 2020-07-07 2020-10-27 Oppo广东移动通信有限公司 Awakening model training method, awakening word recognition device and electronic equipment

Also Published As

Publication number Publication date
CN112634897A (en) 2021-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant