CN110890093B - Intelligent equipment awakening method and device based on artificial intelligence

Intelligent equipment awakening method and device based on artificial intelligence

Info

Publication number
CN110890093B
CN110890093B (application CN201911158856.0A)
Authority
CN
China
Prior art keywords
wake
audio data
acoustic
word
identified
Prior art date
Legal status
Active
Application number
CN201911158856.0A
Other languages
Chinese (zh)
Other versions
CN110890093A (en)
Inventor
陈杰
苏丹
金明杰
朱振岭
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911158856.0A
Publication of CN110890093A
Application granted
Publication of CN110890093B
Status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Abstract

The embodiments of the application disclose a method and an apparatus for waking up an intelligent device. After the intelligent device collects audio data to be identified, the acoustic features determined from that audio data are stored while the device verifies whether the audio data contains the wake-up word corresponding to the device. When this primary verification determines that the audio data contains the wake-up word, the device is not woken up immediately; instead a secondary verification is performed first: a pending feature sequence is determined from the stored acoustic features, and whether it meets the wake-up condition is judged according to the acoustic feature sequence of the wake-up word. Only after the pending feature sequence is determined to meet the wake-up condition has the audio data passed both primary and secondary verification, and the intelligent device is then woken up. Because the secondary verification reuses the acoustic features determined during primary verification, extracting the pending feature sequence from them, the false wake-up frequency of the intelligent device is effectively reduced.

Description

Intelligent equipment awakening method and device based on artificial intelligence
Technical Field
The application relates to the field of data processing, in particular to an intelligent device awakening method and device based on artificial intelligence.
Background
Smart devices are increasingly popular and widely used in people's work and daily life.
Some intelligent devices stay in a dormant state while not providing service; when a user needs to use such a device, the user can speak a wake-up word to wake it up. For example, a user can wake a dormant smart speaker with the wake-up word.
However, the related art has a high false wake-up rate: noise or speech that is not the wake-up word is erroneously recognized as the wake-up word, and the intelligent device is woken up by mistake. The device thus starts up unexpectedly when the user does not need it, giving the user a poor experience.
Disclosure of Invention
In order to solve the above technical problem, the application provides an intelligent device wake-up method that reuses the acoustic features determined during primary verification and extracts a pending feature sequence from those features for secondary verification, which effectively reduces the false wake-up frequency of the intelligent device and improves the user experience.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides an intelligent device wake-up method, where the method includes:
in the process of verifying whether audio data to be identified contains a wake-up word corresponding to the intelligent device, storing acoustic features determined according to the audio data to be identified, wherein the acoustic features identify the acoustic characteristics of the audio data to be identified;
if it is determined, at a target audio frame in the audio data to be identified, that the audio data to be identified contains the wake-up word, determining a pending feature sequence from the stored acoustic features, wherein the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be identified, and the plurality of consecutive audio frames includes the target audio frame;
determining whether the pending feature sequence meets a wake-up condition according to the acoustic feature sequence of the wake-up word;
and if so, waking up the intelligent device.
In a second aspect, an embodiment of the present application provides an intelligent device wake-up apparatus, where the apparatus includes a first determining unit, a second determining unit, a third determining unit, and a wake-up unit:
the first determining unit is configured to store, in the process of verifying whether the audio data to be identified contains the wake-up word corresponding to the intelligent device, the acoustic features determined according to the audio data to be identified, the acoustic features identifying the acoustic characteristics of the audio data to be identified;
The second determining unit is configured to determine, if it is determined that the audio data to be identified includes the wake-up word according to a target audio frame in the audio data to be identified, a pending feature sequence from the stored acoustic features, where the pending feature sequence includes acoustic features of a plurality of continuous audio frames in the audio data to be identified, and the plurality of continuous audio frames includes the target audio frame;
the third determining unit is configured to determine, according to the acoustic feature sequence of the wake-up word, whether the pending feature sequence meets a wake-up condition;
and the wake-up unit is configured to wake up the intelligent device if the pending feature sequence meets the wake-up condition.
In a third aspect, an embodiment of the present application provides a wake word updating method of an intelligent device, where the method includes:
acquiring text characteristics of wake-up words to be updated, which are sent by intelligent equipment;
generating the audio data of the wake-up word to be updated according to the text characteristics;
determining an acoustic feature sequence of the wake-up word to be updated according to the audio data, wherein the acoustic feature sequence is used by the intelligent device for secondary verification in the process of verifying whether audio data to be identified contains the wake-up word corresponding to the intelligent device, so as to determine whether a pending feature sequence of the audio data to be identified meets a wake-up condition; the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be identified, the plurality of consecutive audio frames includes the target audio frame at which the audio data to be identified is determined to contain the wake-up word, and the acoustic features identify the acoustic characteristics of the audio data to be identified;
And returning the acoustic feature sequence to the intelligent device.
In a fourth aspect, an embodiment of the present application provides a wake-up word updating apparatus of an intelligent device, where the apparatus includes an obtaining unit, a generating unit, a determining unit, and a returning unit:
the acquisition unit is used for acquiring text characteristics of wake-up words to be updated, which are sent by the intelligent equipment;
the generating unit is used for generating the audio data of the wake-up word to be updated according to the text characteristics;
the determining unit is configured to determine the acoustic feature sequence of the wake-up word to be updated according to the audio data, wherein the acoustic feature sequence is used by the intelligent device for secondary verification in the process of verifying whether audio data to be identified contains the wake-up word corresponding to the intelligent device, so as to determine whether a pending feature sequence of the audio data to be identified meets a wake-up condition; the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be identified, the plurality of consecutive audio frames includes the target audio frame at which the audio data to be identified is determined to contain the wake-up word, and the acoustic features identify the acoustic characteristics of the audio data to be identified;
And the return unit is used for returning the acoustic characteristic sequence to the intelligent equipment.
In a fifth aspect, an embodiment of the present application provides a device for waking up an intelligent device, where the device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method for waking up a smart device according to the first aspect according to instructions in the program code.
In a sixth aspect, an embodiment of the present application provides a device for wake word update for a smart device, where the device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for updating wake-up words of the smart device according to the third aspect according to the instructions in the program code.
In a seventh aspect, an embodiment of the present application provides a computer readable storage medium, where the computer readable storage medium is configured to store program code, where the program code is configured to perform a method for waking up a smart device according to the first aspect or a method for updating a wake-up word of a smart device according to the third aspect.
According to the technical scheme, in the process of verifying whether the audio data to be identified contains the wake-up word corresponding to the intelligent device, the acoustic features determined according to the audio data to be identified are stored; these features identify the acoustic characteristics of the audio data to be identified. If the audio data is determined to contain the wake-up word when a target audio frame is verified, a pending feature sequence comprising the acoustic features of consecutive audio frames, including the target audio frame, is determined from the stored features. Because the wake-up word was detected at the target audio frame, the pending feature sequence should, if the wake-up word is actually present, embody the acoustic characteristics of the wake-up word. On this basis, the secondary verification can determine, according to the actual acoustic feature sequence of the wake-up word, whether the pending feature sequence meets the wake-up condition; only when it does is the audio data considered to actually contain the wake-up word, and only then is the intelligent device woken up. The secondary verification thus effectively reduces the false wake-up frequency of the intelligent device and improves the user experience.
Drawings
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is an application scenario schematic diagram of an intelligent device wake-up method provided in an embodiment of the present application;
fig. 2 is a flowchart of an intelligent device wake-up method provided in an embodiment of the present application;
fig. 3 is a schematic diagram of a decoding network according to an embodiment of the present application;
fig. 4 is a flowchart of a wake-up word updating method of an intelligent device according to an embodiment of the present application;
fig. 5 is a flowchart of an intelligent device wake-up method in an application scenario provided in an embodiment of the present application;
fig. 6 is a schematic diagram of an intelligent device wake-up method in an application scenario provided in an embodiment of the present application;
fig. 7 is a flowchart of a wake-up word updating method of an intelligent device in an application scenario provided in an embodiment of the present application;
fig. 8 is a schematic diagram of a wake-up update method of an intelligent device in an application scenario provided in an embodiment of the present application;
fig. 9a is a block diagram of an intelligent device wake-up apparatus according to an embodiment of the present application;
fig. 9b is a block diagram of an intelligent device wake-up apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of a wake-up word updating device of an intelligent device according to an embodiment of the present application;
fig. 11 is a block diagram of a device for waking up an intelligent device according to an embodiment of the present application;
fig. 12 is a block diagram of a server according to an embodiment of the present application;
fig. 13 is a block diagram of a device for wake-up word update of an intelligent device according to an embodiment of the present application;
fig. 14 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Existing intelligent device wake-up technology often adopts a single-model custom wake-up approach, that is, only one audio recognition method is used. For example, audio recognition is performed only with a Keyword/Filler Hidden Markov Model (HMM) or only with a Long Short-Term Memory feature extraction system (LSTM Feature Extractor System). Any single audio recognition technology, used alone to wake an intelligent device, has a certain false wake-up rate, and that rate is relatively high.
For example, suppose the wake-up word of the intelligent device is "open sound box" (turn on the speaker). Because the intelligent device can collect audio data at any time, if the user chats with someone and says "open music" (turn on the music), the device collects the audio data "open music" and checks whether it is the wake-up word. In the original language the two phrases share their opening syllables, so when the intelligent device recognizes that shared portion it may determine that the collected audio data contains the wake-up word and enter the wake-up state. In fact, "open music" is not the actual wake-up word "open sound box", so the intelligent device is woken up by mistake and starts up unexpectedly when the user does not need it.
In order to solve the technical problems, the embodiment of the application provides an intelligent device awakening method, which can perform voice recognition by adopting a multi-level verification mode, and realize the complementary advantages among different audio recognition technologies by using different audio recognition technologies in different levels, so that the false awakening rate when the intelligent device is awakened through audio recognition is reduced, and the user experience is improved.
It should be emphasized that the intelligent device wake-up method provided in the embodiments of the present application is implemented based on artificial intelligence. Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing and machine learning/deep learning.
In the embodiments of the present application, the mainly related artificial intelligence software technology includes the above-mentioned speech processing technology, machine learning, and other directions.
For example, automatic speech recognition (ASR) techniques in speech technology may be involved, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training and the like.
For example, machine learning (ML) may be involved. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning usually includes deep learning techniques, and deep learning includes artificial neural networks such as convolutional neural networks (CNN), recurrent neural networks (RNN) and deep neural networks (DNN).
It can be understood that the method is applicable to any intelligent device with a voice wake-up function, for example an intelligent terminal, a smart home device (such as a smart speaker or a smart washing machine) or a smart wearable device (such as a smart watch).
The intelligent device can be equipped with automatic speech recognition from speech technology so that it can hear and perceive. Making devices able to listen, see and feel is a development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
In the embodiments of the present application, the intelligent device applies speech technology to extract acoustic features from the collected audio data to be identified and to determine a pending feature sequence from the extracted features, so that the similarity between the pending feature sequence and the acoustic feature sequence of the wake-up word can be determined; an acoustic model trained through machine learning techniques is used to determine the acoustic features from the audio data collected by the intelligent device.
Meanwhile, in the embodiments of the present application, a server applies speech technology to obtain the text features of the wake-up word to be updated sent by the intelligent device and to generate the audio data of the wake-up word to be updated from those text features, so as to determine the acoustic feature sequence of the wake-up word to be updated from the audio data; an acoustic model trained through machine learning techniques determines this acoustic feature sequence from the audio data generated by the server.
In order to facilitate understanding of the technical scheme of the application, the following describes the method for waking up the intelligent device provided by the embodiment of the application in combination with an actual application scene.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario of the intelligent device wake-up method provided in an embodiment of the present application. The scenario includes the intelligent device 101, which can collect audio data to be identified from its surroundings. Because the audio data to be identified is simply whatever audio the intelligent device 101 can collect, it may include environmental noise, audio related to the wake-up word, audio unrelated to the wake-up word and so on.
After acquiring the audio data to be identified, the intelligent device 101 performs primary verification on it, that is, verifies whether the audio data contains the wake-up word corresponding to the intelligent device 101. The main module responsible for primary verification is always on while the device is powered on, continuously collecting and verifying the surrounding audio. During primary verification, the intelligent device 101 stores the acoustic features determined from the audio data to be identified; these features identify the acoustic characteristics of that audio data. They embody the phoneme composition of the audio data to be identified, and thus its pronunciation.
During primary verification, the intelligent device 101 processes the audio data to be identified frame by frame. If, when a certain frame (for example, the target audio frame) is verified, the audio data is determined to contain the wake-up word, the audio data passes primary verification. At this point the intelligent device 101 does not enter the wake-up state directly but continues with secondary verification. The auxiliary module responsible for secondary verification is normally off; it is switched on to assist verification only when the audio data passes primary verification and the main module is about to wake the intelligent device 101, so as to prevent the main module from waking the device by mistake.
Secondary verification checks whether the pending feature sequence meets the wake-up condition of the intelligent device. To perform it, the intelligent device 101 determines, from the stored acoustic features, a pending feature sequence comprising the acoustic features of consecutive audio frames, where the consecutive frames include the target audio frame. Because the wake-up word was detected when the target audio frame was verified, the pending feature sequence should, if the wake-up word is actually present, embody the acoustic characteristics of the wake-up word. Since the acoustic features embody the phoneme composition of the audio data to be identified, the pending feature sequence embodies the phoneme composition, and hence the pronunciation, of a certain segment of that audio data. The acoustic feature sequence of the wake-up word, composed of the acoustic features corresponding to the wake-up word, likewise embodies the wake-up word's phoneme composition and pronunciation. Because both sequences represent the phoneme composition and pronunciation of their corresponding audio, the intelligent device 101 can determine, according to the acoustic feature sequence of the wake-up word, whether the pending feature sequence meets the wake-up condition.
When the audio data to be identified also passes secondary verification, it has passed both primary and secondary verification, and the intelligent device 101 enters the wake-up state.
Because the intelligent device 101 stores, during primary verification, acoustic features embodying the phonemes of the audio data to be identified, and compares the pending feature sequence determined from those features with the acoustic feature sequence of the wake-up word during secondary verification, the device is woken up only after the audio data passes both verifications. This reduces the false wake-up rate of the intelligent device and improves the user experience.
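The two-stage flow just described can be summarized in a short sketch. This is an illustrative outline only, with stubbed-out modules and an assumed wake-word length; it is not the patent's implementation.
```python
from collections import deque

WAKE_FRAMES = 60                                   # assumed wake-word length in frames
cache = deque(maxlen=2 * WAKE_FRAMES)              # stored hidden-layer acoustic features

def primary_passes(feature):                       # confidence decision module (stub)
    ...

def second_stage_ok(pending_seq):                  # comparison with the wake-word sequence (stub)
    ...

def on_frame(feature):
    cache.append(feature)                          # the always-on main module keeps caching
    if primary_passes(feature):                    # the target audio frame was just verified
        pending_seq = list(cache)[-WAKE_FRAMES:]   # pending feature sequence
        if second_stage_ok(pending_seq):           # auxiliary module, switched on only now
            return "wake"                          # passed both verifications
    return "sleep"
```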
Next, an intelligent device wake-up method provided in the embodiments of the present application will be described with reference to the accompanying drawings.
Referring to fig. 2, the flowchart of a wake-up method for an intelligent device provided in an embodiment of the present application includes the following steps:
S201: in the process of verifying whether the audio data to be identified contains the wake-up word corresponding to the intelligent device, storing the acoustic features determined according to the audio data to be identified.
After the intelligent device obtains the audio data to be identified from the outside, it verifies whether the audio data contains the wake-up word corresponding to the intelligent device; this verification is the primary verification. The audio data to be identified may include audio data related to the wake-up word, audio data unrelated to the wake-up word, environmental noise data and so on.
In the process of primary verification, the intelligent device determines acoustic features according to the audio data to be identified. The acoustic feature may be any feature that characterizes sound. The acoustic features can represent the phonemic composition of the audio data to be identified, i.e. represent the pronunciation of the audio data.
It should be noted that, in this embodiment, primary verification may be implemented by two parts: an acoustic model and a confidence decision module. The acoustic model may be any type of model.
In one possible implementation, the acoustic features may be determined by an acoustic model whose basic structure comprises an input layer, hidden layers and an output layer. In one possible implementation, the acoustic features are the output features of a hidden layer of the acoustic model. During primary verification, after acquiring the audio data to be identified, the intelligent device feeds it into the input layer of the acoustic model for computation and stores the acoustic features output by a hidden layer. The output features of any hidden layer of an acoustic model (hidden-layer features for short) can serve as the acoustic features of this application; in general, the closer a hidden layer is to the output layer, the better its features characterize the acoustics.
In addition, the hidden-layer acoustic features are robust: they are not easily affected by the individual pronunciation of the sound source, environmental noise and the like, and correspond closely to the acoustic characteristics of the audio data to be identified.
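As a concrete illustration of hidden-layer outputs serving as acoustic features, the following is a minimal PyTorch sketch; the network shape, dimensions and layer count are assumptions, not the patent's actual model.
```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_fbank=40, hidden=128, n_units=100):
        super().__init__()
        self.lstm = nn.LSTM(n_fbank, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_units)   # posteriors over pronunciation units

    def forward(self, frames):                  # frames: (batch, T, n_fbank)
        hidden_feats, _ = self.lstm(frames)     # (batch, T, hidden): hidden-layer features
        posteriors = self.out(hidden_feats).softmax(dim=-1)
        return hidden_feats, posteriors         # hidden_feats are the stored acoustic features

model = AcousticModel()
fbank = torch.randn(1, 50, 40)                  # 50 frames of 40-dim FBANK features
feats, post = model(fbank)
```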
It can be appreciated that the length of the acoustic features stored during primary verification may be equal to or greater than the length of the wake-up word's acoustic feature sequence, so that secondary verification can extract from them a pending feature sequence long enough to cover that sequence. When storing acoustic features, the stored length is fixed and follows a first-in-first-out principle: once the stored features reach the preset length, saving a new feature removes the earliest stored one. In this way, when the target audio frame is verified during primary verification, the stored acoustic features are guaranteed to cover the audio frames immediately preceding the target audio frame.
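The fixed-length, first-in-first-out feature store can be sketched with a bounded deque; the lengths below are assumptions.
```python
from collections import deque

WAKE_WORD_FRAMES = 60                                # assumed wake-word length in frames
feature_cache = deque(maxlen=2 * WAKE_WORD_FRAMES)   # preset, fixed storage length

def save_feature(frame_feature):
    # Once the cache is full, appending evicts the earliest stored feature,
    # which is exactly the first-in-first-out behavior described above.
    feature_cache.append(frame_feature)
```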
The confidence decision module receives the acoustic features of the audio data to be identified output by the acoustic model and judges whether the audio data contains the wake-up word. The module verifies the audio frames of the audio data one by one, obtaining for each frame a confidence that the audio data contains the wake-up word, and the confidences of multiple audio frames can be accumulated during verification. When, at a certain audio frame such as the target audio frame, the confidence that the audio data to be identified contains the wake-up word reaches the module's first threshold, the module determines that the wake-up word is present and the audio data passes primary verification.
For example, suppose the preset first threshold of the confidence decision module is 0.95 and the wake-up word is "open sound box". The intelligent device collects the audio data to be identified "open music"; when the last audio frame corresponding to the shared "sound" syllable is verified, the confidence obtained by the module is 0.98. Because 0.98 is greater than 0.95, the confidence that the audio data contains the wake-up word reaches the first threshold, so the module determines that the wake-up word is present and the audio data passes primary verification. The last audio frame of that shared syllable is the target audio frame.
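A minimal sketch of the frame-by-frame decision, using the 0.95 threshold from the example above; how confidences are accumulated across frames is not fixed by the text, so this sketch simply thresholds the per-frame (possibly already accumulated) score.
```python
FIRST_THRESHOLD = 0.95      # the first threshold from the example above

def primary_check(frame_confidences):
    """Scan per-frame wake-word confidences and return the index of the
    target audio frame at which primary verification fires, or None."""
    for i, conf in enumerate(frame_confidences):
        if conf >= FIRST_THRESHOLD:
            return i        # this frame is the target audio frame
    return None
```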
In one implementation, the confidence decision module may be a decoding network, so that whether the audio data to be identified contains the wake-up word is determined by decoding. The decoding network may be an HMM decoding network: the acoustic features output by the acoustic model cover all possible pronunciation units (syllables or phonemes may be chosen), and each pronunciation unit corresponds to an HMM state. As shown in fig. 3, the HMM decoding network is composed of a Keyword HMM and a Filler HMM. The Keyword HMM is formed by concatenating the HMM states corresponding to all pronunciation units that make up the wake-up word of the intelligent device, and the Filler HMM is composed of the HMM states corresponding to a set of carefully designed non-wake-up-word pronunciation units. While verifying whether the audio data contains the wake-up word, the acoustic features are fed into the decoding network with a fixed window size, and the optimal decoding path is searched with the Viterbi algorithm. The confidence decision module can then judge whether the audio data contains the wake-up word according to whether the optimal path passes through the Keyword HMM. It can be appreciated that the module may also decide using more complex strategies, such as computing a confidence score.
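For illustration, a compact Viterbi search over such a state space might look as follows; the state layout (keyword states first, filler states after) and all probabilities are assumptions, not the patent's actual topology.
```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """log_emis: (T, S) per-frame state log-likelihoods from the acoustic model;
    log_trans: (S, S) transition log-probabilities; log_init: (S,).
    Returns the optimal state path."""
    T, S = log_emis.shape
    delta = log_init + log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # score of every prev -> cur transition
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

def primary_decision(path, n_keyword_states):
    # The decision described above: does the optimal path traverse the
    # Keyword HMM (states 0 .. n_keyword_states-1 in this assumed layout)?
    return any(s < n_keyword_states for s in path)
```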
S202: if the audio data to be identified is determined to contain wake-up words through the target audio frame in the audio data to be identified, determining a pending feature sequence from the stored acoustic features.
After the intelligent device determines that the audio data to be identified contains the wake-up word, the audio data passes primary verification and enters secondary verification. The intelligent device determines a pending feature sequence from the acoustic features stored in step S201, where the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be identified, and those consecutive frames include the target audio frame.
Because the acoustic features stored during primary verification embody the phoneme composition of the audio data to be identified, the pending feature sequence determined from them embodies the phoneme composition of a certain segment of that audio data.
It can be appreciated that, so that the acoustic features of the audio data to be identified can cover the acoustic features of the wake-up word and thereby improve the accuracy of secondary verification, the number of acoustic features in the pending feature sequence is determined by the length of the wake-up word and may be equal to or greater than that length. Because the stored acoustic features are guaranteed to contain the target audio frame and the stretch of audio frames immediately before it, the pending feature sequence is guaranteed to contain the acoustic features of the audio segment that primary verification judged to contain the wake-up word.
The pending feature sequence is determined in different ways depending on the length of the stored acoustic features. For example, when the stored length equals the length of the wake-up word's acoustic feature sequence, the stored features can be used directly as the pending feature sequence, in which case the number of acoustic features in the sequence equals the length of the wake-up word; when the stored length is greater, a pending feature sequence is selected from the stored features, and its number of acoustic features may be equal to or greater than the length of the wake-up word.
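Both cases reduce to taking the stored features, or their trailing window ending at the target audio frame, as the pending feature sequence, as in this sketch:
```python
import numpy as np

def pending_sequence(feature_cache, wake_word_frames):
    """Extract the pending feature sequence ending at the target audio frame."""
    feats = list(feature_cache)
    if len(feats) == wake_word_frames:
        return np.stack(feats)                  # stored length == wake-word length: use directly
    return np.stack(feats[-wake_word_frames:])  # stored length greater: take the trailing window
```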
S203: determining whether the pending feature sequence meets the wake-up condition according to the acoustic feature sequence of the wake-up word.
After the intelligent device determines the pending feature sequence from the stored acoustic features, whether the sequence meets the wake-up condition can be judged against the acoustic feature sequence of the wake-up word preset in the device: the pending feature sequence embodies the phoneme composition of the segment of audio data that primary verification judged to contain the wake-up word, the acoustic feature sequence of the wake-up word embodies the phoneme composition of the wake-up word's audio, and the two sequences therefore represent audio data in the same form. In one possible implementation, the intelligent device may determine the degree of similarity between the acoustic feature sequence of the wake-up word and the pending feature sequence, and decide from that similarity whether the wake-up condition is met.
Determining the degree of similarity between the acoustic feature sequence of the wake-up word and the pending feature sequence may consist of computing the cosine similarity between the two sequences and comparing it with a preset second threshold to decide whether the wake-up condition is met. When the cosine similarity reaches the threshold, the two sequences are highly similar; because the acoustic feature sequence of the wake-up word represents the phoneme composition of the wake-up word's audio and the pending feature sequence represents the phoneme composition of a segment of the audio data to be identified, high similarity between the sequences means the two phoneme compositions are close, indicating with high probability that the segment contains the wake-up word.
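A minimal sketch of the cosine-similarity check; the threshold value and the flattening of the sequences into vectors are assumptions, since the text presets a second threshold without fixing its value or the exact comparison granularity.
```python
import numpy as np

SECOND_THRESHOLD = 0.8     # illustrative; the text does not fix a value

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()    # compare the sequences as flat vectors (an assumption)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def meets_wake_condition(pending_seq, wake_word_seq):
    return cosine_similarity(pending_seq, wake_word_seq) >= SECOND_THRESHOLD
```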
It should be noted that, because the acoustic features are robust and not easily affected by the individual pronunciation of the sound source, environmental noise and the like, the similarity between the acoustic feature sequence of the wake-up word and the pending feature sequence can be determined accurately for different speakers and different environments, so that whether the audio data to be identified should wake the intelligent device can be judged accurately.
S204: if yes, waking up the intelligent device.
After the intelligent device determines, according to the acoustic feature sequence of the wake-up word, that the pending feature sequence meets the wake-up condition, the audio data to be identified passes secondary verification. Having passed both primary and secondary verification, the intelligent device switches from the dormant state to the working state. It can be understood that when the intelligent device determines that the pending feature sequence does not meet the wake-up condition, primary verification made a misjudgment: the audio data to be identified does not actually contain the wake-up word, and the intelligent device remains dormant.
According to the technical scheme, in the process of verifying whether the audio data to be identified contains the wake-up word corresponding to the intelligent device, the acoustic features determined according to the audio data to be identified are stored; these features identify the acoustic characteristics of the audio data to be identified. If the audio data is determined to contain the wake-up word when a target audio frame is verified, a pending feature sequence comprising the acoustic features of consecutive audio frames, including the target audio frame, is determined from the stored features. Because the wake-up word was detected at the target audio frame, the pending feature sequence should, if the wake-up word is actually present, embody the acoustic characteristics of the wake-up word. On this basis, the secondary verification can determine, according to the actual acoustic feature sequence of the wake-up word, whether the pending feature sequence meets the wake-up condition; only when it does is the audio data considered to actually contain the wake-up word, and only then is the intelligent device woken up. The secondary verification thus effectively reduces the false wake-up frequency of the intelligent device and improves the user experience.
In some cases, the wake-up word may be updated according to user requirements. It can be understood that the updating process can be performed online or locally. When updating online, the cloud server can be used only for the conversion between text features and audio data, with the acoustic feature sequence of the wake-up word to be updated still determined locally; alternatively, both the conversion between text features and audio data and the determination of the acoustic feature sequence can be performed in the cloud server, which then sends the resulting sequence directly to the intelligent device.
The first way to update wake words: wake word updates are performed locally only.
When the wake-up word update is performed only locally: because the acoustic feature sequence of the wake-up word reflects the phoneme composition and pronunciation of the wake-up word's audio and is unrelated to other influencing factors, and because secondary verification compares the acoustic feature sequence of the wake-up word with the pending feature sequence, the intelligent device only needs to acquire the text features of the wake-up word to be updated. From those text features it generates the audio data of the wake-up word to be updated, and from that audio data it determines the acoustic feature sequence of the wake-up word to be updated.
For example, a text-to-speech (TTS) conversion technique may be built into the intelligent device. After the user inputs the text features of the wake-up word to be updated, the device converts them into audio data of the wake-up word through its text-to-speech module, determines the acoustic feature sequence of the wake-up word from that audio data through the acoustic model, and stores the sequence in the secondary verification module for subsequent recognition. It can be understood that TTS may generate multiple audio samples; the device then determines, through the acoustic model, a primary acoustic feature sequence for each sample and derives the acoustic feature sequence of the wake-up word to be updated from these primary sequences.
The second way to update wake words: updates are combined with local and networking.
When the wake-up word update requires networking, in one possible implementation only the conversion between text features and audio data is performed in the cloud, while the acoustic feature sequence is still determined locally. For example, a text-to-speech server (TTS Server) may be used. After the user determines the wake-up word to be updated and enters its text features, the intelligent device sends them over the network to the cloud TTS Server, which generates the audio data of the wake-up word to be updated from the text features. The TTS Server transmits the audio data back to the device's primary verification module, where the acoustic feature sequence of the wake-up word to be updated is determined through the acoustic model and stored for subsequent recognition. It can be understood that the TTS Server may likewise generate multiple audio samples, from which primary acoustic feature sequences are determined through the acoustic model and combined into the acoustic feature sequence of the wake-up word to be updated.
A third way of updating wake words: and only carrying out wake-up word updating at the cloud.
In one possible implementation, when the wake-up word must be updated through networking, the cloud server can directly determine the acoustic feature sequence of the wake-up word to be updated from the text features and return it to the intelligent device. Referring to fig. 4, the flowchart of a wake-up word updating method of an intelligent device according to an embodiment of the present application includes the following steps:
S401: acquiring the text features of the wake-up word to be updated sent by the intelligent device.
When the wake-up word is required to be updated, the audio generation server acquires text features of the wake-up word to be updated, which are sent by the intelligent equipment.
S402: and generating audio data of the wake-up word to be updated according to the text characteristics.
After the audio generation server acquires the text characteristics, generating audio data of the wake-up word to be updated according to the text characteristics.
S403: and determining an acoustic feature sequence of the wake-up word to be updated according to the audio data.
After the audio generation server generates the audio data of the wake-up word to be updated, it determines the acoustic feature sequence of the wake-up word to be updated from that audio data. The acoustic feature sequence is used by the intelligent device for secondary verification, in the process of verifying whether audio data to be identified contains the wake-up word corresponding to the device, to determine whether the pending feature sequence of the audio data meets the wake-up condition. The pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be identified, those consecutive frames include the target audio frame at which the audio data was determined to contain the wake-up word, and the acoustic features identify the acoustic characteristics of the audio data to be identified. It can be appreciated that, as in the first two cases, the acoustic feature sequence of the wake-up word to be updated may also be determined from the audio data through an acoustic model.
In addition, the audio generation server may generate multiple audio samples of the wake-up word to be updated from the text; primary acoustic feature sequences corresponding to the samples are determined through the acoustic model, and the acoustic feature sequence of the wake-up word to be updated is derived from these primary sequences.
S404: and returning the acoustic feature sequence to the intelligent device.
After the audio generation server generates the acoustic feature sequence of the wake-up word to be updated, it returns the sequence to the intelligent device, so that the wake-up word in the intelligent device is updated.
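For illustration, steps S401 to S404 can be outlined as one function; `synthesize` and `acoustic_model` are hypothetical stand-ins (stubbed here) for the server's TTS engine and acoustic model, and the sample count is an assumption.
```python
import numpy as np

def synthesize(text_features):        # hypothetical TTS engine (stub)
    ...

def acoustic_model(audio):            # hypothetical model: audio -> (M, D) feature sequence (stub)
    ...

def update_wake_word(text_features, n_samples=5):
    audios = [synthesize(text_features) for _ in range(n_samples)]  # S402: generate audio data
    primary_seqs = [acoustic_model(a) for a in audios]              # S403: primary feature sequences
    return np.stack(primary_seqs).mean(axis=0)                      # combined sequence, returned in S404
```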
It can be understood that although networking is required at some stages of generating the acoustic feature sequence of the wake-up word to be updated, once generated the sequence is stored in the intelligent device, so the device can operate normally during wake-up without a network connection.
It should be noted that, to ensure that the acoustic feature sequence and the pending feature sequence are produced in the same way when their similarity is computed, and thus to ensure the accuracy of the similarity computation, if the acoustic features of the audio data to be identified are determined through an acoustic model, then the acoustic feature sequence of the wake-up word to be updated should also be determined from its audio data through the same acoustic model.
It can be understood that, when the confidence decision module uses an HMM decoding network for primary verification, the decoding network must be updated along with the wake-up word: the Keyword HMM and Filler HMM in the network are rebuilt, the Keyword HMM being formed by concatenating the HMM states corresponding to all pronunciation units making up the wake-up word to be updated, and the Filler HMM being composed of the HMM states corresponding to a set of carefully designed non-wake-up-word pronunciation units, so that whether the audio data to be identified contains the new wake-up word can be determined correctly.
Next, the intelligent device wake-up method provided in the embodiments of the present application is described with reference to an actual application scenario. In this scenario the intelligent device is a smart speaker, the acoustic model used in primary verification is an LSTM model, the verification method used in primary verification is a decoding network, the verification system used in secondary verification is an LSTM KWS System, and the wake-up word is "open sound box". While talking with other people beside the speaker, the user says "open music". The intelligent device wake-up method is shown in fig. 5 and includes the following steps:
S501: and acquiring audio data to be identified, and determining acoustic characteristics according to the audio data to be identified.
As shown in fig. 6, fig. 6 is a model diagram of the audio recognition used in this scenario. The smart speaker collects audio data from its surroundings, determines FBANK features through an FBANK feature computation function, and then feeds the audio data to be identified, represented by its FBANK features, into the LSTM acoustic model, obtaining the acoustic features output by the LSTM hidden layer and the acoustic features output by the output layer.
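The FBANK step can be approximated with log-mel filterbank energies, for example via librosa; the sample rate, frame sizes and filter count below are assumptions.
```python
import numpy as np
import librosa

def fbank_features(wav_path, n_mels=40):
    y, sr = librosa.load(wav_path, sr=16000)         # assumed 16 kHz input
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)   # 25 ms windows, 10 ms hop
    return np.log(mel + 1e-6).T                      # (frames, n_mels), fed to the LSTM
```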
S502: the acoustic features determined from the audio data to be identified are saved.
The smart speaker stores the acoustic features output by the LSTM hidden layer for subsequent secondary verification.
S503: verifying whether the audio data to be identified contains the wake-up word "open sound box".
The smart speaker verifies through the decoding network whether the audio data to be identified contains the wake-up word. Because the collected audio data contains the speech "open music", when it is verified by the decoding network the optimal decoding path of its acoustic features passes through the Keyword HMM path corresponding to the wake-up word "open sound box", so the audio data to be identified is judged to contain the wake-up word.
S504: after determining that the audio data to be identified contains the wake-up word, determining a pending feature sequence from the stored acoustic features.
After the smart speaker verifies through the decoding network that the audio data to be identified contains the wake-up word, the audio data passes primary verification and enters secondary verification. The smart speaker extracts the pending feature sequence from the stored acoustic features output by the LSTM hidden layer, using the LSTM feature extractor, for subsequent verification.
S505: determining whether the pending feature sequence meets the wake-up condition according to the acoustic feature sequence of the wake-up word.
After the undetermined feature sequence is extracted, the intelligent sound box verifies whether the undetermined feature sequence meets the wake-up condition by calculating the cosine similarity between the undetermined feature sequence and the acoustic feature sequence of the wake-up word and comparing the similarity with a preset threshold value.
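For illustration, a minimal sketch of this similarity check; the frame-average pooling and the 0.85 threshold are assumptions, not values given in this embodiment:

```python
# Sketch of the secondary decision: pool each sequence into one vector,
# compute cosine similarity, and wake the device only when the similarity
# reaches the preset threshold.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_secondary_verification(pending_seq, wake_word_seq, threshold=0.85):
    p = pending_seq.mean(axis=0)    # pool frames into one vector
    w = wake_word_seq.mean(axis=0)
    return cosine_similarity(p, w) >= threshold

pending_seq = np.random.randn(120, 128)    # from the extraction step above
wake_word_seq = np.random.randn(120, 128)  # wake-word template
if passes_secondary_verification(pending_seq, wake_word_seq):
    print("secondary verification passed: wake up the device")
```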
S506: if yes, wake up the intelligent sound box.
When the cosine similarity reaches the preset threshold value, the intelligent sound box enters the awakened state. In this scenario, because the user actually said "open music" rather than the wake-up word, the similarity does not reach the threshold, and the false trigger that passed the primary verification is rejected by the secondary verification.
In addition, in this application scenario, when the wake-up word of the intelligent sound box needs to be updated, as shown in fig. 7, the update may be performed through the following steps:
S701: receive the wake-up word text to be updated, which is input by a user.
The intelligent sound box receives the wake-up word text to be updated, which is input by the user; as shown in fig. 8, fig. 8 is a model diagram of wake-up word updating in this application scenario.
S702: generate audio data corresponding to the wake-up word text to be updated through the TTS Server.
After receiving the wake-up word text to be updated, the intelligent sound box uploads it to the TTS Server for audio data generation, and the TTS Server generates N different audio data of the wake-up word to be updated.
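Purely for illustration, the interaction with the TTS Server might look like the following sketch; the endpoint URL, payload fields, and response format are invented placeholders, since this embodiment does not define the TTS Server's API:

```python
# Hypothetical sketch of S702: upload the new wake-word text to a TTS
# service and collect N different audio clips.
import requests

def synthesize_wake_word(text, n_variants=5):
    clips = []
    for i in range(n_variants):
        resp = requests.post(
            "https://tts.example.com/synthesize",  # hypothetical endpoint
            json={"text": text, "variant": i},     # hypothetical payload
        )
        resp.raise_for_status()
        clips.append(resp.content)                 # raw audio bytes
    return clips

audio_clips = synthesize_wake_word("open sound box")
```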
S703: determine an acoustic feature sequence of the wake-up word to be updated according to the audio data.
After the audio data generated by the TTS Server is received, it is converted, through the same function used in the primary verification, into N acoustic feature sequences of the wake-up word to be updated, each M frames long; the N feature sequences are then averaged to obtain the acoustic feature sequence of the wake-up word to be updated, so that the secondary verification can be performed using this sequence.
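A sketch of the averaging step, with N, M, and the feature dimension D as placeholders and random arrays standing in for the real feature sequences:

```python
# Sketch of S703: each of the N synthesized clips is converted into an
# M-frame feature sequence by the same front end as the primary
# verification, and the N sequences are averaged into the wake-word
# template used by the secondary verification.
import numpy as np

N, M, D = 5, 120, 128
sequences = [np.random.randn(M, D) for _ in range(N)]       # N feature sequences
wake_word_template = np.mean(np.stack(sequences), axis=0)   # (M, D) averaged template
```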
S704: update the decoding network according to the acoustic characteristics of the wake-up word to be updated.
After the acoustic characteristics of the wake-up word to be updated are acquired, the Keyword HMM and the Filler HMM in the decoding network are updated accordingly: the Keyword HMM is formed by serially connecting the HMMs corresponding to the pronunciation units that make up the wake-up word to be updated, and the Filler HMM is formed by the HMM states corresponding to a carefully designed group of non-wake-word pronunciation units, so that the primary verification can verify the wake-up word to be updated.
Based on the foregoing embodiment of the intelligent device wake-up method, this embodiment provides an intelligent device wake-up apparatus 900. Referring to fig. 9a, the apparatus 900 includes a first determining unit 901, a second determining unit 902, a third determining unit 903, and a wake-up unit 904:
the first determining unit 901 is configured to store, in the process of verifying whether the audio data to be identified contains a wake-up word corresponding to the intelligent device, acoustic features determined according to the audio data to be identified, where the acoustic features identify the acoustic characteristics of the audio data to be identified;
a second determining unit 902, configured to determine, if it is determined that the audio data to be identified includes a wake-up word by using a target audio frame in the audio data to be identified, a pending feature sequence from the stored acoustic features, where the pending feature sequence includes acoustic features of a plurality of continuous audio frames in the audio data to be identified, and the plurality of continuous audio frames includes the target audio frame;
a third determining unit 903, configured to determine whether the undetermined feature sequence meets a wake-up condition according to the acoustic feature sequence of the wake-up word;
a wake-up unit 904, configured to wake up the intelligent device if the pending feature sequence satisfies a wake-up condition.
In one possible implementation manner, the third determining unit 903 is specifically configured to:
determining the degree of similarity between the acoustic feature sequence of the wake-up word and the undetermined feature sequence;
and determining whether the wake-up condition is met according to the similarity degree.
In one possible implementation, the acoustic features of the audio data to be identified are determined by an acoustic model, where the acoustic features are output features of a hidden layer of the acoustic model.
In one possible implementation, the number of acoustic features in the pending feature sequence is determined based on the length of the wake-up word.
In one possible implementation, referring to fig. 9b, the apparatus 900 further comprises an updating unit 905:
the updating unit 905 is configured to update the wake-up word of the intelligent device, and use the wake-up word to be updated as a wake-up word corresponding to the intelligent device;
the updating unit 905 is specifically configured to:
acquiring text characteristics of wake-up words to be updated;
generating audio data of wake-up words to be updated according to the text characteristics;
and determining the acoustic feature sequence of the wake-up word to be updated according to the audio data.
In one possible implementation, the acoustic characteristics of the audio data to be identified are determined by an acoustic model, and the updating unit 905 is specifically configured to:
and determining the acoustic feature sequence of the wake-up word to be updated through the acoustic model according to the audio data.
In one possible implementation, the audio data of the wake word to be updated includes a plurality of audio data, and the updating unit 905 is specifically configured to:
determining primary acoustic feature sequences corresponding to the plurality of audio data respectively through an acoustic model according to the plurality of audio data;
and determining the acoustic feature sequences of the wake-up words to be updated according to the plurality of primary acoustic feature sequences.
In a possible implementation, the verifying whether the audio data to be identified contains a wake-up word is determined by the decoding network, and the updating unit 905 is further configured to:
updating the decoding network according to the wake-up word to be updated.
Based on the wake-up word updating method of the smart device provided in the foregoing embodiment, this embodiment provides a wake-up word updating apparatus 1000 of the smart device, referring to fig. 10, the apparatus 1000 includes an obtaining unit 1001, a generating unit 1002, a determining unit 1003, and a returning unit 1004:
an obtaining unit 1001, configured to obtain text features of a wake word to be updated sent by an intelligent device;
a generating unit 1002, configured to generate audio data of a wake-up word to be updated according to the text feature;
a determining unit 1003, configured to determine an acoustic feature sequence of the wake-up word to be updated according to the audio data; the acoustic feature sequence is used by the intelligent device for secondary verification in the process of verifying whether the audio data to be identified contains the wake-up word corresponding to the intelligent device, so as to determine whether the undetermined feature sequence of the audio data to be identified meets the wake-up condition; the undetermined feature sequence includes acoustic features of a plurality of continuous audio frames in the audio data to be identified, the plurality of continuous audio frames include the target audio frame used when determining that the audio data to be identified contains the wake-up word, and the acoustic features identify the acoustic characteristics of the audio data to be identified;
A return unit 1004 for returning the acoustic signature sequence to the smart device.
In one possible implementation, the determining unit 1003 is specifically configured to:
according to the audio data, determining an acoustic feature sequence of a wake-up word to be updated through an acoustic model; the acoustic model is the same as the acoustic model used by the intelligent device in verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent device.
In a possible implementation manner, the audio data of the wake-up word to be updated includes a plurality of audio data, and the determining unit 1003 is specifically configured to:
determining primary acoustic feature sequences corresponding to the plurality of audio data respectively through an acoustic model according to the plurality of audio data;
and determining the acoustic feature sequences of the wake-up words to be updated according to the plurality of primary acoustic feature sequences.
The embodiment of the present application further provides a device for waking up an intelligent device, which is described below with reference to the accompanying drawings. Referring to fig. 11, an embodiment of the present application provides a device 1100 for waking up an intelligent device. The device 1100 may also be a terminal device, and the terminal device may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a point of sale (Point of Sales, POS for short) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example of the terminal device:
Fig. 11 is a block diagram showing part of the structure of a mobile phone related to the terminal device provided in the embodiment of the present application. Referring to fig. 11, the mobile phone includes: radio frequency (Radio Frequency, RF) circuitry 1110, a memory 1120, an input unit 1130, a display unit 1140, sensors 1150, audio circuitry 1160, a wireless fidelity (wireless fidelity, WiFi) module 1170, a processor 1180, a power supply 1190, and the like. Those skilled in the art will appreciate that the handset structure shown in fig. 11 does not limit the handset, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 11:
the RF circuit 1110 may be used for receiving and transmitting signals during a message or a call, and in particular, after receiving downlink information of a base station, the downlink information is processed by the processor 1180; in addition, the data of the design uplink is sent to the base station. Generally, RF circuitry 1110 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuitry 1110 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (Global System ofMobile communication, GSM for short), general packet radio service (General Packet Radio Service, GPRS for short), code division multiple access (Code DivisionMultipleAccess, CDMA for short), wideband code division multiple access (Wideband Code DivisionMultipleAccess, WCDMA for short), long term evolution (LongTerm Evolution, LTE for short), email, short message service (ShortMessaging Service, SMS for short), and the like.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes the software programs and modules stored in the memory 1120 to perform various functional applications and data processing of the mobile phone. The memory 1120 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone. In addition, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. In particular, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch panel 1131 or thereabout using any suitable object or accessory such as a finger, stylus, etc.), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch panel 1131 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device and converts it into touch point coordinates, which are then sent to the processor 1180, and can receive commands from the processor 1180 and execute them. In addition, the touch panel 1131 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1130 may include other input devices 1132 in addition to the touch panel 1131. In particular, other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 1140 may be used to display information input by a user or information provided to the user as well as various menus of the mobile phone. The display unit 1140 may include a display panel 1141, and optionally, the display panel 1141 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1131 may overlay the display panel 1141, and when the touch panel 1131 detects a touch operation thereon or thereabout, the touch panel is transferred to the processor 1180 to determine the type of touch event, and then the processor 1180 provides a corresponding visual output on the display panel 1141 according to the type of touch event. Although in fig. 11, the touch panel 1131 and the display panel 1141 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 1131 may be integrated with the display panel 1141 to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the handset are not described in detail herein.
The audio circuit 1160, a speaker 1161, and a microphone 1162 may provide an audio interface between the user and the mobile phone. The audio circuit 1160 may transmit the electrical signal converted from received audio data to the speaker 1161, which converts it into a sound signal for output; on the other hand, the microphone 1162 converts collected sound signals into electrical signals, which are received by the audio circuit 1160 and converted into audio data; the audio data is then processed by the processor 1180 and sent, for example, to another mobile phone via the RF circuit 1110, or output to the memory 1120 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and a mobile phone can help a user to send and receive emails, browse webpages, access streaming media and the like through a WiFi module 1170, so that wireless broadband Internet access is provided for the user. Although fig. 11 shows a WiFi module 1170, it is understood that it does not belong to the necessary constitution of the mobile phone, and can be omitted entirely as required within the scope of not changing the essence of the invention.
The processor 1180 is a control center of the handset, connects various parts of the entire handset using various interfaces and lines, performs various functions of the handset and processes data by running or executing software programs and/or modules stored in the memory 1120, and invoking data stored in the memory 1120. In the alternative, processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1180.
The handset further includes a power supply 1190 (e.g., a battery) for powering the various components, which may be logically connected to the processor 1180 via a power management system so as to provide for the management of charging, discharging, and power consumption by the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In this embodiment, the processor 1180 included in the terminal device further has the following functions:
in the process of verifying whether the audio data to be identified contains the wake-up word corresponding to the intelligent device, storing acoustic features determined according to the audio data to be identified, where the acoustic features identify the acoustic characteristics of the audio data to be identified;
if the audio data to be identified contains the wake-up word through the target audio frame in the audio data to be identified, determining a pending feature sequence from the stored acoustic features, wherein the pending feature sequence comprises acoustic features of a plurality of continuous audio frames in the audio data to be identified, and the plurality of continuous audio frames comprise the target audio frame;
determining whether the undetermined feature sequence meets a wake-up condition according to the acoustic feature sequence of the wake-up word;
And if yes, waking up the intelligent equipment.
The embodiment of the present application further provides a server. As shown in fig. 12, fig. 12 is a block diagram of a server 1200 provided in the embodiment of the present application. The server 1200 may vary considerably in configuration or performance, and may include one or more central processing units (Central Processing Units, CPUs for short) 1222 (e.g., one or more processors), a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) storing application programs 1242 or data 1244. The memory 1232 and the storage medium 1230 may be transitory or persistent storage. The program stored on the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Furthermore, the central processing unit 1222 may be configured to communicate with the storage medium 1230 and execute, on the server 1200, the series of instruction operations in the storage medium 1230.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 12.
The embodiment of the present application further provides a device for updating the wake-up word of an intelligent device, which is described below with reference to the accompanying drawings. Referring to fig. 13, an embodiment of the present application provides a device 1300 for updating the wake-up word of an intelligent device. The device 1300 may also be a terminal device, and the terminal device may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a point of sale (Point of Sales, POS for short) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example of the terminal device:
Fig. 13 is a block diagram showing part of the structure of a mobile phone related to the terminal device provided in the embodiment of the present application. Referring to fig. 13, the mobile phone includes: radio frequency (Radio Frequency, RF) circuitry 1310, a memory 1320, an input unit 1330, a display unit 1340, sensors 1350, audio circuitry 1360, a wireless fidelity (wireless fidelity, WiFi) module 1370, a processor 1380, and a power supply 1390. Those skilled in the art will appreciate that the handset structure shown in fig. 13 does not limit the handset, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 13:
The RF circuit 1310 may be used for receiving and transmitting signals during a message or a call; in particular, after receiving downlink information from a base station, it forwards the information to the processor 1380 for processing, and it sends uplink data to the base station. In general, the RF circuitry 1310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA), a duplexer, and the like. In addition, the RF circuitry 1310 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM for short), General Packet Radio Service (GPRS for short), Code Division Multiple Access (CDMA for short), Wideband Code Division Multiple Access (WCDMA for short), Long Term Evolution (LTE for short), email, Short Messaging Service (SMS for short), and the like.
The memory 1320 may be used to store software programs and modules, and the processor 1380 performs various functional applications and data processing of the mobile phone by executing the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone. In addition, the memory 1320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The input unit 1330 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 1330 may include a touch panel 1331 and other input devices 1332. Touch panel 1331, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on touch panel 1331 or thereabout using any suitable object or accessory such as a finger, stylus, etc.) and actuate the corresponding connection device according to a predetermined program. Alternatively, the touch panel 1331 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 1380, and can receive commands from the processor 1380 and execute them. In addition, the touch panel 1331 may be implemented in various types of resistive, capacitive, infrared, surface acoustic wave, and the like. The input unit 1330 may include other input devices 1332 in addition to the touch panel 1331. In particular, other input devices 1332 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 1340 may be used to display information input by a user or information provided to the user as well as various menus of the mobile phone. The display unit 1340 may include a display panel 1341, and the display panel 1341 may be optionally configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an Organic Light-Emitting Diode (OLED) or the like. Further, the touch panel 1331 may overlay the display panel 1341, and when the touch panel 1331 detects a touch operation thereon or thereabout, the touch panel is transferred to the processor 1380 to determine the type of touch event, and the processor 1380 then provides a corresponding visual output on the display panel 1341 according to the type of touch event. Although in fig. 13, the touch panel 1331 and the display panel 1341 are two independent components for implementing the input and output functions of the mobile phone, in some embodiments, the touch panel 1331 may be integrated with the display panel 1341 to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1350, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 1341 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 1341 and/or the backlight when the phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the handset are not described in detail herein.
The audio circuit 1360, a speaker 1361, and a microphone 1362 may provide an audio interface between the user and the mobile phone. The audio circuit 1360 may transmit the electrical signal converted from received audio data to the speaker 1361, which converts it into a sound signal for output; on the other hand, the microphone 1362 converts collected sound signals into electrical signals, which are received by the audio circuit 1360 and converted into audio data; the audio data is then processed by the processor 1380 and sent, for example, to another mobile phone via the RF circuit 1310, or output to the memory 1320 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and a mobile phone can help a user to send and receive emails, browse webpages, access streaming media and the like through a WiFi module 1370, so that wireless broadband Internet access is provided for the user. Although fig. 13 shows a WiFi module 1370, it is understood that it does not belong to the necessary constitution of the mobile phone, and can be omitted entirely as required within a range that does not change the essence of the invention.
Processor 1380 is a control center of the handset, connecting various portions of the entire handset using various interfaces and lines, performing various functions of the handset and processing data by running or executing software programs and/or modules stored in memory 1320, and invoking data stored in memory 1320. Optionally, processor 1380 may include one or more processing units; preferably, processor 1380 may integrate an application processor primarily handling operating systems, user interfaces, applications, etc., with a modem processor primarily handling wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1380.
The handset further includes a power supply 1390 (e.g., a battery) for powering the various components, which may be logically connected to the processor 1380 through a power management system, such as to provide for managing charging, discharging, and power consumption by the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In this embodiment, the processor 1380 included in the terminal device further has the following functions:
acquiring text characteristics of wake-up words to be updated, which are sent by intelligent equipment;
generating the audio data of the wake-up word to be updated according to the text characteristics;
determining an acoustic feature sequence of the wake-up word to be updated according to the audio data; the acoustic feature sequence is used for performing secondary verification by the intelligent device in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent device or not so as to determine whether the undetermined feature sequence of the audio data to be identified meets wake-up conditions or not; the undetermined characteristic sequence comprises acoustic characteristics of a plurality of continuous audio frames in the audio data to be identified, wherein the plurality of continuous audio frames comprise target audio frames when the audio data to be identified contains the wake-up word, and the acoustic characteristics are used for identifying acoustic characteristics of the audio data to be identified;
And returning the acoustic feature sequence to the intelligent device.
The embodiment of the present application further provides a server, as shown in fig. 14, fig. 14 is a block diagram of a server 1400 provided in the embodiment of the present application, where the server 1400 may have a relatively large difference due to different configurations or performances, and may include one or more central processing units (Central Processing Units, abbreviated as CPUs) 1422 (e.g., one or more processors) and a memory 1432, one or more storage media 1430 (e.g., one or more mass storage devices) storing application programs 1442 or data 1444. Wherein the memory 1432 and storage medium 1430 can be transitory or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Further, the central processor 1422 may be provided in communication with a storage medium 1430 to perform a series of instruction operations in the storage medium 1430 on the server 1400.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 14.
The embodiment of the application further provides a computer readable storage medium, configured to store program code, where the program code is configured to execute any implementation manner of the wake-up method of the intelligent device and the wake-up word updating method of the intelligent device.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, where the above program may be stored in a computer readable storage medium, and when the program is executed, the program performs steps including the above method embodiments; and the aforementioned storage medium may be at least one of the following media: read-only memory (ROM), RAM, magnetic disk or optical disk, etc., which can store program codes.
It should be noted that, in this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to the description of the method embodiments for relevant parts. The apparatus and system embodiments described above are merely illustrative: the components described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
The foregoing is merely one specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

1. An intelligent device wake-up method, comprising:
performing primary verification on the obtained audio data to be identified, wherein the primary verification refers to verifying whether the audio data to be identified contains wake-up words corresponding to intelligent equipment;
in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent equipment, storing acoustic features determined according to the audio data to be identified, wherein the acoustic features are used for identifying acoustic features of the audio data to be identified;
if the target audio frames in the audio data to be identified are used for determining that the audio data to be identified contains the wake-up words, performing secondary verification on the audio data to be identified, and determining a feature sequence to be determined from the stored acoustic features, wherein the feature sequence to be determined comprises acoustic features of a plurality of continuous audio frames in the audio data to be identified, the plurality of continuous audio frames comprise the target audio frames, and the secondary verification refers to verifying whether the feature sequence to be determined meets the wake-up conditions of intelligent equipment;
determining the similarity degree between the acoustic feature sequence of the wake-up word and the undetermined feature sequence, wherein the acoustic feature sequence of the wake-up word consists of acoustic features corresponding to the wake-up word;
determining whether a wake-up condition is met according to the similarity degree;
if yes, waking up the intelligent device, wherein the intelligent device is woken up after the audio data to be identified passes both the primary verification and the secondary verification, voice recognition being performed in a multi-stage verification manner with different audio recognition technologies used at different stages.
2. The method of claim 1, wherein the number of acoustic features in the pending feature sequence is determined based on a length of the wake-up word.
3. The method according to claim 1 or 2, characterized in that the acoustic features of the audio data to be identified are determined by an acoustic model, wherein the acoustic features are output features of a hidden layer of the acoustic model.
4. The method according to claim 1, wherein the method further comprises:
the wake-up word updating is carried out on the intelligent equipment, and wake-up words to be updated are used as wake-up words corresponding to the intelligent equipment, wherein the wake-up word updating comprises:
acquiring text characteristics of the wake-up word to be updated;
generating audio data of the wake-up word to be updated according to the text characteristics;
and determining the acoustic feature sequence of the wake-up word to be updated according to the audio data.
5. The method of claim 4, wherein the acoustic features of the audio data to be identified are determined by an acoustic model, and wherein the determining the sequence of acoustic features of the wake word to be updated from the audio data comprises:
and determining the acoustic feature sequence of the wake-up word to be updated through the acoustic model according to the audio data.
6. The method of claim 5, wherein the audio data of the wake-up word to be updated comprises a plurality of pieces of audio data, and the determining, according to the audio data, the acoustic feature sequence of the wake-up word to be updated through the acoustic model comprises:
determining primary acoustic feature sequences corresponding to the audio data respectively through the acoustic model according to the audio data;
and determining the acoustic feature sequences of the wake-up words to be updated according to the plurality of primary acoustic feature sequences.
7. The method according to any of claims 4-6, wherein verifying whether the audio data to be identified contains the wake-up word is determined by a decoding network, the wake-up word update further comprising updating the decoding network based on the wake-up word to be updated.
8. An intelligent device wake-up apparatus is characterized in that the apparatus comprises a first determining unit, a second determining unit, a third determining unit and a wake-up unit:
the first determining unit is used for performing primary verification on the obtained audio data to be identified, wherein the primary verification refers to verification on whether the audio data to be identified contains wake-up words corresponding to intelligent equipment; in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent equipment, storing acoustic features determined according to the audio data to be identified, wherein the acoustic features are used for identifying acoustic features of the audio data to be identified;
the second determining unit is configured to perform secondary verification on the audio data to be identified if it is determined that the audio data to be identified includes the wake-up word through a target audio frame in the audio data to be identified, determine a pending feature sequence from the stored acoustic features, where the pending feature sequence includes acoustic features of a plurality of continuous audio frames in the audio data to be identified, and the plurality of continuous audio frames includes the target audio frame;
the third determining unit is configured to determine a degree of similarity between the acoustic feature sequence of the wake-up word and the undetermined feature sequence, where the acoustic feature sequence of the wake-up word is composed of acoustic features corresponding to the wake-up word; determining whether a wake-up condition is met according to the similarity degree;
The wake-up unit is used for waking up the intelligent device if the undetermined feature sequence meets wake-up conditions, wherein the intelligent device is woken up after the audio data to be identified passes primary verification and secondary verification, voice recognition is performed in a multi-level verification mode, and different audio recognition technologies are used in different levels.
9. The apparatus of claim 8, wherein the number of acoustic features in the sequence of undetermined features is determined based on a length of the wake-up word.
10. The apparatus according to claim 8 or 9, wherein the acoustic features of the audio data to be identified are determined by an acoustic model, wherein the acoustic features are output features of a hidden layer of the acoustic model.
11. The apparatus of claim 8, wherein the apparatus further comprises an updating unit;
the updating unit is used for updating the wake-up words of the intelligent equipment, and the wake-up words to be updated are used as the wake-up words corresponding to the intelligent equipment;
wherein, wake-up word updating in the updating unit includes:
acquiring text characteristics of the wake-up word to be updated;
generating audio data of the wake-up word to be updated according to the text characteristics;
And determining the acoustic feature sequence of the wake-up word to be updated according to the audio data.
12. The apparatus according to claim 11, wherein the acoustic characteristics of the audio data to be identified are determined by means of an acoustic model, the updating unit being specifically adapted to:
and determining the acoustic feature sequence of the wake-up word to be updated through the acoustic model according to the audio data.
13. The apparatus of claim 12, wherein the audio data of the wake-up word to be updated comprises a plurality of pieces of audio data, and the updating unit is specifically configured to:
determining primary acoustic feature sequences corresponding to the audio data respectively through the acoustic model according to the audio data;
and determining the acoustic feature sequences of the wake-up words to be updated according to the plurality of primary acoustic feature sequences.
14. The apparatus according to any of claims 11-13, wherein verifying whether the audio data to be identified contains the wake-up word is determined by a decoding network, the updating unit being further configured to update the decoding network according to the wake-up word to be updated.
15. A wake word updating method for an intelligent device, the method comprising:
Acquiring text characteristics of wake-up words to be updated, which are sent by intelligent equipment;
generating the audio data of the wake-up word to be updated according to the text characteristics;
determining an acoustic feature sequence of the wake-up word to be updated according to the audio data; the acoustic feature sequence is used for performing secondary verification by the intelligent device in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent device, so as to determine whether the undetermined feature sequence of the audio data to be identified meets the wake-up conditions, and determining whether the undetermined feature sequence of the audio data to be identified meets the wake-up conditions comprises: determining the similarity degree between the acoustic feature sequence of the wake-up word and the undetermined feature sequence, wherein the acoustic feature sequence of the wake-up word consists of acoustic features corresponding to the wake-up word; determining whether a wake-up condition is met according to the similarity degree; the method comprises the steps that a to-be-determined characteristic sequence comprises acoustic characteristics of a plurality of continuous audio frames in audio data to be identified, the plurality of continuous audio frames comprise target audio frames when determining that the audio data to be identified contains wake-up words, the acoustic characteristics are used for identifying the acoustic characteristics of the audio data to be identified, the to-be-determined characteristic sequence is a characteristic sequence determined from the acoustic characteristics of the audio data to be identified in secondary verification, the acoustic characteristics of the audio data to be identified are acoustic characteristics determined according to the audio data to be identified are stored in a first-stage verification process of the audio data to be identified, the first-stage verification is to verify whether the audio data to be identified contains the wake-up words corresponding to intelligent equipment, and the second-stage verification is to verify whether the to-be-determined characteristic sequence meets the wake-up conditions of the intelligent equipment;
And returning the acoustic feature sequence to the intelligent equipment, wherein the intelligent equipment performs voice recognition on the audio data to be recognized in a multi-level verification mode, different audio recognition technologies are used in different levels, and the intelligent equipment is awakened after the audio data to be recognized passes the primary verification and the secondary verification.
16. The method of claim 15, wherein the determining the sequence of acoustic features of the wake word to be updated from the audio data comprises:
according to the audio data, determining an acoustic feature sequence of the wake-up word to be updated through an acoustic model; the acoustic model is the same as the acoustic model used by the intelligent device in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent device.
17. The method of claim 16, wherein the audio data of the wake-up word to be updated includes a plurality of pieces of audio data, and the determining, according to the audio data, the acoustic feature sequence of the wake-up word to be updated through an acoustic model comprises:
determining primary acoustic feature sequences corresponding to the audio data respectively through the acoustic model according to the audio data;
And determining the acoustic feature sequences of the wake-up words to be updated according to the plurality of primary acoustic feature sequences.
18. The wake-up word updating device of the intelligent equipment is characterized by comprising an acquisition unit, a generation unit, a determination unit and a return unit:
the acquisition unit is used for acquiring text characteristics of wake-up words to be updated, which are sent by the intelligent equipment;
the generating unit is used for generating the audio data of the wake-up word to be updated according to the text characteristics;
the determining unit is used for determining the acoustic feature sequence of the wake-up word to be updated according to the audio data; the acoustic feature sequence is used for performing secondary verification by the intelligent device in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent device, so as to determine whether the undetermined feature sequence of the audio data to be identified meets the wake-up conditions, and determining whether the undetermined feature sequence of the audio data to be identified meets the wake-up conditions comprises: determining the similarity degree between the acoustic feature sequence of the wake-up word and the undetermined feature sequence, wherein the acoustic feature sequence of the wake-up word consists of acoustic features corresponding to the wake-up word; determining whether a wake-up condition is met according to the similarity degree; the method comprises the steps that a to-be-determined characteristic sequence comprises acoustic characteristics of a plurality of continuous audio frames in audio data to be identified, the plurality of continuous audio frames comprise target audio frames when determining that the audio data to be identified contains wake-up words, the acoustic characteristics are used for identifying the acoustic characteristics of the audio data to be identified, the to-be-determined characteristic sequence is a characteristic sequence determined from the acoustic characteristics of the audio data to be identified in secondary verification, the acoustic characteristics of the audio data to be identified are acoustic characteristics determined according to the audio data to be identified are stored in a first-stage verification process of the audio data to be identified, the first-stage verification is to verify whether the audio data to be identified contains the wake-up words corresponding to intelligent equipment, and the second-stage verification is to verify whether the to-be-determined characteristic sequence meets the wake-up conditions of the intelligent equipment;
The return unit is used for returning the acoustic feature sequence to the intelligent equipment, wherein the intelligent equipment performs voice recognition on the audio data to be recognized in a multi-level verification mode, different audio recognition technologies are used in different levels, and the intelligent equipment is awakened after the audio data to be recognized passes the primary verification and the secondary verification.
19. The apparatus according to claim 18, wherein the determining unit is specifically configured to:
according to the audio data, determining an acoustic feature sequence of the wake-up word to be updated through an acoustic model; the acoustic model is the same as the acoustic model used by the intelligent device in the process of verifying whether the audio data to be identified contains wake-up words corresponding to the intelligent device.
20. The apparatus of claim 19, wherein the audio data of the wake-up word to be updated includes a plurality of pieces of audio data, and the determining unit is specifically configured to:
determining primary acoustic feature sequences corresponding to the audio data respectively through the acoustic model according to the audio data;
and determining the acoustic feature sequences of the wake-up words to be updated according to the plurality of primary acoustic feature sequences.
21. A device for waking up a smart device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of waking up a smart device according to any one of claims 1-7 according to instructions in the program code.
22. A device for wake word update for a smart device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of wake-up word updating of a smart device according to any of the claims 15-17 according to instructions in the program code.
23. A computer readable storage medium for storing program code for performing the smart device wake-up method of any of claims 1-7 or performing the wake-up word update method of the smart device of any of claims 15-17.
CN201911158856.0A 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence Active CN110890093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911158856.0A CN110890093B (en) 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911158856.0A CN110890093B (en) 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN110890093A CN110890093A (en) 2020-03-17
CN110890093B true CN110890093B (en) 2024-02-09

Family

ID=69748452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911158856.0A Active CN110890093B (en) 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN110890093B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522592A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Intelligent terminal awakening method and device based on artificial intelligence
CN111755002B (en) * 2020-06-19 2021-08-10 北京百度网讯科技有限公司 Speech recognition device, electronic apparatus, and speech recognition method
CN112185382B (en) * 2020-09-30 2024-03-08 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112599127B (en) * 2020-12-04 2022-12-30 腾讯科技(深圳)有限公司 Voice instruction processing method, device, equipment and storage medium
CN112992189B (en) * 2021-01-29 2022-05-03 青岛海尔科技有限公司 Voice audio detection method and device, storage medium and electronic device
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Awakening voice synthesis method based on awakening voice model and application awakening method
CN113038048B (en) * 2021-03-02 2022-10-28 海信视像科技股份有限公司 Far-field voice awakening method and display device
CN113470646B (en) * 2021-06-30 2023-10-20 北京有竹居网络技术有限公司 Voice awakening method, device and equipment
CN113947855A (en) * 2021-09-18 2022-01-18 中标慧安信息技术股份有限公司 Intelligent building personnel safety alarm system based on voice recognition
CN115132197B (en) * 2022-05-27 2024-04-09 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium
CN115132198B (en) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium
CN115064160B (en) * 2022-08-16 2022-11-22 阿里巴巴(中国)有限公司 Voice wake-up method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448663A (en) * 2016-10-17 2017-02-22 海信集团有限公司 Voice wakeup method and voice interaction device
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
KR20180046780A (en) * 2016-10-28 2018-05-09 에스케이텔레콤 주식회사 Method for providing of voice recognition service using double wakeup and apparatus thereof
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
CN108962240A (en) * 2018-06-14 2018-12-07 百度在线网络技术(北京)有限公司 A kind of sound control method and system based on earphone
CN108986813A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Wake up update method, device and the electronic equipment of word
CN109036428A (en) * 2018-10-31 2018-12-18 广东小天才科技有限公司 A kind of voice wake-up device, method and computer readable storage medium
CN109979438A (en) * 2019-04-04 2019-07-05 Oppo广东移动通信有限公司 Voice awakening method and electronic equipment
CN110415699A (en) * 2019-08-30 2019-11-05 北京声智科技有限公司 A kind of judgment method, device and electronic equipment that voice wakes up

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335696A (en) * 2018-02-09 2018-07-27 Baidu Online Network Technology (Beijing) Co., Ltd. Voice awakening method and device

Also Published As

Publication number Publication date
CN110890093A (en) 2020-03-17

Similar Documents

Publication Publication Date Title
CN110890093B (en) Intelligent equipment awakening method and device based on artificial intelligence
US10956771B2 (en) Image recognition method, terminal, and storage medium
CN110288978B (en) Speech recognition model training method and device
US11416681B2 (en) Method and apparatus for determining a reply statement to a statement based on a sum of a probability of the reply statement being output in response to the statement and a second probability in which the statement is output in response to the statement and further based on a terminator
CN109145303B (en) Named entity recognition method, device, medium and equipment
CN108735209B (en) Wake-up word binding method, intelligent device and storage medium
US20200410985A1 (en) Method, apparatus, and storage medium for segmenting sentences for speech recognition
CN108735204B (en) Device for performing tasks corresponding to user utterances
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN105190746B (en) Method and apparatus for detecting target keyword
EP4064276A1 (en) Method and device for speech recognition, terminal and storage medium
CN110634474B (en) Speech recognition method and device based on artificial intelligence
CN112751648B (en) Packet loss data recovery method, related device, equipment and storage medium
WO2020119351A1 (en) Speech decoding method and apparatus, computer device and storage medium
CN114360510A (en) Voice recognition method and related device
CN113782012B (en) Awakening model training method, awakening method and electronic equipment
CN109389977B (en) Voice interaction method and device
CN110956265A (en) Model training method and related device
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN111292727B (en) Voice recognition method and electronic equipment
CN112488157A (en) Dialog state tracking method and device, electronic equipment and storage medium
CN111723783B (en) Content identification method and related device
CN113569043A (en) Text category determination method and related device
CN113707132B (en) Awakening method and electronic equipment
CN111091180A (en) Model training method and related device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40021088; country of ref document: HK)

SE01 Entry into force of request for substantive examination
GR01 Patent grant