CN110890093A - Intelligent device awakening method and device based on artificial intelligence - Google Patents

Intelligent device awakening method and device based on artificial intelligence

Info

Publication number
CN110890093A
CN110890093A (application CN201911158856.0A; granted publication CN110890093B)
Authority
CN
China
Prior art keywords
audio data
acoustic
awakening
word
recognized
Prior art date
Legal status
Granted
Application number
CN201911158856.0A
Other languages
Chinese (zh)
Other versions
CN110890093B (en)
Inventor
陈杰
苏丹
金明杰
朱振岭
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911158856.0A
Publication of CN110890093A
Application granted
Publication of CN110890093B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L2015/226 - Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics


Abstract

The embodiment of the application discloses a method and an apparatus for waking up an intelligent device. After the intelligent device collects audio data to be recognized, the acoustic features determined from that audio data are stored while the device verifies whether the audio data contains the awakening word corresponding to the device. Once this primary verification determines that the awakening word is present, the device is not woken immediately; a secondary verification is performed first, in which an undetermined feature sequence is determined from the stored acoustic features and checked against the acoustic feature sequence of the awakening word to decide whether it satisfies the wake-up condition. Only after the undetermined feature sequence satisfies the wake-up condition, so that the audio data has passed both the primary and the secondary verification, is the intelligent device woken. By reusing the acoustic features determined during the primary verification and extracting the undetermined feature sequence from them for the secondary verification, the number of false wake-ups of the intelligent device is effectively reduced.

Description

Intelligent device awakening method and device based on artificial intelligence
Technical Field
The present application relates to the field of data processing, and in particular, to an intelligent device wake-up method and apparatus based on artificial intelligence.
Background
At present, intelligent equipment is more and more popular and is widely applied to work and life of people.
Some intelligent devices are in a dormant state when not providing services, and when a user needs to use such intelligent devices, the user can speak a wakeup word in a voice mode to wake up the intelligent devices, for example, the user can wake up a dormant intelligent sound box through the wakeup word.
However, the related art currently has a high false wake-up rate: noises or utterances that are not the awakening word are mistakenly recognized as the awakening word, and the intelligent device is woken by mistake. The device then starts up suddenly when the user does not need it, giving the user a poor experience.
Disclosure of Invention
In order to solve this technical problem, the application provides an intelligent device awakening method that reuses the acoustic features determined in the primary verification, extracting from them an undetermined feature sequence for a secondary verification; this effectively reduces the number of false wake-ups of the intelligent device and improves the user experience.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a method for waking up an intelligent device, where the method includes:
in the process of verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, storing acoustic features determined from the audio data to be recognized, wherein the acoustic features characterize the acoustic properties of the audio data to be recognized;
if it is determined, at a target audio frame in the audio data to be recognized, that the audio data contains the awakening word, determining an undetermined feature sequence from the stored acoustic features, wherein the undetermined feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, and the plurality of consecutive audio frames include the target audio frame;
determining whether the undetermined characteristic sequence meets an awakening condition or not according to the acoustic characteristic sequence of the awakening word;
and if so, awakening the intelligent equipment.
In a second aspect, an embodiment of the present application provides an apparatus for waking up an intelligent device, where the apparatus includes a first determining unit, a second determining unit, a third determining unit, and a waking unit:
the first determining unit is configured to store, in the process of verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, acoustic features determined from the audio data to be recognized, wherein the acoustic features characterize the acoustic properties of the audio data to be recognized;
the second determining unit is configured to determine, if it is determined that the audio data to be recognized includes the wakeup word through a target audio frame in the audio data to be recognized, an undetermined feature sequence from the stored acoustic features, where the undetermined feature sequence includes acoustic features of multiple consecutive audio frames in the audio data to be recognized, and the multiple consecutive audio frames include the target audio frame;
the third determining unit is configured to determine whether the undetermined feature sequence meets an awakening condition according to the acoustic feature sequence of the awakening word;
and the awakening unit is used for awakening the intelligent equipment if the undetermined characteristic sequence meets the awakening condition.
In a third aspect, an embodiment of the present application provides a method for updating a wakeup word of an intelligent device, where the method includes:
acquiring text characteristics of a wakeup word to be updated, which are sent by intelligent equipment;
generating audio data of the awakening words to be updated according to the text characteristics;
determining an acoustic feature sequence of the awakening word to be updated according to the audio data; the acoustic feature sequence is used in the intelligent device's secondary verification, performed while verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, to determine whether the undetermined feature sequence of the audio data to be recognized satisfies the wake-up condition; the undetermined feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, the plurality of consecutive audio frames include the target audio frame at which the audio data was determined to contain the awakening word, and the acoustic features characterize the acoustic properties of the audio data to be recognized;
returning the acoustic feature sequence to the smart device.
In a fourth aspect, an embodiment of the present application provides an apparatus for updating a wakeup word of an intelligent device, where the apparatus includes an obtaining unit, a generating unit, a determining unit, and a returning unit:
the acquiring unit is used for acquiring the text characteristics of the awakening words to be updated sent by the intelligent equipment;
the generating unit is used for generating the audio data of the awakening words to be updated according to the text characteristics;
the determining unit is configured to determine an acoustic feature sequence of the awakening word to be updated according to the audio data; the acoustic feature sequence is used in the intelligent device's secondary verification, performed while verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, to determine whether the undetermined feature sequence of the audio data to be recognized satisfies the wake-up condition; the undetermined feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, the plurality of consecutive audio frames include the target audio frame at which the audio data was determined to contain the awakening word, and the acoustic features characterize the acoustic properties of the audio data to be recognized;
the return unit is configured to return the acoustic feature sequence to the smart device.
In a fifth aspect, an embodiment of the present application provides a device for smart device wake-up, where the device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method for smart device wake-up according to the instructions in the program code.
In a sixth aspect, an embodiment of the present application provides a device for updating a wake word of a smart device, where the device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for updating the wake-up word of the smart device according to the third aspect according to the instructions in the program code.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium is configured to store a program code, where the program code is configured to perform the method for waking up a smart device in the first aspect or the method for updating a wake-up word of the smart device in the third aspect.
According to the technical scheme above, while verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, the acoustic features determined from the audio data are stored; these acoustic features characterize the acoustic properties of the audio data to be recognized. If, when the target audio frame is verified, it is determined that the audio data contains the awakening word, an undetermined feature sequence comprising the acoustic features of consecutive audio frames (including the target audio frame) is determined from the stored acoustic features. Because the presence of the awakening word was determined at the target audio frame, if the audio data truly contains the awakening word, the undetermined feature sequence should embody the awakening word's acoustic characteristics. On this basis, during the secondary verification, whether the undetermined feature sequence satisfies the wake-up condition can be determined against the actual acoustic feature sequence of the awakening word; only when the condition is satisfied does the audio data indeed contain the awakening word, and only then is the intelligent device woken. The secondary verification thus effectively reduces the number of false wake-ups of the intelligent device and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of an intelligent device wake-up method according to an embodiment of the present application;
fig. 2 is a flowchart of an intelligent device wake-up method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a decoding network according to an embodiment of the present application;
fig. 4 is a flowchart of a method for updating a wakeup word of an intelligent device according to an embodiment of the present application;
fig. 5 is a flowchart of a method for waking up an intelligent device in an application scenario according to an embodiment of the present application;
fig. 6 is a schematic diagram of an intelligent device wake-up method in an application scenario according to an embodiment of the present application;
fig. 7 is a flowchart of a method for updating a wakeup word of an intelligent device in an application scenario according to an embodiment of the present application;
fig. 8 is a schematic diagram of a wakeup word updating method for an intelligent device in an application scenario according to an embodiment of the present application;
fig. 9a is a structural diagram of an intelligent device wake-up apparatus according to an embodiment of the present application;
fig. 9b is a structural diagram of an intelligent device wake-up apparatus according to an embodiment of the present application;
fig. 10 is a structural diagram of a wakeup word update apparatus of an intelligent device according to an embodiment of the present application;
fig. 11 is a structural diagram of a device for smart device wake-up according to an embodiment of the present disclosure;
fig. 12 is a block diagram of a server according to an embodiment of the present application;
Fig. 13 is a structural diagram of a device for updating a wakeup word of an intelligent device according to an embodiment of the present application;
fig. 14 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
In existing smart-device wake-up technology, a single-model custom wake-up scheme is often adopted, that is, only one audio recognition method is used, for example only a Keyword/Filler Hidden Markov Model (HMM) system or only a Long Short-Term Memory feature extractor system (LSTM Feature Extractor System). Whichever audio recognition technique is used on its own to wake up the smart device, a certain false wake-up rate exists, and that rate is high.
For example, suppose the awakening word of the smart device is "turn on the speaker". Because the smart device collects audio data at all times, if the user says "turn on the music" while chatting with someone else, the device will capture that audio and check whether it contains the awakening word. In the original Chinese the two phrases differ only in the final syllable, so partway through recognizing "turn on the music" the device may determine that the acquired audio data includes the awakening word and enter the wake-up state. In practice, however, "turn on the music" is not the actual awakening word "turn on the speaker", so the smart device is woken incorrectly, suddenly starting up when the user does not need it.
In order to solve the above technical problem, an embodiment of the present application provides an intelligent device wake-up method, which may perform voice recognition by using a multi-level verification method, and use different audio recognition technologies in different levels to achieve advantage complementation between different audio recognition technologies, thereby reducing a false wake-up rate when waking up an intelligent device by audio recognition, and improving user experience.
It should be emphasized that the method for waking up an intelligent device provided in the embodiments of the present application is implemented based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technology. Basic artificial intelligence technology generally includes sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiment of the present application, the artificial intelligence software technology mainly involved includes the above-mentioned voice processing technology and machine learning and other directions.
For example, the present application may involve automatic speech recognition (ASR) within speech technology, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and the like.
For example, machine learning (ML) may be involved. Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and so on. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to keep improving its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence. Machine learning generally includes techniques such as deep learning, which includes artificial neural networks such as convolutional neural networks (CNN), recurrent neural networks (RNN), and deep neural networks (DNN).
It can be understood that the method can be applied to an Intelligent device (Intelligent device), and the Intelligent device can be any device with a voice wake-up function, for example, an Intelligent terminal, an Intelligent home device (such as an Intelligent sound box, an Intelligent washing machine, and the like), an Intelligent wearable device (such as an Intelligent watch), and the like.
By implementing automatic speech recognition within speech technology, the intelligent device can be given the ability to listen, see, and feel. This is the development direction of future human-computer interaction, and voice is regarded as one of the most promising human-computer interaction modes of the future.
In the embodiment of the application, the intelligent device can extract acoustic features of the acquired audio data to be recognized by implementing the voice technology, determine the undetermined feature sequence according to the extracted acoustic features, and further determine the similarity through the undetermined feature sequence and the acoustic feature sequence of the awakening word; an acoustic model is trained through machine learning techniques, and the acoustic model is used for determining acoustic features according to audio data acquired by the intelligent device.
Meanwhile, in the embodiment of the application, the server can acquire the text characteristics of the awakening words to be updated sent by the intelligent device by implementing the voice technology, generate the audio data of the awakening words to be updated according to the text characteristics, and further determine the acoustic characteristic sequence of the awakening words to be updated according to the audio data; and training an acoustic model through a machine learning technology, wherein the acoustic model is used for determining an acoustic feature sequence of the awakening word to be updated according to the audio data of the awakening word to be updated, which is generated by the server.
In order to facilitate understanding of the technical scheme of the present application, the following describes an intelligent device wake-up method provided in the embodiment of the present application in combination with an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of an intelligent device wake-up method provided in an embodiment of the present application. The application scene comprises the intelligent device 101, and the intelligent device 101 can obtain the audio data to be identified input from the outside. Since the audio data to be recognized is all the audio data that can be collected by the smart device 101, the audio data to be recognized may include environmental noise, audio data related to the wakeup word, audio data unrelated to the wakeup word, and the like.
After acquiring the audio data to be identified, the smart device 101 performs primary verification on the audio data to be identified. The primary verification is to verify whether the audio data to be recognized contains a wakeup word corresponding to the intelligent device 101. The main module in the smart device 101, which is responsible for primary verification, is always in an on state when the smart device is turned on, and continuously collects and verifies audio around the smart device. The intelligent device 101 stores the acoustic features determined according to the audio data to be identified in the process of performing primary verification on the audio data to be identified. The acoustic features are used for identifying acoustic characteristics of the audio data to be identified. The acoustic features of the audio data to be recognized can reflect the composition of phonemes corresponding to the audio data to be recognized, so that the pronunciation condition of the audio data to be recognized can be reflected.
During the primary verification, the smart device 101 may process the audio data to be recognized frame by frame. If, when a certain frame of the audio data to be recognized, for example a target audio frame, is verified, it is determined that the audio data contains the awakening word, the audio data passes the primary verification. At this point the smart device 101 does not directly enter the wake-up state but continues with a secondary verification of the audio data to be recognized. The auxiliary module in the smart device 101 responsible for the secondary verification is normally off; it is switched on to assist verification only when the audio data to be recognized has passed the primary verification and the main module responsible for primary verification is about to wake up the smart device 101. This prevents the main module from waking the smart device 101 by mistake.
The secondary verification verifies whether the undetermined feature sequence satisfies the wake-up condition of the intelligent device. During the secondary verification, the smart device 101 determines, from the stored acoustic features, an undetermined feature sequence comprising the acoustic features of consecutive audio frames, where those frames include the target audio frame. Because the presence of the awakening word was determined when the target audio frame was verified, if the audio data to be recognized truly contains the awakening word, the undetermined feature sequence should embody the awakening word's acoustic characteristics. The acoustic features of the audio data to be recognized can represent its phoneme composition, so the undetermined feature sequence taken from them can represent the phoneme composition, and thus the pronunciation, of a particular segment of the audio data. The acoustic feature sequence of the awakening word is composed of the acoustic features corresponding to the awakening word and can represent the phoneme composition, and thus the pronunciation characteristics, of the awakening word's audio. Since both sequences reflect the phoneme composition and pronunciation of their corresponding audio, the smart device 101 can determine, according to the acoustic feature sequence of the awakening word, whether the undetermined feature sequence satisfies the wake-up condition.
When the wake-up condition is satisfied, the audio data to be recognized passes the secondary verification. Having passed both the primary and the secondary verification, the audio data causes the smart device 101 to enter the wake-up state.
When the smart device 101 verifies the audio data to be recognized, the acoustic features reflecting the phonemes of the audio data are obtained during the primary verification; during the secondary verification, the undetermined feature sequence determined from those acoustic features is compared with the acoustic feature sequence of the awakening word. The intelligent device is woken only after the audio data passes both the primary and the secondary verification, which reduces the false wake-up rate of the intelligent device and improves the user experience.
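Putting the two stages together, the flow just described can be sketched in Python as follows. This is a minimal illustration only: primary, secondary, and feature_buffer are hypothetical interfaces assumed for the sketch, not names from the patent.

```python
def process_audio_stream(frames, primary, secondary, feature_buffer):
    """Two-stage wake-up flow: primary verification runs frame by frame,
    buffering hidden-layer acoustic features; a primary hit triggers the
    secondary check over the buffered undetermined feature sequence."""
    for frame in frames:
        feature = primary.acoustic_features(frame)    # hidden-layer output per frame
        feature_buffer.push(feature)                  # fixed-length FIFO store
        if primary.contains_wake_word(feature):       # first-stage confidence check
            pending = feature_buffer.snapshot()       # undetermined feature sequence
            if secondary.satisfies_wake_condition(pending):
                return True                           # wake the device
    return False                                      # stay dormant
```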
Next, a smart device wake-up method provided in an embodiment of the present application will be described with reference to the drawings.
Referring to fig. 2, the figure is a flowchart of a method for waking up an intelligent device according to an embodiment of the present application, where the method includes the following steps:
s201: and in the process of verifying whether the audio data to be identified contains the awakening words corresponding to the intelligent equipment, storing the acoustic characteristics determined according to the audio data to be identified.
After the intelligent device obtains the audio data to be identified from the outside, it is verified whether the audio data to be identified contains the awakening word corresponding to the intelligent device, and the verification may be primary verification. The audio data to be recognized may include audio data related to the wakeup word, audio data unrelated to the wakeup word, ambient noise data, and the like.
In the process of primary verification, the intelligent device determines the acoustic characteristics according to the audio data to be identified. The acoustic feature may be any feature that represents a sound characteristic. The acoustic features can represent the phoneme composition of the audio data to be recognized, namely the pronunciation condition of the audio data.
It should be noted that, in this embodiment, the primary verification may be implemented by two parts, namely, an acoustic model and a confidence level decision module, and the acoustic model may be any type of model.
In one possible implementation, the acoustic features may be determined by an acoustic model whose basic structure comprises an input layer, hidden layers, and an output layer. In one possible implementation, the acoustic features are the output features of a hidden layer of the acoustic model. In the primary verification, after obtaining the audio data to be recognized, the intelligent device feeds it into the input layer of the acoustic model for computation and stores the acoustic features output by a hidden layer. The output features of any hidden layer of the acoustic model (hidden-layer features for short) can serve as the acoustic features in this application; generally, the closer a hidden layer is to the output layer, the better its features capture the acoustic characteristics.
In addition, the hidden-layer acoustic features are robust: they are not easily affected by a speaker's individual pronunciation, environmental noise, and the like, and they correspond closely to the acoustic characteristics of the audio data to be recognized.
It can be understood that the length of the acoustic features stored during primary verification may be equal to, or greater than, the length of the acoustic feature sequence of the awakening word, so that during secondary verification an undetermined feature sequence long enough to cover the acoustic feature sequence of the awakening word can be extracted from the stored features. When acoustic features are saved, the stored length is fixed and follows a first-in, first-out rule: once the stored features reach the preset length, saving a new acoustic feature removes the earliest saved one. This guarantees that, when the target audio frame passes primary verification, the stored acoustic features cover the stretch of audio immediately preceding the target audio frame.
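A minimal Python sketch of such a fixed-length, first-in-first-out feature store follows; collections.deque with maxlen handles the eviction automatically, and the class and method names are illustrative assumptions, not interfaces from the patent.

```python
from collections import deque

class FeatureBuffer:
    """Fixed-length FIFO store for per-frame acoustic features."""
    def __init__(self, max_frames):
        self._frames = deque(maxlen=max_frames)   # oldest entry is evicted automatically

    def push(self, feature_vector):
        self._frames.append(feature_vector)       # drops the earliest frame when full

    def snapshot(self):
        return list(self._frames)                 # the most recent max_frames features
```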
The confidence decision module receives the acoustic features of the audio data to be recognized output by the acoustic model and decides whether the audio data contains the awakening word. The module verifies the audio frames one by one, obtaining for each frame the confidence that the audio data to be recognized contains the awakening word; during verification, the confidences of multiple audio frames may be accumulated. When a certain audio frame, for example the target audio frame, is verified and the confidence that the audio data contains the awakening word reaches the first threshold of the confidence decision module, the module determines that the audio data to be recognized contains the awakening word, and the audio data passes the primary verification.
For example, suppose the first threshold of the confidence decision module is preset to 0.95, the awakening word is "turn on the speaker", and the audio data to be recognized collected by the intelligent device is "turn on the music" (which, in the original Chinese, shares all but its final syllable with the awakening word). When the last audio frame of the shared portion is verified, the confidence obtained by the decision module is 0.98. Since 0.98 is greater than 0.95, that is, the confidence that the audio data contains the awakening word reaches the first threshold, the module determines that the audio data to be recognized contains the awakening word, and the audio data passes the primary verification. The last audio frame of the shared portion is then the target audio frame.
In one implementation, the confidence decision module may be a decoding network, which determines whether the audio data to be recognized contains the awakening word. The decoding network can be an HMM decoding network: the acoustic features output by the acoustic model cover all possible pronunciation units (syllables, phonemes, or the like), and each pronunciation unit corresponds to one HMM state. As shown in fig. 3, the HMM decoding network is composed of a Keyword HMM and a Filler HMM, where the Keyword HMM is a series of HMM states corresponding to all the pronunciation units that make up the awakening word of the intelligent device, and the Filler HMM is a group of HMM states corresponding to carefully designed non-awakening-word pronunciation units. While verifying whether the audio data to be recognized contains the awakening word, the acoustic features are fed into the decoding network in fixed-size windows, and the Viterbi decoding algorithm is used to search for the best decoding path. The confidence decision module can then judge whether the audio data contains the awakening word according to whether the best decoding path passes through the Keyword HMM path. It is understood that the confidence decision module may also decide by computing more elaborate confidences or other strategies.
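For intuition, here is a heavily simplified Python sketch of keyword-path Viterbi scoring: a plain left-to-right chain with self-loops, no transition costs, and a single filler score standing in for the full Filler HMM. This is an illustrative reduction under those stated assumptions, not the patent's decoder.

```python
import numpy as np

def keyword_viterbi_score(log_probs, keyword_states):
    """Left-to-right Viterbi over the serial Keyword-HMM states.
    log_probs: (T, U) per-frame log posteriors over pronunciation units.
    keyword_states: ordered unit indices spelling out the awakening word."""
    T = log_probs.shape[0]
    S = len(keyword_states)
    score = np.full((T, S), -np.inf)
    score[0, 0] = log_probs[0, keyword_states[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                                # self-loop
            move = score[t - 1, s - 1] if s > 0 else -np.inf      # advance one state
            score[t, s] = max(stay, move) + log_probs[t, keyword_states[s]]
    return score[-1, -1]

def passes_primary(log_probs, keyword_states, filler_score, threshold):
    # The real network decodes Keyword and Filler paths jointly and checks
    # which route the best path takes; a single filler_score stands in here.
    confidence = keyword_viterbi_score(log_probs, keyword_states) - filler_score
    return confidence > threshold
```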
S202: and if the audio data to be recognized is determined to contain the awakening word, determining the undetermined feature sequence from the stored acoustic features.
After the intelligent device determines that the audio data to be recognized contains the awakening word, the audio data passes the primary verification and enters the secondary verification. The intelligent device determines an undetermined feature sequence from the acoustic features stored in step S201; the undetermined feature sequence comprises the acoustic features of multiple consecutive audio frames in the audio data to be recognized, and those audio frames include the target audio frame.
Since the acoustic features stored during the primary verification can represent the phoneme composition of the audio data to be recognized, the undetermined feature sequence taken from the stored acoustic features can represent the phoneme composition of a particular segment of the audio data to be recognized.
It can be understood that, in order for the undetermined feature sequence to cover the acoustic features of the awakening word and improve the accuracy of the secondary verification, the number of acoustic features in the undetermined feature sequence is determined by the length of the awakening word and may be equal to or greater than that length. Because the stored acoustic features are guaranteed to contain the target audio frame and the stretch of audio immediately preceding it, the undetermined feature sequence is guaranteed to contain the acoustic features of the audio segment that was judged in the primary verification to contain the awakening word.
It should be noted that the way the undetermined feature sequence is determined depends on the length of the stored acoustic features, as the sketch below illustrates. For example, when the stored length equals the length of the acoustic feature sequence of the awakening word, the stored acoustic features can be copied directly as the undetermined feature sequence, in which case the number of acoustic features in the sequence equals the length of the awakening word; when the stored length is greater, the undetermined feature sequence can be selected from the stored acoustic features, in which case the number of acoustic features in the sequence may be equal to or greater than the length of the awakening word.
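A minimal sketch of that selection step, assuming the saved features end at the target audio frame (the function name is illustrative):

```python
def undetermined_sequence(saved_features, wake_len):
    """Select the undetermined feature sequence from the saved per-frame
    features. wake_len is the length, in frames, of the awakening word's
    acoustic feature sequence."""
    if len(saved_features) <= wake_len:
        return list(saved_features)      # stored length == wake-word length: copy it all
    return saved_features[-wake_len:]    # otherwise take the most recent wake_len frames
```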
S203: and determining whether the undetermined characteristic sequence meets the awakening condition or not according to the acoustic characteristic sequence of the awakening word.
After the intelligent device determines the undetermined feature sequence from the stored acoustic features, that sequence reflects the phoneme composition of the audio segment that the primary verification judged to contain the awakening word, while the acoustic feature sequence of the awakening word reflects the phoneme composition of the awakening word's audio data. Because the two characterize audio data in the same form, whether the undetermined feature sequence satisfies the wake-up condition can be determined according to the acoustic feature sequence of the awakening word preset in the intelligent device. In one possible implementation, the intelligent device may compute the degree of similarity between the acoustic feature sequence of the awakening word and the undetermined feature sequence and decide from that similarity whether the wake-up condition is satisfied.
Determining the degree of similarity may mean computing the cosine similarity between the acoustic feature sequence of the awakening word and the undetermined feature sequence and comparing it with a preset second threshold. When the cosine similarity reaches the threshold, the two sequences are highly similar. Since the acoustic feature sequence of the awakening word represents the phoneme composition of the awakening word's audio and the undetermined feature sequence represents the phoneme composition of a segment of the audio data to be recognized, high similarity between the sequences means the phoneme compositions of the two pieces of audio are close, which strongly suggests that the segment of audio data to be recognized contains the awakening word.
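A minimal sketch of this cosine-similarity check, assuming both sequences are arrays of per-frame feature vectors; the 0.8 threshold is an assumed placeholder, not a value from the patent.

```python
import numpy as np

def satisfies_wake_condition(pending_seq, wake_seq, threshold=0.8):
    """Compare the undetermined feature sequence with the awakening word's
    acoustic feature sequence by cosine similarity."""
    a = np.ravel(np.asarray(pending_seq, dtype=float))
    b = np.ravel(np.asarray(wake_seq, dtype=float))
    n = min(a.size, b.size)              # trim to a common length if they differ
    a, b = a[:n], b[:n]
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12   # guard against zero norm
    return float(np.dot(a, b) / denom) >= threshold
```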
It should be noted that, since the acoustic features may have robustness and are not affected by personalized pronunciation of the sound source, environmental noise, and the like, the similarity between the acoustic feature sequence of the wakeup word and the feature sequence to be determined may be accurately determined for personalized pronunciation of different people or under different environments, so as to accurately determine whether the audio data to be recognized may wake up the smart device.
S204: and if so, awakening the intelligent equipment.
After the intelligent device determines, according to the acoustic feature sequence of the awakening word, that the undetermined feature sequence satisfies the wake-up condition, the audio data to be recognized passes the secondary verification. Having passed both the primary and the secondary verification, the audio data causes the intelligent device to switch from the dormant state to the working state. It can be understood that when the intelligent device determines that the undetermined feature sequence does not satisfy the wake-up condition, a misjudgment occurred in the primary verification: the audio data to be recognized does not actually contain the awakening word, and the intelligent device remains dormant.
According to the technical scheme above, while verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, the acoustic features determined from the audio data are stored; these acoustic features characterize the acoustic properties of the audio data to be recognized. If, when the target audio frame is verified, it is determined that the audio data contains the awakening word, an undetermined feature sequence comprising the acoustic features of consecutive audio frames (including the target audio frame) is determined from the stored acoustic features. Because the presence of the awakening word was determined at the target audio frame, if the audio data truly contains the awakening word, the undetermined feature sequence should embody the awakening word's acoustic characteristics. On this basis, during the secondary verification, whether the undetermined feature sequence satisfies the wake-up condition can be determined against the actual acoustic feature sequence of the awakening word; only when the condition is satisfied does the audio data indeed contain the awakening word, and only then is the intelligent device woken. The secondary verification thus effectively reduces the number of false wake-ups of the intelligent device and improves the user experience.
In some cases, the awakening word may be updated according to the user's needs, and the update process may be performed over a network or locally. When the update is performed over the network, there are two variants: either the cloud server performs only the conversion from text features to audio data while the acoustic feature sequence of the awakening word to be updated is still determined locally, or both the text-to-audio conversion and the determination of the acoustic feature sequence are performed in the cloud server, which then sends the resulting sequence directly to the intelligent device.
The first way of updating the awakening word: the update is performed entirely locally.
When the update process is performed entirely locally: the acoustic feature sequence of the awakening word reflects the phoneme composition and pronunciation of the awakening word's audio data and is independent of other influencing factors, and in the secondary verification it is compared against the undetermined feature sequence. Therefore, during the update, the intelligent device can obtain the text features of the awakening word to be updated, generate the audio data of the awakening word from those text features, and determine the acoustic feature sequence of the awakening word from that audio data.
For example, the intelligent device may contain a text-to-speech module that uses text-to-speech (TTS) conversion technology. After the user inputs the text features of the awakening word to be updated, the intelligent device converts them into audio data of the awakening word through the TTS module, determines the acoustic feature sequence of the awakening word from that audio data through the acoustic model, and stores the sequence in the secondary verification module for subsequent audio recognition. It can be understood that TTS may generate multiple audio samples; in that case, the intelligent device determines, through the acoustic model, a primary acoustic feature sequence for each sample and then determines the acoustic feature sequence of the awakening word to be updated from those primary sequences.
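A minimal sketch of this local update path. tts.synthesize and acoustic_model.feature_sequence are hypothetical interfaces assumed for illustration, and each extracted sequence is assumed to be a fixed-size (M, D) array so the sequences can be averaged directly.

```python
import numpy as np

def update_wake_word_locally(text, tts, acoustic_model, n_variants=5):
    """Synthesize several readings of the new awakening word, extract an
    acoustic feature sequence from each, and average them into the
    reference sequence kept by the secondary verification module."""
    utterances = [tts.synthesize(text) for _ in range(n_variants)]    # N readings
    sequences = [acoustic_model.feature_sequence(u) for u in utterances]
    return np.mean(np.stack(sequences), axis=0)                       # averaged reference
```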
The second way of updating the awakening word: a combined local and networked update.
When the awakening word update requires networking, in one possible implementation the cloud server performs only the conversion between text features and audio data, while the acoustic feature sequence is still determined locally. For example, a text-to-speech server (TTS Server) may be used to update the awakening word. After deciding on the awakening word to be updated, the user inputs its text features; the intelligent device sends them over the internet to the TTS Server in the cloud, and the TTS Server generates the audio data of the awakening word to be updated from those text features. The TTS Server transmits the audio data back over the network to the primary verification module of the intelligent device, where the acoustic feature sequence of the awakening word to be updated is determined from the audio data through the acoustic model and stored for subsequent audio recognition. It can be understood that the TTS Server may also generate multiple audio samples of the awakening word to be updated; the primary acoustic feature sequences corresponding to those samples are determined through the acoustic model, and the acoustic feature sequence of the awakening word to be updated is determined from them.
The third way of updating the awakening word: the update is performed entirely in the cloud.
In a possible implementation manner, when the awakening word needs to be updated through networking, the cloud server can be used for directly determining the acoustic feature sequence of the awakening word to be updated according to the text features, and the acoustic feature sequence is returned to the intelligent device. Referring to fig. 4, which is a flowchart of a method for updating a wakeup word of an intelligent device according to an embodiment of the present application, the method includes the following steps:
S401: and acquiring the text characteristics of the awakening words to be updated, which are sent by the intelligent equipment.
When the awakening words need to be updated, the audio generation server obtains the text characteristics of the awakening words to be updated, which are sent by the intelligent equipment.
S402: and generating audio data of the awakening words to be updated according to the text characteristics.
And after the audio generation server acquires the text characteristics, generating audio data of the awakening words to be updated according to the text characteristics.
S403: and determining an acoustic feature sequence of the awakening word to be updated according to the audio data.
After generating the audio data of the awakening word to be updated, the audio generation server determines the acoustic feature sequence of the awakening word from that audio data. The acoustic feature sequence is used in the intelligent device's secondary verification, performed while verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device, to determine whether the undetermined feature sequence of the audio data to be recognized satisfies the wake-up condition. The undetermined feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized; those frames include the target audio frame at which the audio data was determined to contain the awakening word, and the acoustic features characterize the acoustic properties of the audio data to be recognized. It can be understood that, as in the first two approaches, the acoustic features may be determined by an acoustic model, that is, the acoustic feature sequence of the awakening word to be updated is determined from the audio data through the acoustic model.
In addition, the audio generation server may generate multiple audio samples of the awakening word to be updated from the text; in that case, primary acoustic feature sequences corresponding to the samples are determined through the acoustic model, and the acoustic feature sequence of the awakening word to be updated is determined from those primary sequences.
S404: and returning the acoustic feature sequence to the intelligent device.
And after the audio generation server generates the acoustic feature sequence of the awakening word to be updated, the acoustic feature sequence is returned to the intelligent equipment, so that the awakening word in the intelligent equipment is updated.
It can be understood that although networking is required in some processes of generating the acoustic feature sequence of the wake-up word to be updated, after the generation is completed, the acoustic feature sequence of the wake-up word to be updated can be stored in the intelligent device, so that the intelligent device can normally operate without network connection in the process of waking up the intelligent device.
It should be noted that, in order to ensure that the determination method of the acoustic feature sequence and the undetermined feature sequence is the same when determining the similarity between the acoustic feature sequence of the wakeup word and the undetermined feature sequence, and further ensure the accuracy of the calculation of the similarity, if the acoustic feature of the audio data to be recognized is determined by the acoustic model, the acoustic feature sequence of the wakeup word to be updated may also be determined by the acoustic model according to the audio data when determining the acoustic feature sequence of the wakeup word to be updated.
It can be understood that, in the primary verification, when the confidence decision module adopts an HMM decoding network, the decoding network needs to be updated after the awakening word is updated: the Keyword HMM and the Filler HMM in the decoding network are rebuilt, the Keyword HMM being the series of HMM states corresponding to all the pronunciation units that make up the awakening word to be updated, and the Filler HMM being a group of HMM states corresponding to carefully designed non-awakening-word pronunciation units, so that whether the audio data to be recognized contains the updated awakening word can be determined correctly.
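A minimal sketch of that refresh step. The lexicon mapping and the state tuples below are illustrative assumptions, not the patent's data structures.

```python
def rebuild_decoding_network(new_wake_word, lexicon, filler_units):
    """Rebuild the Keyword/Filler decoding graph after an awakening-word
    update. lexicon maps a word to its ordered pronunciation units."""
    units = lexicon[new_wake_word]                        # pronunciation units of the new word
    keyword_hmm = [("keyword", u) for u in units]         # serial Keyword-HMM states
    filler_hmm = [("filler", u) for u in filler_units]    # parallel Filler-HMM states
    return {"keyword": keyword_hmm, "filler": filler_hmm}
```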
Next, the method for waking up an intelligent device provided in the embodiment of the present application is described in a practical application scenario. In this scenario, the intelligent device is a smart speaker; the acoustic model used by the primary verification is an LSTM model, and the verification method is a decoding network; the verification system used by the secondary verification is an LSTM KWS system; and the awakening word is "turn on the speaker". While chatting with someone near the speaker, the user says "turn on the music". The smart-device wake-up method is shown in fig. 5 and comprises:
S501: and acquiring audio data to be identified, and determining acoustic characteristics according to the audio data to be identified.
As shown in fig. 6, fig. 6 is a model diagram of the audio recognition used in this scenario. The smart speaker collects audio data from the surrounding environment, determines FBANK features through an FBANK feature-calculation function, and then feeds the audio data to be recognized, carrying its FBANK features, into the LSTM acoustic model, obtaining the acoustic features output by the LSTM hidden layer and the acoustic features output by the output layer.
S502: the acoustic features determined from the audio data to be recognized are saved.
The smart speaker stores the acoustic features output by the LSTM hidden layer for the subsequent secondary verification.
S503: and verifying whether the audio data to be recognized contains the awakening word "turn on the speaker".
The smart speaker verifies through the decoding network whether the audio data to be recognized contains the awakening word. Because the collected audio contains the speech "turn on the music", which is acoustically close to the awakening word, the best decoding path found for the acoustic features passes through the Keyword HMM path corresponding to the awakening word "turn on the speaker" when the decoding network verifies the audio, and the audio data to be recognized is thus judged to contain the awakening word.
S504: after determining that the audio data to be recognized contains the wake-up word, determine a pending feature sequence from the stored acoustic features.
After the smart speaker verifies through the decoding network that the audio data to be recognized contains the wake-up word, the audio data passes the first-level verification and enters the second-level verification. The smart speaker then extracts, through the LSTM feature extractor, the pending feature sequence from the stored acoustic features output by the LSTM hidden layer for the subsequent verification.
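A sketch of that extraction, assuming the window length W is derived from the wake-up word's length and that `target_idx` is the cache position of the target frame that triggered the first-level pass:

```python
import numpy as np

def extract_pending(feature_cache, target_idx, W=80):
    # Take W consecutive cached frames that cover the target frame.
    feats = list(feature_cache)
    start = max(0, min(target_idx - W // 2, len(feats) - W))
    window = feats[start:start + W]
    return np.stack([f.numpy() for f in window])
```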
S505: determine whether the pending feature sequence satisfies the wake-up condition according to the acoustic feature sequence of the wake-up word.
After the pending feature sequence is extracted, the smart speaker verifies whether it satisfies the wake-up condition by calculating the cosine similarity between the pending feature sequence and the acoustic feature sequence of the wake-up word and comparing the similarity with a preset threshold.
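A minimal sketch of this check, assuming the two sequences have already been brought to the same length (a real system would align or resample them first) and with an illustrative threshold:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def second_level_accept(pending, template, threshold=0.85):
    # Average frame-wise cosine similarity between the pending sequence
    # and the wake-up word's acoustic feature sequence.
    sims = [cosine(p, t) for p, t in zip(pending, template)]
    return sum(sims) / len(sims) >= threshold
```

It could be invoked as `second_level_accept(extract_pending(feature_cache, target_idx), template)`, with `template` built as in the wake-up word update flow described below.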
S506: if so, wake up the smart speaker.
When the cosine similarity reaches the preset threshold, the smart speaker enters the wake-up state. In this scenario, because what the user actually said was "open music" rather than the wake-up word, the cosine similarity does not reach the preset threshold, so the smart speaker is not falsely woken.
In addition, in this practical application scenario, when the wake-up word of the smart speaker needs to be updated, the update can be performed through the following steps, as shown in fig. 7:
S701: receive the text of the wake-up word to be updated input by the user.
The smart speaker receives the text of the wake-up word to be updated input by the user. As shown in fig. 8, fig. 8 is a model diagram of wake-up word updating in this application scenario.
S702: generate audio data corresponding to the text of the wake-up word to be updated through the TTS Server.
After receiving the text of the wake-up word to be updated, the smart speaker uploads the text to the TTS Server for audio generation, and the TTS Server generates N different audio data of the wake-up word to be updated.
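Purely as an illustration of this step, the sketch below posts the text to a TTS endpoint N times; the URL, the payload fields, and the voice-selection scheme are entirely hypothetical, as the patent only states that the TTS Server generates N different audio data.

```python
import requests

def fetch_tts_audio(text, n=5, url="https://tts.example.com/synthesize"):
    # Request n different audio renderings of the wake-up word text.
    clips = []
    for voice_id in range(n):
        resp = requests.post(url, json={"text": text, "voice": voice_id})
        resp.raise_for_status()
        clips.append(resp.content)     # raw audio bytes, assumed to be WAV
    return clips
```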
S703: determine the acoustic feature sequence of the wake-up word to be updated according to the audio data.
After receiving the audio data generated by the TTS Server, the smart speaker converts the audio data, through the same feature computation used in the first-level verification, into N acoustic feature sequences of the wake-up word to be updated, each M frames long, and averages the N sequences to obtain the acoustic feature sequence of the wake-up word to be updated, so that it can be used in the second-level verification.
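A sketch of that averaging step, assuming each clip has been run through the same FBANK + LSTM front end as the first level and that the N hidden-feature sequences are truncated or zero-padded to M frames before averaging (M and the padding strategy are assumptions):

```python
import numpy as np

def build_template(feature_seqs, M=80):
    # feature_seqs: N arrays of shape (frames, feature_dim).
    fixed = [np.pad(seq[:M], ((0, max(0, M - len(seq))), (0, 0)))
             for seq in feature_seqs]
    return np.mean(np.stack(fixed), axis=0)   # (M, feature_dim) template
```

The returned array is what the second-level similarity check above uses as `template`.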
S704: update the decoding network according to the acoustic features of the wake-up word to be updated.
After the acoustic features of the wake-up word to be updated are obtained, the Keyword HMM and the Filler HMM in the decoding network are updated accordingly: the Keyword HMM is formed by serially connecting the HMM states corresponding to all pronunciation units constituting the wake-up word to be updated, and the Filler HMM is formed by the HMM states corresponding to a carefully designed set of non-wake-word pronunciation units, so that the first-level verification can be performed for the wake-up word to be updated.
Based on the method for waking up an intelligent device provided in the foregoing embodiment, this embodiment provides an apparatus 900 for waking up an intelligent device, referring to fig. 9a, the apparatus 900 includes a first determining unit 901, a second determining unit 902, a third determining unit 903, and a waking unit 904:
a first determining unit 901, configured to save, in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, acoustic features determined according to the audio data to be recognized, where the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
a second determining unit 902, configured to determine a pending feature sequence from the saved acoustic features if it is determined, through a target audio frame in the audio data to be recognized, that the audio data to be recognized contains the wake-up word, where the pending feature sequence includes the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, and the plurality of consecutive audio frames include the target audio frame;
a third determining unit 903, configured to determine whether the pending feature sequence satisfies a wake-up condition according to the acoustic feature sequence of the wake-up word;
and a wake-up unit 904, configured to wake up the smart device if the pending feature sequence satisfies the wake-up condition.
In a possible implementation manner, the third determining unit 903 is specifically configured to:
determine the degree of similarity between the acoustic feature sequence of the wake-up word and the pending feature sequence;
and determine whether the wake-up condition is satisfied according to the degree of similarity.
In one possible implementation, the acoustic features of the audio data to be recognized are determined by an acoustic model, wherein the acoustic features are output features of a hidden layer of the acoustic model.
In one possible implementation, the number of acoustic features in the pending feature sequence is determined according to the length of the wake-up word.
In one possible implementation, referring to fig. 9b, the apparatus 900 further includes an updating unit 905:
the updating unit 905 is configured to update the wakeup word for the intelligent device, and use the wakeup word to be updated as the wakeup word corresponding to the intelligent device;
the updating unit 905 is specifically configured to:
acquiring text characteristics of the awakening words to be updated;
generating audio data of the awakening words to be updated according to the text characteristics;
and determining the acoustic feature sequence of the wake-up word to be updated according to the audio data.
In a possible implementation manner, the acoustic features of the audio data to be recognized are determined by an acoustic model, and the updating unit 905 is specifically configured to:
and according to the audio data, determining the acoustic feature sequence of the awakening word to be updated through the acoustic model.
In a possible implementation manner, the audio data of the wakeup word to be updated includes a plurality of audio data, and the updating unit 905 is specifically configured to:
determining primary acoustic feature sequences corresponding to the plurality of audio data through an acoustic model according to the plurality of audio data;
and determining the acoustic feature sequence of the awakening word to be updated according to the plurality of primary acoustic feature sequences.
In a possible implementation manner, whether the audio data to be recognized contains the wake-up word is verified through a decoding network, and the updating unit 905 is further configured to:
and updating the decoding network according to the awakening words to be updated.
Based on the method for updating a wakeup word of an intelligent device provided in the foregoing embodiment, this embodiment provides a wakeup word updating apparatus 1000 of an intelligent device, referring to fig. 10, the apparatus 1000 includes an obtaining unit 1001, a generating unit 1002, a determining unit 1003, and a returning unit 1004:
an obtaining unit 1001, configured to obtain a text feature of a wakeup word to be updated, where the text feature is sent by an intelligent device;
the generating unit 1002 is configured to generate audio data of the wakeup word to be updated according to the text feature;
a determining unit 1003, configured to determine the acoustic feature sequence of the wake-up word to be updated according to the audio data, where the acoustic feature sequence is used by the smart device for second-level verification in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, so as to determine whether a pending feature sequence of the audio data to be recognized satisfies a wake-up condition; the pending feature sequence includes the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, the plurality of consecutive audio frames include the target audio frame through which the audio data to be recognized is determined to contain the wake-up word, and the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
a returning unit 1004 for returning the acoustic feature sequence to the smart device.
In a possible implementation manner, the determining unit 1003 is specifically configured to:
determining an acoustic feature sequence of the awakening word to be updated through an acoustic model according to the audio data; the acoustic model is the same as the acoustic model used by the intelligent device in the process of verifying whether the audio data to be recognized contains the awakening word corresponding to the intelligent device.
In a possible implementation manner, the audio data of the wakeup word to be updated includes a plurality of audio data, and the determining unit 1003 is specifically configured to:
determining primary acoustic feature sequences corresponding to the plurality of audio data through an acoustic model according to the plurality of audio data;
and determining the acoustic feature sequence of the awakening word to be updated according to the plurality of primary acoustic feature sequences.
The embodiment of the present application further provides a device for smart device wake-up, which is described below with reference to the accompanying drawings. Referring to fig. 11, an embodiment of the present application provides a device 1100 for smart device wake-up. The device 1100 may also be a terminal device, and the terminal device may be any intelligent terminal including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS) terminal, or a vehicle-mounted computer; the following description takes the terminal device being a mobile phone as an example:
fig. 11 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 11, the cellular phone includes: a Radio Frequency (RF) circuit 1110, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the handset configuration shown in fig. 11 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 11:
RF circuit 1110 may be used for receiving and transmitting signals during a message transmission or call; in particular, it delivers downlink information received from a base station to the processor 1180 for processing, and transmits uplink data to the base station. In general, RF circuit 1110 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 1110 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, can collect touch operations of a user on or near it (for example, operations performed by the user on or near the touch panel 1131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1131 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1180, and can also receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1130 may include other input devices 1132 in addition to the touch panel 1131. In particular, other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 1140 may include a display panel 1141; optionally, the display panel 1141 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1131 can cover the display panel 1141; when the touch panel 1131 detects a touch operation on or near it, the operation is transmitted to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides a corresponding visual output on the display panel 1141 according to the type of the touch event. Although in fig. 11 the touch panel 1131 and the display panel 1141 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuit 1160, speaker 1161, and microphone 1162 may provide an audio interface between the user and the mobile phone. The audio circuit 1160 may convert received audio data into an electrical signal and transmit it to the speaker 1161, which converts it into a sound signal for output; on the other hand, the microphone 1162 converts collected sound signals into electrical signals, which the audio circuit 1160 receives and converts into audio data; the audio data is then output to the processor 1180 for processing and subsequently sent, for example, to another mobile phone via the RF circuit 1110, or output to the memory 1120 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the cell phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 1170, and provides wireless broadband internet access for the user. Although fig. 11 shows the WiFi module 1170, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1180 is a control center of the mobile phone, and is connected to various parts of the whole mobile phone through various interfaces and lines, and executes various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1120 and calling data stored in the memory 1120, thereby performing overall monitoring of the mobile phone. Optionally, processor 1180 may include one or more processing units; preferably, the processor 1180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1180.
The phone also includes a power supply 1190 (e.g., a battery) for powering the various components, and preferably, the power supply may be logically connected to the processor 1180 via a power management system, so that the power management system may manage charging, discharging, and power consumption management functions.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 1180 included in the terminal device further has the following functions:
in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, saving acoustic features determined according to the audio data to be recognized, where the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
if the audio data to be recognized is determined to contain the wake-up word through a target audio frame in the audio data to be recognized, determining a pending feature sequence from the saved acoustic features, where the pending feature sequence includes the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, and the plurality of consecutive audio frames include the target audio frame;
determining whether the pending feature sequence satisfies a wake-up condition according to the acoustic feature sequence of the wake-up word;
and if so, waking up the smart device.
Referring to fig. 12, fig. 12 is a block diagram of a server 1200 provided in this embodiment. The server 1200 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 1222 (e.g., one or more processors), a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) storing an application program 1242 or data 1244. The memory 1232 and the storage media 1230 may be transient or persistent storage. The program stored in the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1222 may be configured to communicate with the storage medium 1230 and execute, on the server 1200, the series of instruction operations stored in the storage medium 1230.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 12.
The embodiment of the present application further provides a device for updating a wake-up word of a smart device, which is described below with reference to the accompanying drawings. Referring to fig. 13, an embodiment of the present application provides a device 1300 for updating a wake-up word of a smart device. The device 1300 may also be a terminal device, and the terminal device may be any intelligent terminal including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA for short), a Point of Sales (POS for short) terminal, or a vehicle-mounted computer; the following description takes the terminal device being a mobile phone as an example:
fig. 13 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 13, the handset includes: a Radio Frequency (RF) circuit 1310, a memory 1320, an input unit 1330, a display unit 1340, a sensor 1350, an audio circuit 1360, a wireless fidelity (WiFi) module 1370, a processor 1380, and a power supply 1390. Those skilled in the art will appreciate that the handset configuration shown in fig. 13 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 13:
RF circuit 1310 may be used for receiving and transmitting signals during a message transmission or call; in particular, it delivers downlink information received from a base station to the processor 1380 for processing, and transmits uplink data to the base station. In general, the RF circuit 1310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, RF circuit 1310 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 1320 may be used to store software programs and modules, and the processor 1380 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1320. The memory 1320 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 1320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 1330 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1330 may include a touch panel 1331 and other input devices 1332. The touch panel 1331, also referred to as a touch screen, can collect touch operations of a user on or near it (for example, operations performed by the user on or near the touch panel 1331 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection devices according to a preset program. Optionally, the touch panel 1331 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1380, and can also receive and execute commands sent by the processor 1380. In addition, the touch panel 1331 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 1330 may include other input devices 1332 in addition to the touch panel 1331. In particular, other input devices 1332 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1340 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 1340 may include a display panel 1341; optionally, the display panel 1341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 1331 can cover the display panel 1341; when the touch panel 1331 detects a touch operation on or near it, the operation is transmitted to the processor 1380 to determine the type of the touch event, and the processor 1380 then provides a corresponding visual output on the display panel 1341 according to the type of the touch event. Although in fig. 13 the touch panel 1331 and the display panel 1341 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1331 and the display panel 1341 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 1350, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 1341 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 1341 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1360, speaker 1361, and microphone 1362 may provide an audio interface between the user and the mobile phone. The audio circuit 1360 may convert received audio data into an electrical signal and transmit it to the speaker 1361, which converts it into a sound signal for output; on the other hand, the microphone 1362 converts collected sound signals into electrical signals, which the audio circuit 1360 receives and converts into audio data; the audio data is then output to the processor 1380 for processing and subsequently sent, for example, to another mobile phone via the RF circuit 1310, or output to the memory 1320 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 1370, and provides wireless broadband internet access for the user. Although fig. 13 shows the WiFi module 1370, it is understood that it does not belong to the essential constitution of the handset, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1380 is a control center of the mobile phone, connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1320 and calling data stored in the memory 1320, thereby integrally monitoring the mobile phone. Optionally, processor 1380 may include one or more processing units; preferably, the processor 1380 may integrate an application processor, which handles primarily operating systems, user interfaces, application programs, etc., and a modem processor, which handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1380.
The handset also includes a power supply 1390 (e.g., a battery) to supply power to the various components, which may preferably be logically coupled to the processor 1380 via a power management system to manage charging, discharging, and power consumption management functions via the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 1380 included in the terminal device further has the following functions:
acquiring text characteristics of a wakeup word to be updated, which are sent by intelligent equipment;
generating audio data of the awakening words to be updated according to the text characteristics;
determining an acoustic feature sequence of the wake-up word to be updated according to the audio data, where the acoustic feature sequence is used by the smart device for second-level verification in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, so as to determine whether a pending feature sequence of the audio data to be recognized satisfies a wake-up condition; the pending feature sequence includes the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, the plurality of consecutive audio frames include the target audio frame through which the audio data to be recognized is determined to contain the wake-up word, and the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
returning the acoustic feature sequence to the smart device.
Referring to fig. 14, fig. 14 is a block diagram of a server 1400 provided in this embodiment. The server 1400 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 1422 (e.g., one or more processors), a memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) storing applications 1442 or data 1444. The memory 1432 and the storage media 1430 may be transient or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1422 may be configured to communicate with the storage medium 1430 and execute, on the server 1400, the series of instruction operations stored in the storage medium 1430.
The server 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 14.
The present application further provides a computer-readable storage medium for storing a program code, where the program code is configured to execute any one implementation of the intelligent device wake-up method and the intelligent device wake-up word update method described in the foregoing embodiments.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware instructed by a program; the program may be stored in a computer-readable storage medium, and when executed, performs the steps of the above method embodiments; and the aforementioned storage medium may be at least one of the following media: various media that can store program code, such as a read-only memory (ROM), a RAM, a magnetic disk, or an optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A smart device wake-up method, the method comprising:
in the process of verifying whether audio data to be recognized contains a wake-up word corresponding to the smart device, saving acoustic features determined according to the audio data to be recognized, wherein the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
if the audio data to be recognized is determined to contain the wake-up word through a target audio frame in the audio data to be recognized, determining a pending feature sequence from the saved acoustic features, wherein the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, and the plurality of consecutive audio frames comprise the target audio frame;
determining whether the pending feature sequence satisfies a wake-up condition according to an acoustic feature sequence of the wake-up word;
and if so, waking up the smart device.
2. The method of claim 1, wherein the determining whether the pending feature sequence satisfies a wake-up condition according to the acoustic feature sequence of the wake-up word comprises:
determining the degree of similarity between the acoustic feature sequence of the wake-up word and the pending feature sequence;
and determining whether the awakening condition is met according to the similarity degree.
3. The method of claim 1, wherein the number of acoustic features in the sequence of pending features is determined according to a length of the wake-up word.
4. The method according to any one of claims 1-3, wherein the acoustic features of the audio data to be recognized are determined by an acoustic model, and wherein the acoustic features are output features of a hidden layer of the acoustic model.
5. The method of claim 1, further comprising:
the method comprises the following steps of updating a wakeup word of the intelligent device, and using the wakeup word to be updated as the wakeup word corresponding to the intelligent device, wherein the wakeup word updating comprises the following steps:
acquiring text characteristics of the awakening words to be updated;
generating audio data of the wake-up word to be updated according to the text features;
and determining the acoustic characteristic sequence of the awakening word to be updated according to the audio data.
6. The method according to claim 5, wherein the acoustic features of the audio data to be recognized are determined by an acoustic model, and the determining the acoustic feature sequence of the wake word to be updated according to the audio data comprises:
and according to the audio data, determining the acoustic feature sequence of the awakening word to be updated through the acoustic model.
7. The method according to claim 6, wherein the audio data of the wake word to be updated includes a plurality of audio data, and the determining the acoustic feature sequence of the wake word to be updated through the acoustic model according to the audio data includes:
according to the plurality of audio data, determining primary acoustic feature sequences respectively corresponding to the plurality of audio data through the acoustic model;
and determining the acoustic feature sequence of the awakening word to be updated according to the plurality of primary acoustic feature sequences.
8. The method according to any one of claims 5-7, wherein whether the audio data to be recognized contains the wake-up word is verified through a decoding network, and wherein the wake-up word updating further comprises updating the decoding network according to the wake-up word to be updated.
9. A smart device wake-up apparatus, comprising a first determining unit, a second determining unit, a third determining unit, and a wake-up unit:
the first determining unit is configured to save, in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, acoustic features determined according to the audio data to be recognized, wherein the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
the second determining unit is configured to determine, if it is determined through a target audio frame in the audio data to be recognized that the audio data to be recognized contains the wake-up word, a pending feature sequence from the saved acoustic features, wherein the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, and the plurality of consecutive audio frames comprise the target audio frame;
the third determining unit is configured to determine whether the pending feature sequence satisfies a wake-up condition according to the acoustic feature sequence of the wake-up word;
and the wake-up unit is configured to wake up the smart device if the pending feature sequence satisfies the wake-up condition.
10. A wake word updating method for an intelligent device, the method comprising:
acquiring text characteristics of a wakeup word to be updated, which are sent by intelligent equipment;
generating audio data of the awakening words to be updated according to the text characteristics;
determining an acoustic feature sequence of the wake-up word to be updated according to the audio data, wherein the acoustic feature sequence is used by the smart device for second-level verification in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, so as to determine whether a pending feature sequence of the audio data to be recognized satisfies a wake-up condition; the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, the plurality of consecutive audio frames comprise the target audio frame through which the audio data to be recognized is determined to contain the wake-up word, and the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
returning the acoustic feature sequence to the smart device.
11. The method of claim 10, wherein the determining the acoustic feature sequence of the wake word to be updated according to the audio data comprises:
determining an acoustic feature sequence of the awakening word to be updated through an acoustic model according to the audio data; the acoustic model is the same as the acoustic model used by the intelligent device in the process of verifying whether the audio data to be identified contains the awakening words corresponding to the intelligent device.
12. The method according to claim 11, wherein the audio data of the wake word to be updated includes a plurality of audio data, and the determining the acoustic feature sequence of the wake word to be updated through an acoustic model according to the audio data includes:
according to the plurality of audio data, determining primary acoustic feature sequences respectively corresponding to the plurality of audio data through the acoustic model;
and determining the acoustic feature sequence of the awakening word to be updated according to the plurality of primary acoustic feature sequences.
13. An apparatus for updating a wake-up word of an intelligent device, the apparatus comprising an obtaining unit, a generating unit, a determining unit, and a returning unit:
the acquiring unit is used for acquiring the text characteristics of the awakening words to be updated sent by the intelligent equipment;
the generating unit is used for generating the audio data of the awakening words to be updated according to the text characteristics;
the determining unit is configured to determine the acoustic feature sequence of the wake-up word to be updated according to the audio data, wherein the acoustic feature sequence is used by the smart device for second-level verification in the process of verifying whether audio data to be recognized contains the wake-up word corresponding to the smart device, so as to determine whether a pending feature sequence of the audio data to be recognized satisfies a wake-up condition; the pending feature sequence comprises the acoustic features of a plurality of consecutive audio frames in the audio data to be recognized, the plurality of consecutive audio frames comprise the target audio frame through which the audio data to be recognized is determined to contain the wake-up word, and the acoustic features are used to characterize the acoustic properties of the audio data to be recognized;
the return unit is configured to return the acoustic feature sequence to the smart device.
14. A device for smart device wake up, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method for smart device wake-up according to any one of claims 1-8 according to instructions in the program code.
15. A device for wake word update for a smart device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method for wake up word update of a smart device according to any one of claims 10-12 according to instructions in the program code.
CN201911158856.0A 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence Active CN110890093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911158856.0A CN110890093B (en) 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911158856.0A CN110890093B (en) 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN110890093A true CN110890093A (en) 2020-03-17
CN110890093B CN110890093B (en) 2024-02-09

Family

ID=69748452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911158856.0A Active CN110890093B (en) 2019-11-22 2019-11-22 Intelligent equipment awakening method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN110890093B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522592A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Intelligent terminal awakening method and device based on artificial intelligence
CN111755002A (en) * 2020-06-19 2020-10-09 北京百度网讯科技有限公司 Speech recognition device, electronic apparatus, and speech recognition method
CN112185382A (en) * 2020-09-30 2021-01-05 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112599127A (en) * 2020-12-04 2021-04-02 腾讯科技(深圳)有限公司 Voice instruction processing method, device, equipment and storage medium
CN112992189A (en) * 2021-01-29 2021-06-18 青岛海尔科技有限公司 Voice audio detection method and device, storage medium and electronic device
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Awakening voice synthesis method based on awakening voice model and application awakening method
CN113038048A (en) * 2021-03-02 2021-06-25 海信视像科技股份有限公司 Far-field voice awakening method and display device
CN113470646A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Voice wake-up method, device and equipment
CN113947855A (en) * 2021-09-18 2022-01-18 中标慧安信息技术股份有限公司 Intelligent building personnel safety alarm system based on voice recognition
CN115064160A (en) * 2022-08-16 2022-09-16 阿里巴巴(中国)有限公司 Voice wake-up method and device
CN115132198A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing device, electronic equipment, program product and medium
CN115132197A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing apparatus, electronic device, program product, and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448663A (en) * 2016-10-17 2017-02-22 海信集团有限公司 Voice wakeup method and voice interaction device
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
KR20180046780A (en) * 2016-10-28 2018-05-09 에스케이텔레콤 주식회사 Method for providing of voice recognition service using double wakeup and apparatus thereof
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
CN108962240A (en) * 2018-06-14 2018-12-07 百度在线网络技术(北京)有限公司 A kind of sound control method and system based on earphone
CN108986813A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Wake up update method, device and the electronic equipment of word
CN109036428A (en) * 2018-10-31 2018-12-18 广东小天才科技有限公司 A kind of voice wake-up device, method and computer readable storage medium
CN109979438A (en) * 2019-04-04 2019-07-05 Oppo广东移动通信有限公司 Voice awakening method and electronic equipment
US20190251963A1 (en) * 2018-02-09 2019-08-15 Baidu Online Network Technology (Beijing) Co., Ltd. Voice awakening method and device
CN110415699A (en) * 2019-08-30 2019-11-05 北京声智科技有限公司 A kind of judgment method, device and electronic equipment that voice wakes up

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN106448663A (en) * 2016-10-17 2017-02-22 海信集团有限公司 Voice wakeup method and voice interaction device
KR20180046780A (en) * 2016-10-28 2018-05-09 에스케이텔레콤 주식회사 Method for providing of voice recognition service using double wakeup and apparatus thereof
CN108198548A (en) * 2018-01-25 2018-06-22 苏州奇梦者网络科技有限公司 A kind of voice awakening method and its system
US20190251963A1 (en) * 2018-02-09 2019-08-15 Baidu Online Network Technology (Beijing) Co., Ltd. Voice awakening method and device
CN108962240A (en) * 2018-06-14 2018-12-07 百度在线网络技术(北京)有限公司 A kind of sound control method and system based on earphone
CN108986813A (en) * 2018-08-31 2018-12-11 出门问问信息科技有限公司 Wake up update method, device and the electronic equipment of word
CN109036428A (en) * 2018-10-31 2018-12-18 广东小天才科技有限公司 A kind of voice wake-up device, method and computer readable storage medium
CN109979438A (en) * 2019-04-04 2019-07-05 Oppo广东移动通信有限公司 Voice awakening method and electronic equipment
CN110415699A (en) * 2019-08-30 2019-11-05 北京声智科技有限公司 A kind of judgment method, device and electronic equipment that voice wakes up

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522592A (en) * 2020-04-24 2020-08-11 腾讯科技(深圳)有限公司 Intelligent terminal awakening method and device based on artificial intelligence
CN111755002A (en) * 2020-06-19 2020-10-09 北京百度网讯科技有限公司 Speech recognition device, electronic apparatus, and speech recognition method
CN112185382A (en) * 2020-09-30 2021-01-05 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112185382B (en) * 2020-09-30 2024-03-08 北京猎户星空科技有限公司 Method, device, equipment and medium for generating and updating wake-up model
CN112599127A (en) * 2020-12-04 2021-04-02 腾讯科技(深圳)有限公司 Voice instruction processing method, device, equipment and storage medium
CN112599127B (en) * 2020-12-04 2022-12-30 腾讯科技(深圳)有限公司 Voice instruction processing method, device, equipment and storage medium
CN112992189B (en) * 2021-01-29 2022-05-03 青岛海尔科技有限公司 Voice audio detection method and device, storage medium and electronic device
CN112992189A (en) * 2021-01-29 2021-06-18 青岛海尔科技有限公司 Voice audio detection method and device, storage medium and electronic device
CN113012681A (en) * 2021-02-18 2021-06-22 深圳前海微众银行股份有限公司 Awakening voice synthesis method based on awakening voice model and application awakening method
CN113012681B (en) * 2021-02-18 2024-05-17 深圳前海微众银行股份有限公司 Awakening voice synthesis method based on awakening voice model and application awakening method
CN113038048A (en) * 2021-03-02 2021-06-25 海信视像科技股份有限公司 Far-field voice awakening method and display device
CN113470646A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Voice wake-up method, device and equipment
CN113470646B (en) * 2021-06-30 2023-10-20 北京有竹居网络技术有限公司 Voice awakening method, device and equipment
CN113947855A (en) * 2021-09-18 2022-01-18 中标慧安信息技术股份有限公司 Intelligent building personnel safety alarm system based on voice recognition
CN115132197A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing apparatus, electronic device, program product, and medium
CN115132198A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing device, electronic equipment, program product and medium
CN115132198B (en) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium
CN115132197B (en) * 2022-05-27 2024-04-09 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium
CN115064160A (en) * 2022-08-16 2022-09-16 阿里巴巴(中国)有限公司 Voice wake-up method and device

Also Published As

Publication number Publication date
CN110890093B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN110890093B (en) Intelligent equipment awakening method and device based on artificial intelligence
KR102270394B1 (en) Method, terminal, and storage medium for recognizing an image
CN110288978B (en) Speech recognition model training method and device
CN108735209B (en) Wake-up word binding method, intelligent device and storage medium
US11416681B2 (en) Method and apparatus for determining a reply statement to a statement based on a sum of a probability of the reply statement being output in response to the statement and a second probability in which the statement is output in response to the statement and further based on a terminator
JP7312853B2 (en) AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
US20200410985A1 (en) Method, apparatus, and storage medium for segmenting sentences for speech recognition
CN108694940B (en) Voice recognition method and device and electronic equipment
EP2821992B1 (en) Method for updating voiceprint feature model and terminal
CN110444210B (en) Voice recognition method, awakening word detection method and device
CN112751648B (en) Packet loss data recovery method, related device, equipment and storage medium
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
CN114360510A (en) Voice recognition method and related device
CN113782012B (en) Awakening model training method, awakening method and electronic equipment
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
CN111292727B (en) Voice recognition method and electronic equipment
CN111816168A (en) Model training method, voice playing method, device and storage medium
CN116955610A (en) Text data processing method and device and storage medium
CN116127966A (en) Text processing method, language model training method and electronic equipment
CN111723783B (en) Content identification method and related device
CN113569043A (en) Text category determination method and related device
CN112488157A (en) Dialog state tracking method and device, electronic equipment and storage medium
CN113707132B (en) Awakening method and electronic equipment
CN113535926B (en) Active dialogue method and device and voice terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021088

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant