
CN111048068A - Voice wake-up method, device and system and electronic equipment

Info

Publication number: CN111048068A
Application number: CN201811186019.4A
Authority: CN
Inventors: 曹元斌; 张智超; 风翮; 王刚
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-10-11
Filing date: 2018-10-11
Publication date: 2020-04-21
Anticipated expiration: 2038-10-11
Also published as: CN111048068B; WO2020073839A1

Abstract

Embodiments of the invention provide a voice wake-up method, apparatus, and system, and an electronic device. The method comprises: acquiring a first voice signal; identifying the pinyin final signals contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal; comparing the first final signal sequence with a second final signal sequence of a preset wake-up word to extract, from the first final signal sequence, a third final signal sequence having the same content as the second final signal sequence; and performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, to determine whether the full-spelling voice signal is the voice signal corresponding to the wake-up word. The scheme can identify wake-up words quickly and accurately and improve the wake-up speed of the device.

Description

Voice wake-up method, device and system and electronic equipment

Technical Field

The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a system, and an electronic device for voice wake-up.

Background

As applications of artificial intelligence continue to develop, voice recognition plays an increasingly important role as a basic interaction mode of intelligent devices. It covers many aspects, including waking up a device through voice instructions, controlling the operation of the device, conducting man-machine conversation with the device, and issuing voice instructions to a plurality of devices. Efficient and accurate voice recognition and a fast, convenient wake-up mode are important development directions for intelligent devices.

At present, the main performance bottleneck of custom wake-up is that the computing resources on the terminal (terminal device) are limited, and the number of voice-feature classes handled by the classifier at the core of the wake-up engine directly affects the speed and accuracy of wake-up. The traditional pinyin-granularity classification strategy takes the complete pinyin of common Chinese characters as the classes: more than 1200 classes with tones, or more than 400 with tones removed, with an accuracy of about 80%. Achieving higher accuracy, however, requires more on-device computing performance and considerable post-processing work.

Disclosure of Invention

The invention provides a voice awakening method, a voice awakening device, a voice awakening system and electronic equipment, which can quickly and accurately identify awakening words and improve the awakening speed of the equipment.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

in a first aspect, a voice wake-up method is provided, including:

acquiring a first voice signal;

identifying a pinyin final signal contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal;

comparing the first final signal sequence with a second final signal sequence of a preset awakening word to extract a third final signal sequence with the same content as the second final signal sequence from the first final signal sequence;

and performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determining whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

In a second aspect, another voice wake-up method is provided, including:

acquiring a first voice signal;

identifying vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

comparing the first vowel signal sequence with a second vowel signal sequence of a preset awakening word to extract a third vowel signal sequence with the same content as the second vowel signal sequence from the first vowel signal sequence;

and performing automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determining whether the full voice signal is the voice signal corresponding to the wake-up word.

In a third aspect, a voice wake-up apparatus is provided, including:

the signal acquisition module is used for acquiring a first voice signal;

the signal identification module is used for identifying pinyin final signals contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal;

the signal comparison module is used for comparing the first final part signal sequence with a second final part signal sequence of a preset awakening word so as to extract a third final part signal sequence with the same content as the second final part signal sequence from the first final part signal sequence;

and the voice recognition module is used for performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determining whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

In a fourth aspect, another voice wake-up apparatus is provided, including:

the signal acquisition module is used for acquiring a first voice signal;

the signal identification module is used for identifying a vowel signal contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

the signal comparison module is used for comparing the first vowel signal sequence with a second vowel signal sequence of a preset awakening word so as to extract a third vowel signal sequence with the same content as the second vowel signal sequence from the first vowel signal sequence;

and the voice recognition module is used for performing automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determining whether the full voice signal is the voice signal corresponding to the wake-up word.

In a fifth aspect, a voice wake-up system is provided, including:

the terminal is used for acquiring a first voice signal; identifying a pinyin final signal contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal; comparing the first final signal sequence with a second final signal sequence of a preset awakening word to extract a third final signal sequence with the same content as the second final signal sequence from the first final signal sequence; sending the full spelling voice signal corresponding to the third final signal sequence to a server;

and the server is used for performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determining whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

In a sixth aspect, a voice wake-up method is provided, including:

the terminal acquires a first voice signal; identifying a pinyin final signal contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal; comparing the first final signal sequence with a second final signal sequence of a preset awakening word to extract a third final signal sequence with the same content as the second final signal sequence from the first final signal sequence; sending the full spelling voice signal corresponding to the third final signal sequence to a server;

and the server performs automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determines whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

In a seventh aspect, an electronic device is provided, including:

a memory for storing a program;

a processor, coupled to the memory, for executing the program for:

acquiring a first voice signal;

In an eighth aspect, another electronic device is provided, including:

a memory for storing a program;

a processor, coupled to the memory, for executing the program for:

acquiring a first voice signal;

The invention provides a voice wake-up method, apparatus, and system, and an electronic device. After a first voice signal to be recognized is acquired, the pinyin final signals contained in the first voice signal are identified to obtain a first final signal sequence corresponding to the first voice signal. The first final signal sequence is then compared with a second final signal sequence of a preset wake-up word to extract, from the first final signal sequence, a third final signal sequence having the same content as the second final signal sequence. Finally, automatic speech recognition is performed on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, to determine whether the full-spelling voice signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word. In other words, the scheme compares the final signals in the voice signal to be recognized with the finals of the wake-up word, extracts the portion of the voice signal whose final signals are the same as the finals of the wake-up word, and then applies automatic speech recognition to that portion to determine whether it contains the wake-up word. The wake-up word can thus be recognized quickly and accurately, and the wake-up speed of the device is improved.

The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are described below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a logic diagram of a basic flow of voice wake-up;

FIG. 2 is a schematic diagram of the processing logic of the on-device wake-up engine in the basic flow of voice wake-up;

FIG. 3 is a schematic diagram of processing logic of a wake engine according to an embodiment of the present invention;

FIG. 4 is a diagram of a voice wake-up system according to an embodiment of the present invention;

FIG. 5 is a flowchart of a voice wake-up method according to an embodiment of the present invention;

FIG. 6 is a flowchart of a voice wake-up method according to an embodiment of the present invention;

FIG. 7 is a flowchart of a vowel classification training method according to an embodiment of the present invention;

FIG. 8 is a flowchart of a vowel classification training method according to an embodiment of the present invention;

FIG. 9 is a first block diagram of a voice wake-up apparatus according to an embodiment of the present invention;

FIG. 10 is a block diagram of a voice wake-up apparatus according to an embodiment of the present invention;

FIG. 11 is a first structural diagram of a vowel classification training apparatus according to an embodiment of the present invention;

FIG. 12 is a second structural diagram of a vowel classification training apparatus according to an embodiment of the present invention;

FIG. 13 is a flowchart of a voice wake-up method according to an embodiment of the present invention;

FIG. 14 is a first schematic structural diagram of an electronic device according to an embodiment of the invention;

FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 shows the basic flow of voice wake-up. After receiving a voice signal, a voice device performs signal processing (mainly noise reduction and echo cancellation) and feature extraction on it, converting the original input audio signal into features (i.e., a spectrum signal of the voice) that the on-device (terminal) wake-up engine can recognize. The features are input into the wake-up engine, which compares them against the wake-up word. When the wake-up word is hit, the server is then instructed to execute subsequent instructions, such as playing songs or making a sound.

In the basic flow of voice wake-up shown in FIG. 1, the wake-up engine can be regarded as the core part that performs wake-up. As shown in FIG. 2, the on-device wake-up engine mainly includes two parts: a classifier and a post-processing section.

First, the classifier. The classifier converts continuous voice features into discrete classes. This computation usually consumes the most computing resources of the whole wake-up task, and the number of classes output by the last layer of the neural network directly determines the computing scale of the whole network. With a traditional Hidden Markov Model-Deep Neural Network (HMM-DNN) that models the Probability Density Function (PDF) of phonemes (phones), at least 6000 to 8000 classes are needed to reach a production-ready state; with pinyin as the classification unit, more than 400 to 1200 classes are still needed.

Second, post-processing. The post-processing part performs the wake-up word detection. The traditional method detects whole words: the output of the classifier can be smoothed and then compared with the wake-up word using a Dynamic Time Warping (DTW) algorithm, or Automatic Speech Recognition (ASR) technology can be used to recognize whether the speech hits the wake-up word.

With such a classifier in the on-device wake-up engine, the large number of classes makes the classification network huge, so the terminal must be configured with high computing performance.
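
As a rough illustration of why the class count matters, the following sketch counts the parameters of a fully connected output layer for several class counts. The hidden width of 512 and the 40-class finals inventory are assumptions made for illustration only (the description gives no figure for either); the 8000, 1200, and 400 class counts are taken from the description above. The sketch covers only the last layer, not the whole network.

```python
# Illustrative arithmetic only: parameters of a dense output layer.
# HIDDEN_UNITS and the 40-class finals inventory are assumed values.
HIDDEN_UNITS = 512


def output_layer_parameters(hidden_units: int, num_classes: int) -> int:
    """Weights plus biases of a fully connected layer feeding a softmax."""
    weights = hidden_units * num_classes
    biases = num_classes
    return weights + biases


CLASS_COUNTS = {
    "phone PDFs (HMM-DNN)": 8000,
    "pinyin with tones": 1200,
    "pinyin without tones": 400,
    "finals only (assumed)": 40,
}

for name, classes in CLASS_COUNTS.items():
    print(name, output_layer_parameters(HIDDEN_UNITS, classes))
```

Under these assumptions the output layer shrinks in direct proportion to the number of classes, which is the effect the scheme below relies on.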

Embodiments of the invention overcome the prior-art defect that, because the classification network is huge, voice wake-up can be performed accurately and quickly only when the terminal is configured with substantial computing resources. The core idea is to divide the core part of voice wake-up into two wake-up word recognition passes. The first pass is completed on the terminal and classifies and recognizes only the pinyin finals of the wake-up word, which completes the preliminary recognition of the voice signal to be recognized. The full voice signals that pass this preliminary screening, i.e., those whose final signals are the same as the final signals of the wake-up word, are then sent to the cloud, which recognizes the whole voice signal again and determines whether it hits the wake-up word.

FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention. It involves the two parties that perform voice wake-up: a device side (e.g., a terminal capable of receiving and recognizing voice, such as a smart speaker) and a cloud side (provided with a server).

On the device side, the voice signal to be recognized first undergoes the first wake-up word recognition pass, in which a pre-trained classifier classifies and recognizes only the final signals of the voice signal. Post-processing then compares the recognized final signal sequence with the finals of the wake-up word to judge whether the voice signal hits the finals of the wake-up word, and the full voice signal that hits them is transmitted to the cloud.

On the cloud side, the voice signals to be recognized are the full voice signals whose final signals are the same as those of the wake-up word. These signals undergo the second wake-up word recognition pass (secondary verification), in which the whole voice signal is recognized, for example using ASR technology, to determine whether it hits the wake-up word.

Based on the above-mentioned voice wake-up concept, fig. 4 is a structure diagram of a voice wake-up system provided in an embodiment of the present invention. As shown in fig. 4, the system includes a terminal 410 and a server 420, wherein:

the terminal 410 includes:

the signal acquisition module is used for acquiring a first voice signal, wherein the first voice signal is a Chinese voice signal for example;

the server 420 includes:

The technical solution of the present application is further illustrated by the following examples.

Example one

Based on the above voice wake-up concept, FIG. 5 is a flowchart of a voice wake-up method according to an embodiment of the present invention. The method may be executed by modules deployed in the terminal 410 and the server 420 in FIG. 4: steps S510 to S530 can be executed on the device (terminal), and step S540 can be executed in the cloud (server). As shown in FIG. 5, the voice wake-up method includes the following steps:

S510, acquiring a first voice signal, where the first voice signal is, for example, a Chinese voice signal.

The first voice signal may be a voice signal received through a voice device, and the target device is further awakened by recognizing an awakening word of the voice signal.

S520, identifying the pinyin final signal contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal.

In this step, the initial and the final of each pinyin syllable are separated, for example tian -> t, ian; mao -> m, ao. In everyday Chinese speech, the pinyin initial (abbreviated as "initial"), such as t or m, is often a plosive; in the feature spectrogram of a voice signal, the initial appears as a short peak or valley, and essentially all the sustained sound lies in the pinyin final (abbreviated as "final"). Traditional triphone modeling usually combines the preceding and following phones to achieve good recognition accuracy. Because terminal computation takes priority here, this scheme drops recognition of the initial and recognizes only the final signals contained in the first voice signal, thereby obtaining the first final signal sequence corresponding to the first voice signal. The first final signal sequence comprises a time sequence and the final signal at each time point in that sequence.
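
The separation of initial and final described above can be sketched on pinyin text as follows. This is only a text-level illustration, not the signal-level classifier of the embodiment; the function names and the initial inventory (including treating y and w as initials) are assumptions introduced for this example.

```python
# Hypothetical helper: split a toneless pinyin syllable into (initial, final).
# The inventory below, including "y" and "w", is an illustrative assumption.
INITIALS = (
    "zh", "ch", "sh",  # two-letter initials are tried before one-letter ones
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
)


def split_syllable(syllable: str) -> tuple[str, str]:
    """Return (initial, final); the initial is "" for zero-initial syllables."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def finals_of(syllables: list[str]) -> list[str]:
    """Keep only the final of each syllable, preserving order."""
    return [split_syllable(s)[1] for s in syllables]


print(split_syllable("tian"))  # ('t', 'ian')
print(split_syllable("mao"))   # ('m', 'ao')
```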

S530, comparing the first final signal sequence with a second final signal sequence of a preset wake-up word to extract, from the first final signal sequence, a third final signal sequence having the same content as the second final signal sequence.

The traditional awakening word identification method is to detect the whole word, and in order to reduce the calculation amount on the terminal, the scheme only identifies the final part of each word on the terminal, namely, the first final part signal sequence is compared with the second final part signal sequence of the preset awakening word, so as to extract a third final part signal sequence with the same content as the second final part signal sequence from the first final part signal sequence.

For example, assuming that the preset wake-up word is "你好" (nǐ hǎo, "hello"), the corresponding second final signal sequence is "ǐ, ǎo"; a run "ǐ, ǎo" found in the first final signal sequence can then be taken as the third final signal sequence.

S540, performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determining whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

In a practical application scenario, only the finals are verified on the terminal. Syllables such as hǎo ("good"), lǎo ("old"), and kǎo ("examination") share the same final, so all of them are classified successfully and pass as preliminary wake-up word candidates. A second verification therefore needs to be performed in the cloud: Automatic Speech Recognition (ASR) is applied to the full-spelling voice signal (including the initial signals) in the first voice signal that corresponds to the third final signal sequence, to determine whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

The secondary verification only has to filter out the voice signal portions whose initials differ from those of the wake-up word. The benefit is that the device has already filtered out most non-wake-up words, and the cloud only needs to perform the last verification to identify the true wake-up word. This balances computation between the device and the server: accuracy is high, and there is no high latency caused by an oversized on-device model.
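
The division of work between the terminal and the cloud can be sketched as follows. This is a minimal sketch under strong assumptions: each syllable is represented as an (initial, final) pair of strings standing in for the recognized signals, the cloud-side ASR is replaced by an exact comparison, and the function names `terminal_screen`, `cloud_verify`, and `is_wake_word` are hypothetical.

```python
# Two-pass wake-up word check on (initial, final) pairs.
# The pairs stand in for recognized signals; a real system would run a
# finals classifier on the device and ASR in the cloud.
WAKE_WORD = [("n", "i"), ("h", "ao")]  # illustrative wake-up word "ni hao"


def terminal_screen(heard, wake):
    """First pass (device): yield each run whose finals equal the wake word's."""
    wake_finals = [final for _, final in wake]
    width = len(wake_finals)
    if width == 0:
        return
    for start in range(len(heard) - width + 1):
        window = heard[start:start + width]
        if [final for _, final in window] == wake_finals:
            yield window


def cloud_verify(candidate, wake):
    """Second pass (cloud): compare full syllables, initials included."""
    return candidate == wake


def is_wake_word(heard, wake):
    """True if any candidate that passed the device screen is confirmed."""
    return any(cloud_verify(c, wake) for c in terminal_screen(heard, wake))


print(is_wake_word([("q", "ing"), ("n", "i"), ("h", "ao")], WAKE_WORD))
print(is_wake_word([("n", "i"), ("l", "ao")], WAKE_WORD))
```

In the second call the candidate passes the device-side screen, because its finals match, but is rejected by the second pass because the initial l differs from h.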

After a first voice signal to be recognized is acquired, the pinyin final signals contained in the first voice signal are first identified to obtain a first final signal sequence corresponding to the first voice signal. The first final signal sequence is then compared with a second final signal sequence of a preset wake-up word to extract, from the first final signal sequence, a third final signal sequence having the same content as the second final signal sequence. Finally, automatic speech recognition is performed on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, to determine whether the full-spelling voice signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word. In other words, the scheme compares the final signals in the voice signal to be recognized with the finals of the wake-up word, extracts the portion of the voice signal whose final signals are the same as the finals of the wake-up word, and then applies automatic speech recognition to that portion to determine whether it contains the wake-up word. The wake-up word can thus be recognized quickly and accurately, and the wake-up speed of the device is improved.

Example two

Fig. 6 shows a flowchart of a voice wake-up method according to an embodiment of the present invention. On the basis of the method shown in the previous embodiment, a preprocessing step is added, and steps S520 and S530 are refined. As shown in fig. 6, the voice wake-up method includes the following steps:

S610, acquiring a first voice signal, where the first voice signal is, for example, a Chinese voice signal.

Step S610 is the same as step S510.

S620, denoising the first voice signal.

After the first voice signal is acquired, preprocessing such as noise reduction and echo cancellation can be performed on it, so that the proportion of effective signal in the first voice signal is preserved to the greatest extent.

S630, acquiring the feature spectrum of the preprocessed first voice signal.

The feature spectrum is the spectrum signal, meeting certain feature requirements, into which a signal must be converted before classification recognition or classification training is performed.

For example, when performing classification recognition on a first speech signal, after converting the first speech signal to be recognized into a spectrum signal, the audio is cut into frame spectrum signals of, for example, about 20ms according to a fixed time length, and the frame spectrum signals are used as feature spectra during subsequent classification recognition.
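
The fixed-length cutting can be sketched as follows. For simplicity this sketch cuts raw samples rather than a spectrum signal and omits the per-frame spectrum computation; the function name, the 16 kHz sample rate, and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
# Cut a sample sequence into non-overlapping frames of a fixed duration.
# Dropping the trailing partial frame is an illustrative design choice.
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Return consecutive frames of frame_ms milliseconds each."""
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


one_second = [0.0] * 16000  # one second of silence at an assumed 16 kHz
frames = split_into_frames(one_second, sample_rate=16000)
print(len(frames), len(frames[0]))  # 50 320
```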

S640, performing classification calculation on the feature spectrum of the first voice signal by using a final classifier to obtain a first final signal sequence corresponding to the first voice signal.

The final classifier can be a pre-trained speech classification model, except that it classifies only the final signals in the voice signal and outputs the sequence values of the corresponding final signals.

Steps S630 to S640 are the refinements of step S520 described above.

Further, a final classification training method as shown in fig. 7 may be adopted to train and generate the final classifier, where the method includes:

S710, acquiring a feature spectrum of the voice signal used for model training.

S720, marking the pinyin final part signals in the characteristic frequency spectrum.

Generally, because the same final is affected by the accompanying initial signal during pronunciation, its appearance in the feature spectrum is not always exactly the same. Through supervised learning, the characteristic form of the final signal corresponding to each final can be locked quickly and accurately.

S730, taking the labeled pinyin final signals as training samples, and training a joint model that combines a neural network algorithm with connectionist temporal classification to generate the final classifier.

The training process must address two problems: one is how to accurately classify the feature spectrum signals of different finals; the other is how to place each classified final at its correct position in the voice signal.

To address these two problems, a neural network algorithm can be used to classify the feature spectrum signals of different finals accurately, combined with a Connectionist Temporal Classification (CTC) algorithm to lock the correct positions of the classified finals in the voice signal. The two algorithms are modeled jointly, and the final classifier is trained and generated from the training samples.
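
To illustrate how a CTC-style output relates per-frame classes to positions in the signal, the following sketch applies the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) to a sequence of per-frame labels and records the frame at which each final starts. It shows decoding only, not training; the blank symbol, the function name, and the sample labels are assumptions introduced for this example.

```python
# Greedy CTC-style collapse of per-frame labels into (final, start_frame) pairs.
# BLANK and the sample labels are illustrative assumptions.
BLANK = "_"


def ctc_collapse(frame_labels):
    """Merge consecutive repeats, drop blanks, keep each run's first frame index."""
    decoded = []
    previous = None
    for index, label in enumerate(frame_labels):
        if label != previous and label != BLANK:
            decoded.append((label, index))
        previous = label
    return decoded


labels = ["_", "i", "i", "_", "ao", "ao", "ao", "_"]
print(ctc_collapse(labels))  # [('i', 1), ('ao', 4)]
```

A blank between two identical labels keeps them as separate finals, so `["a", "_", "a"]` collapses to two entries rather than one.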

Further, a final classification training method as shown in fig. 8 may also be adopted to train and generate the final classifier, where the method includes:

S810, acquiring a feature spectrum of the voice signal used for model training.

S820, labeling the pinyin final signals in the feature spectrum.

S830, taking the labeled pinyin final signals as training samples, and training a joint hidden Markov model and deep neural network model to generate the final classifier.

To address the two problems described above, a hidden Markov model and a deep neural network (HMM-DNN) can also be modeled jointly, and the final classifier is trained and generated from the training samples.

Different from the prior art, the classifier in the scheme is a rhyme classifier for classifying pinyin rhyme.

S650, using a dynamic time warping algorithm to compare the first final signal sequence with a second final signal sequence of a preset wake-up word in time order, so as to extract, from the first final signal sequence, a third final signal sequence having the same content as the second final signal sequence.

When the first final signal sequence is compared with the second final signal sequence of the preset wake-up word, the two compared signal sequences can be aligned in position by adopting a Dynamic Time Warping (DTW) algorithm, and then the two compared signal sequences are correspondingly compared according to Time sequence, so that a third final signal sequence with the same content as the second final signal sequence is extracted from the first final signal sequence.
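
A minimal sketch of dynamic time warping on sequences of final labels is given below. It uses a 0/1 mismatch cost between labels, which is an assumption made for illustration (the description does not specify a cost function), and the function name is hypothetical. A distance of zero means the two sequences can be aligned so that every aligned pair carries the same final.

```python
# Classic dynamic-programming DTW with a 0/1 cost between labels.
# The cost function is an illustrative assumption.
def dtw_distance(seq_a, seq_b):
    """Return the minimum accumulated mismatch cost over all warping paths."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            mismatch = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            cost[i][j] = mismatch + min(
                cost[i - 1][j],      # seq_a advances, seq_b element is held
                cost[i][j - 1],      # seq_b advances, seq_a element is held
                cost[i - 1][j - 1],  # both advance
            )
    return cost[rows][cols]


heard = ["i", "i", "ao", "ao", "ao"]  # per-frame finals, stretched in time
wake = ["i", "ao"]                    # finals of the wake-up word
print(dtw_distance(heard, wake))  # 0.0
```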

S660, performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determining whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

Step S660 is the same as step S540.

The voice wake-up method of this embodiment extends the first embodiment as follows:

First, after the first voice signal is acquired, it is preprocessed so that the proportion of effective signal in the first voice signal is preserved to the greatest extent.

Second, a pre-trained final classifier is used to identify the pinyin final signals contained in the first voice signal and obtain the first final signal sequence corresponding to the first voice signal, which enables rapid identification. The final classifier is trained either with a joint model that combines a neural network algorithm with connectionist temporal classification, or with a joint hidden Markov model and deep neural network model, so as to ensure the accuracy of the trained final classifier.

Finally, a dynamic time warping algorithm is used to compare the first final signal sequence with the second final signal sequence of the preset wake-up word, so that the third final signal sequence is obtained quickly and accurately.

EXAMPLE III

FIG. 9 is a first block diagram of a voice wake-up apparatus according to an embodiment of the present invention. The apparatus can be disposed in the voice wake-up system shown in FIG. 4 and is configured to perform the method steps shown in FIG. 5. It includes:

a signal obtaining module 910, configured to obtain a first voice signal;

the signal identification module 920 is configured to identify a pinyin final signal included in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal;

a signal comparison module 930, configured to compare the first final signal sequence with a second final signal sequence of a preset wake-up word, so as to extract a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence;

and a speech recognition module 940, configured to perform automatic speech recognition processing on the full-spelling speech signal in the first speech signal corresponding to the third final signal sequence, and determine whether the full-spelling speech signal is a speech signal corresponding to a wakeup word.
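How the four modules cooperate can be sketched as follows. This is only an illustration: the class name `VoiceWakeUpApparatus`, the callable signatures, and the decision to skip speech recognition when no third final signal sequence is found are assumptions, not details given in the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Signal = Sequence[float]
Span = Tuple[int, int]


@dataclass
class VoiceWakeUpApparatus:
    acquire: Callable[[], Signal]                    # signal obtaining module 910
    identify_finals: Callable[[Signal], List[str]]   # signal identification module 920
    compare: Callable[[List[str]], Optional[Span]]   # signal comparison module 930
    recognize: Callable[[Signal, Span], bool]        # speech recognition module 940

    def run(self) -> bool:
        """Return True only if the wake-up word is confirmed."""
        signal = self.acquire()
        finals = self.identify_finals(signal)
        span = self.compare(finals)
        if span is None:
            # Assumption: without a matching final sequence, recognition is not run.
            return False
        return self.recognize(signal, span)
```

Each module is injected as a plain callable so that a classifier, a comparison routine, or a recognizer can be replaced independently.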

Further, as shown in fig. 10, in the voice wake-up apparatus, the signal recognition module 920 may include:

a feature obtaining unit 101, configured to obtain a feature spectrum of a first speech signal;

the signal identification unit 102 is configured to perform classification calculation on the feature spectrum of the first speech signal by using a final classifier to obtain a first final signal sequence corresponding to the first speech signal.

Further, the voice wake-up apparatus shown in fig. 10 may further include:

the preprocessing module 103 is configured to perform denoising preprocessing on the first speech signal.
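The embodiment does not specify the denoising method. As a stand-in, the following sketch applies a simple frame-energy gate that discards frames whose mean energy is far below that of the loudest frame. The function name `energy_gate`, the frame length of 160 samples, and the threshold ratio of 0.1 are assumed values chosen only for illustration.

```python
from typing import List, Sequence


def energy_gate(samples: Sequence[float], frame_length: int = 160,
                threshold_ratio: float = 0.1) -> List[float]:
    """Keep only frames whose mean energy reaches a fraction of the peak frame energy."""
    if frame_length <= 0:
        raise ValueError("frame_length must be positive")
    # Split the signal into consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_length] for i in range(0, len(samples), frame_length)]
    if not frames:
        return []
    energies = [sum(x * x for x in frame) / len(frame) for frame in frames]
    threshold = threshold_ratio * max(energies)
    kept: List[float] = []
    for frame, energy in zip(frames, energies):
        if energy >= threshold:
            kept.extend(frame)
    return kept
```

A practical preprocessing module would more likely use spectral noise suppression; the gate above only shows where such a step sits in the flow.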

Further, the signal comparison module 930 may be specifically configured to:

use a dynamic time warping algorithm to compare the first final signal sequence with the second final signal sequence of the preset wake-up word in time order, so as to extract a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence.

The voice wake-up apparatus of fig. 10 may be used to perform the method steps of fig. 6.

Further, as shown in fig. 11, the voice wake-up apparatus may further include:

a first spectrum obtaining module 111, configured to obtain a feature spectrum of a speech signal for model training;

a first signal labeling module 112, configured to label a pinyin final part signal in the characteristic frequency spectrum;

the first training module 113 is configured to use the labeled pinyin final signals as training samples and to train and generate the final classifier with a joint model of a neural network and connectionist temporal classification (CTC).

Further, as shown in fig. 12, the voice wake-up apparatus may further include:

a second spectrum obtaining module 121, configured to obtain a feature spectrum of a speech signal used for model training;

a second signal labeling module 122, configured to label a pinyin final part signal in the characteristic frequency spectrum;

and the second training module 123 is configured to use the labeled pinyin final signals as training samples and to train and generate the final classifier with a joint model of a hidden Markov model (HMM) and a deep neural network (DNN).

The apparatus of fig. 11 and 12 may be used to perform the method steps of fig. 7 and 8, respectively.
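For a classifier trained with connectionist temporal classification, a common way to read out a label sequence is greedy decoding: take the most probable class in each frame, merge consecutive repeats, and drop the blank class. The sketch below shows this readout on per-frame posteriors. The embodiment does not state which decoding is used, so the function name `ctc_greedy_decode`, the posterior layout, and the blank index are assumptions.

```python
from typing import List, Sequence


def ctc_greedy_decode(posteriors: Sequence[Sequence[float]],
                      labels: Sequence[str], blank: int = 0) -> List[str]:
    """Collapse per-frame class posteriors into a sequence of final labels."""
    decoded: List[str] = []
    previous = blank
    for frame in posteriors:
        # Index of the most probable class in this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit a label only when it is not blank and differs from the previous frame.
        if best != blank and best != previous:
            decoded.append(labels[best])
        previous = best
    return decoded
```

Two identical finals in a row, as in a wake-up word whose last two syllables share a final, are kept apart only if a blank frame lies between them, which is the usual CTC convention.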

In this embodiment, after a first voice signal to be recognized is obtained, the pinyin final signals contained in the first voice signal are identified to obtain a first final signal sequence corresponding to the first voice signal. The first final signal sequence is then compared with a second final signal sequence of a preset wake-up word, so as to extract from the first final signal sequence a third final signal sequence having the same content as the second final signal sequence. Finally, automatic speech recognition is performed on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, to determine whether the full-spelling voice signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word. In other words, the scheme compares the final signals in the voice signal to be recognized with the finals of the wake-up word, extracts the portion of the voice signal whose final signals are the same as the finals of the wake-up word, and then runs automatic speech recognition only on that portion to determine whether it contains the wake-up word. In this way the wake-up word can be recognized quickly and accurately, and the wake-up speed of the device is improved.

Further, after the first voice signal is acquired, it is preprocessed so that the effective portion of the first voice signal is retained to the greatest extent.

Furthermore, a pre-trained final classifier identifies the pinyin final signals contained in the first voice signal to obtain the first final signal sequence corresponding to the first voice signal, which enables rapid identification. The final classifier is trained and modeled either with a joint model of a neural network and connectionist temporal classification (CTC), or with a joint model of a hidden Markov model (HMM) and a deep neural network (DNN), so as to ensure the accuracy of the trained final classifier.

Further, a dynamic time warping (DTW) algorithm is used to compare the first final signal sequence with the second final signal sequence of the preset wake-up word, so that the third final signal sequence is obtained quickly and accurately.

Example four

Based on the above voice wake-up concept, fig. 13 is a flowchart of a voice wake-up method according to an embodiment of the present invention. The method may be executed by modules disposed in the terminal 410 and the server 420 in fig. 4: steps S131 to S133 may be executed locally (on the terminal), and step S134 may be executed in the cloud (on the server). As shown in fig. 13, the voice wake-up method includes the following steps:

S131, acquiring a first voice signal.

In this step, the language type of the first voice signal is not limited; it may be, for example, Chinese, English, or Japanese. The first voice signal may be a voice signal received through a voice device, and the target device is woken up by recognizing the wake-up word in the voice signal.

S132, recognizing the vowel signal included in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal.

Natural speech sounds can be divided by type of pronunciation into vowels and consonants. For example, in Chinese, vowels correspond to the finals in pinyin and consonants correspond to the initials in pinyin. In English, there are 5 vowel letters (a, e, i, o, u) and 21 consonant letters. In Japanese, there are 5 vowels, written with the five kana あ, い, う, え, お, whose pronunciations are close to the phonetic values [a], [i], [ɯ], [e], [o]; the consonants include the consonants of か, さ, た, な, は, ま, や, ら, わ, the "voiced sound" consonants of が, ざ, だ, ば, and the "semi-voiced sound" consonant of ぱ. By identifying the vowel signals contained in a first voice signal of any language type, a first vowel signal sequence corresponding to the first voice signal can be obtained. For example, when the first voice signal is a Chinese voice signal, the first vowel signal sequence corresponding to the first voice signal may be the first final signal sequence in the method shown in fig. 5.
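For Chinese, the correspondence between vowels and pinyin finals can be illustrated by splitting a pinyin syllable into its initial and its final. The sketch below is a simplification under stated assumptions: it works on toneless pinyin text rather than on audio, it treats y and w as initials, and it does not restore ü after j, q, x. The names `INITIALS`, `split_pinyin`, and `finals_of` are hypothetical.

```python
from typing import List, Tuple

# Two-letter initials come first so that "zh" is not mistaken for "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an" or "er"


def finals_of(syllables: List[str]) -> List[str]:
    """Return the final of each syllable, in order."""
    return [split_pinyin(s)[1] for s in syllables]
```

For the hypothetical wake-up word spelled "tian mao jing ling", `finals_of` yields `["ian", "ao", "ing", "ing"]`, which corresponds to the second final signal sequence at the text level.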

S133, comparing the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract a third vowel signal sequence having the same content as the second vowel signal sequence from the first vowel signal sequence.

For example, when the processing object is a Chinese voice signal, the content of step S530 may be executed to compare the first final signal sequence with the second final signal sequence of the wake-up word, so as to extract a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence.

S134, performing automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determining whether the full voice signal is the voice signal corresponding to the wake-up word.

The full voice signal in the first voice signal that corresponds to the third vowel signal sequence is the whole of the voice signal within the interval of the first voice signal covered by the third vowel signal sequence. When the voice signal is a Chinese voice signal, the full voice signal is the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence.
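Taking all the voice signals within the interval covered by the third vowel signal sequence can be sketched as a slice of the sample buffer. The sketch assumes that the classifier works on fixed-size frames, so that a span of frame indices maps to samples through a hop length and a frame length. These framing parameters and the function name `full_signal_for_span` are assumptions, since the embodiment does not describe how the interval is represented.

```python
from typing import Sequence, Tuple


def full_signal_for_span(samples: Sequence[float], frame_span: Tuple[int, int],
                         hop_length: int, frame_length: int) -> Sequence[float]:
    """Return every sample covered by frames [start_frame, end_frame)."""
    start_frame, end_frame = frame_span
    if not 0 <= start_frame < end_frame:
        raise ValueError("frame_span must satisfy 0 <= start < end")
    start = start_frame * hop_length
    # The last frame of the span still extends frame_length samples past its start.
    end = min(len(samples), (end_frame - 1) * hop_length + frame_length)
    return samples[start:end]
```

The returned slice contains initials as well as vowels, which is why the later recognition step sees the complete syllables rather than the vowel segments alone.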

Further, according to the different language types to which the first speech signal belongs, the vowel signal included in the first speech signal may be specifically a speech signal corresponding to a vowel in a monosyllable included in the language type to which the first speech signal belongs.

For example, when the first voice signal is a Chinese voice signal, the vowel signal contained in the first voice signal is the voice signal corresponding to the vowel (final) of a single character in Chinese.

Example five

An embodiment of the present invention provides a voice wake-up apparatus, which may include all the modules shown in fig. 9 and is configured to execute the method steps shown in fig. 13. The apparatus includes:

a signal obtaining module 910, configured to obtain a first voice signal;

a signal identification module 920, configured to identify a vowel signal included in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

a signal comparison module 930, configured to compare the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract a third vowel signal sequence having the same content as the second vowel signal sequence from the first vowel signal sequence;

and a speech recognition module 940, configured to perform automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and to determine whether the full voice signal is the voice signal corresponding to the wake-up word.

Further, the vowel signal included in the first speech signal may be a speech signal corresponding to a vowel in a monosyllable included in a language type to which the first speech signal belongs.

For example, the language type to which the first voice signal belongs may be Chinese, English, Japanese, or the like. When the first voice signal is a Chinese voice signal, the voice wake-up apparatus in this embodiment can execute the method steps shown in fig. 5.

Example six

The embodiment provides a voice wake-up system, including:

a terminal, configured to: acquire a first voice signal (for example, a Chinese voice signal); identify the pinyin final signals contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal; compare the first final signal sequence with a second final signal sequence of a preset wake-up word, so as to extract a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence; and send the full-spelling voice signal corresponding to the third final signal sequence to a server;

and the server, configured to perform automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and to determine whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

Correspondingly, based on the voice wake-up system, this embodiment further provides a voice wake-up method, described in terms of the execution flows of the terminal and the server. The method includes the following steps:

the terminal acquires a first voice signal (for example, a Chinese voice signal); identifies the pinyin final signals contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal; compares the first final signal sequence with a second final signal sequence of a preset wake-up word, so as to extract a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence; and sends the full-spelling voice signal corresponding to the third final signal sequence to the server;

and the server performs automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determines whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

The whole wake-up flow is thus split into two parts. In the first part, the terminal performs a first-pass recognition of the wake-up word by identifying the final signals in the first voice signal. In the second part, the server performs automatic speech recognition on the full-spelling voice signal corresponding to the final signals extracted by the first pass, which completes the determination of whether the voice signal hits the wake-up word. This balances the computation of the wake-up process between the terminal and the server, reduces the computational load on the terminal, and improves the execution efficiency of the whole voice wake-up process.
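The division of work between terminal and server can be sketched with two functions. Everything here is illustrative: the tuple layout (final label plus the start and end sample of the syllable that contains it), the exact-match search used in place of dynamic time warping, the `asr` callable standing in for the server's recognizer, and the function names are all assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (final label, start sample, end sample) of the syllable containing the final.
LabeledFinal = Tuple[str, int, int]


def terminal_prefilter(samples: Sequence[float], finals: List[LabeledFinal],
                       wake_finals: List[str]) -> Optional[Sequence[float]]:
    """Terminal side: return the audio segment to upload, or None if nothing matches."""
    labels = [label for label, _, _ in finals]
    k = len(wake_finals)
    if k == 0:
        return None
    for i in range(len(labels) - k + 1):
        if labels[i:i + k] == wake_finals:
            # Upload the whole interval, initials included, for full recognition.
            return samples[finals[i][1]:finals[i + k - 1][2]]
    return None


def server_confirm(segment: Sequence[float],
                   asr: Callable[[Sequence[float]], str], wake_word: str) -> bool:
    """Server side: run full recognition on the uploaded segment only."""
    return asr(segment) == wake_word
```

Only segments that pass the lightweight check on the terminal are uploaded, so the heavier recognizer on the server runs on a small part of the incoming audio.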

Example seven

The third embodiment above describes the overall architecture of a voice wake-up apparatus, whose functions can be implemented by an electronic device. Fig. 14 is a schematic structural diagram of the electronic device according to an embodiment of the present invention, which specifically includes a memory 141 and a processor 142.

The memory 141 is configured to store a program.

In addition to the above-described programs, the memory 141 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.

The memory 141 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The processor 142 is coupled to the memory 141 and is configured to execute the program in the memory 141 for:

acquiring a first voice signal;

identifying pinyin final signals contained in the first voice signal to obtain a first final signal sequence corresponding to the first voice signal;

comparing the first final signal sequence with a second final signal sequence of a preset wake-up word, so as to extract a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence;

and performing automatic speech recognition on the full-spelling voice signal in the first voice signal that corresponds to the third final signal sequence, and determining whether the full-spelling voice signal is the voice signal corresponding to the wake-up word.

The above specific processing operations have been described in detail in the foregoing embodiments, and are not described again here.

Further, as shown in fig. 14, the electronic device may further include a communication component 143, a power supply component 144, an audio component 145, a display 146, and other components. Fig. 14 shows only some of the components schematically, which does not mean that the electronic device includes only the components shown in fig. 14.

The communication component 143 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 143 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 143 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

A power supply component 144 provides power to the various components of the electronic device. The power components 144 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.

The audio component 145 is configured to output and/or input audio signals. For example, the audio component 145 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 141 or transmitted via the communication component 143. In some embodiments, audio component 145 also includes a speaker for outputting audio signals.

The display 146 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

Example eight

The fifth embodiment above describes the overall architecture of a voice wake-up apparatus, whose functions can be implemented by an electronic device. Fig. 15 is a schematic structural diagram of the electronic device according to an embodiment of the present invention, which specifically includes a memory 151 and a processor 152.

The memory 151 is configured to store a program.

In addition to the above-described programs, the memory 151 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.

The memory 151 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

A processor 152, coupled to the memory 151, for executing programs in the memory 151 for:

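acquiring a first voice signal;

identifying vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

comparing the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract a third vowel signal sequence having the same content as the second vowel signal sequence from the first vowel signal sequence;

and performing automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determining whether the full voice signal is the voice signal corresponding to the wake-up word.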

Further, as shown in fig. 15, the electronic device may further include a communication component 153, a power supply component 154, an audio component 155, a display 156, and other components. Fig. 15 shows only some of the components schematically, which does not mean that the electronic device includes only the components shown in fig. 15.

The communication component 153 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 153 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 153 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

A power supply component 154 provides power to the various components of the electronic device. The power components 154 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.

Audio component 155 is configured to output and/or input audio signals. For example, audio component 155 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 151 or transmitted via the communication component 153. In some embodiments, audio component 155 also includes a speaker for outputting audio signals.

The display 156 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A voice wake-up method, comprising:

acquiring a first voice signal;

2. The method of claim 1, wherein the identifying the pinyin final signal included in the first speech signal to obtain a first final signal sequence corresponding to the first speech signal comprises:

acquiring a characteristic spectrum of the first voice signal;

and classifying and calculating the characteristic frequency spectrum of the first voice signal by adopting a final classifier to obtain the first final signal sequence corresponding to the first voice signal.

3. The method of claim 2, wherein the method further comprises:

acquiring a characteristic spectrum of a voice signal for model training;

marking the pinyin final part signals in the characteristic frequency spectrum;

and taking the marked pinyin final signals as training samples, and training with a joint model of a neural network and connectionist temporal classification to generate the final classifier.

4. The method of claim 2, wherein the method further comprises:

acquiring a characteristic spectrum of a voice signal for model training;

marking the pinyin final part signals in the characteristic frequency spectrum;

and taking the marked pinyin final signals as training samples, and training with a joint model of a hidden Markov model and a deep neural network to generate the final classifier.

5. The method of claim 1, wherein identifying the pinyin final signal included in the first speech signal to obtain a first final signal sequence corresponding to the first speech signal further comprises:

and performing denoising preprocessing on the first voice signal.

6. The method of claim 1, wherein the comparing the first final signal sequence with a second final signal sequence of a preset wake-up word, and the extracting a third final signal sequence having the same content as the second final signal sequence from the first final signal sequence comprises:

7. The method of claim 1, wherein the first voice signal is a Chinese voice signal.

8. A voice wake-up method, comprising:

acquiring a first voice signal;

9. The method according to claim 8, wherein the vowel signal included in the first speech signal is a speech signal corresponding to a vowel in a monosyllable included in a language type to which the first speech signal belongs.

10. A voice wake-up apparatus comprising:

the signal acquisition module is used for acquiring a first voice signal;

11. A voice wake-up apparatus comprising:

the signal acquisition module is used for acquiring a first voice signal;

12. A voice wake-up system comprising:

13. A voice wake-up method, comprising:

14. An electronic device, comprising:

a memory for storing a program;

a processor, coupled to the memory, for executing the program for:

acquiring a first voice signal;

15. An electronic device, comprising:

a memory for storing a program;

a processor, coupled to the memory, for executing the program for:

acquiring a first voice signal;