WO2020073839A1 - Voice wake-up method, apparatus and system, and electronic device

Info

Publication number: WO2020073839A1
Application number: PCT/CN2019/108828
Authority: WO
Inventors: 曹元斌; 张智超; 风翮; 王刚
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2018-10-11
Filing date: 2019-09-29
Publication date: 2020-04-16
Also published as: CN111048068B; CN111048068A

Abstract

A voice wake-up method, apparatus and system, and an electronic device. The method comprises: obtaining a first voice signal (S510); recognizing the pinyin rhyme category signals comprised in the first voice signal to obtain a first rhyme category signal sequence corresponding to the first voice signal (S520); comparing the first rhyme category signal sequence with a second rhyme category signal sequence of a preset wake-up word to extract, from the first rhyme category signal sequence, a third rhyme category signal sequence having the same content as the second rhyme category signal sequence (S530); and performing automatic speech recognition processing on the complete spelling voice signal that corresponds to the third rhyme category signal sequence in the first voice signal, to determine whether the complete spelling voice signal is a voice signal corresponding to the wake-up word (S540). The method can recognize the wake-up word quickly and accurately and improve the speed at which a device is woken up.

Description

Voice wake-up method, device, system and electronic equipment

This application claims priority to Chinese patent application No. 201811186019.4, filed on October 11, 2018 and entitled "Voice wake-up method, device, system and electronic equipment", the entire contents of which are incorporated herein by reference.

Technical field

This specification relates to the field of computer technology, and in particular to a voice wake-up method, device, system, and electronic equipment.

Background technique

As applications related to artificial intelligence continue to develop, voice recognition technology, as the basic interaction method of intelligent devices, plays an increasingly important role. Voice recognition technology covers many aspects, including waking a device through voice commands, controlling the operation of the device, man-machine dialogue with the device, and voice command control of multiple devices. Efficient and accurate voice recognition and a fast, convenient wake-up mode are important development directions for smart devices.

At present, the main performance bottleneck of custom wake-up is that computing resources on the device (terminal device) are limited, and the number of classes that the core classifier uses over the voice features directly affects the speed and accuracy of wake-up. The traditional pinyin-granularity classification strategy classifies by the full spelling of commonly used Chinese characters, which gives more than 1,200 classes with tones and more than 400 classes with tones removed, and can achieve an accuracy of about 80%. Reaching higher accuracy, however, requires more on-device computing performance and a great deal of additional post-processing work.

Summary of the invention

The invention provides a voice wake-up method, device, system and electronic equipment, which can quickly and accurately identify wake-up words and improve the speed of the equipment being woken up.

To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions:

In the first aspect, a voice wake-up method is provided, including:

Acquiring a first voice signal;

Recognizing the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;

Comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence having the same content as the second rhyme signal sequence;

Performing automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.

In the second aspect, another voice wake-up method is provided, including:

Acquiring a first voice signal;

Recognizing the vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

Comparing the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, to extract from the first vowel signal sequence a third vowel signal sequence having the same content as the second vowel signal sequence;

Performing automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.

In a third aspect, a voice wake-up device is provided, including:

The signal acquisition module is used to acquire the first voice signal;

The signal recognition module is used to recognize the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;

The signal comparison module is used to compare the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence having the same content as the second rhyme signal sequence;

The speech recognition module is used to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.

In the fourth aspect, another voice wake-up device is provided, including:

The signal acquisition module is used to acquire the first voice signal;

The signal recognition module is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;

The signal comparison module is used to compare the first vowel signal sequence with the second vowel signal sequence of the preset wake-up word, to extract from the first vowel signal sequence a third vowel signal sequence having the same content as the second vowel signal sequence;

The speech recognition module is used to perform automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.

In a fifth aspect, a voice wake-up system is provided, including:

A terminal, configured to acquire a first voice signal; recognize the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with the second rhyme signal sequence of a preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence having the same content as the second rhyme signal sequence; and send the full spelling speech signal corresponding to the third rhyme signal sequence to the server;

The server, configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.

In a sixth aspect, a voice wake-up method is provided, including:

The terminal acquires the first voice signal; recognizes the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence having the same content as the second rhyme signal sequence; and sends the full spelling speech signal corresponding to the third rhyme signal sequence to the server;

The server performs automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.

According to a seventh aspect, an electronic device is provided, including:

Memory for storing programs;

A processor, coupled to the memory, is used to execute the program for:

Get the first voice signal;

In the eighth aspect, another electronic device is provided, including:

Memory for storing programs;

A processor, coupled to the memory, is used to execute the program for:

Get the first voice signal;

The invention provides a voice wake-up method, device, system and electronic equipment. After the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are recognized first to obtain the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, the third rhyme signal sequence having the same content as the second rhyme signal sequence. Finally, automatic speech recognition processing is performed on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word. In this solution, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the portion of the speech signal whose rhyme part is the same as that of the wake-up word; that portion is then processed by automatic speech recognition to determine whether it contains the wake-up word. This achieves fast and accurate recognition of the wake-up word and improves the speed at which the device is woken up.

The above is only an overview of the technical solutions of this application. So that the technical means of this application can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other purposes, features and advantages of this application are more readily apparent, specific implementations of this application are set out below.

Brief description of the drawings

By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of showing the preferred embodiments, and are not considered to limit the present application. Furthermore, the same reference numerals are used to denote the same parts throughout the drawings. In the drawings:

Figure 1 is a logic schematic diagram of the basic flow of voice wake-up;

Figure 2 is a schematic diagram of the processing logic of the on-device wake-up engine in the basic flow of voice wake-up;

Figure 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention;

Figure 4 is a structural diagram of a voice wake-up system according to an embodiment of the present invention;

Figure 5 is flowchart 1 of a voice wake-up method according to an embodiment of the present invention;

Figure 6 is flowchart 2 of a voice wake-up method according to an embodiment of the present invention;

Figure 7 is flowchart 1 of a rhyme classifier training method according to an embodiment of the present invention;

Figure 8 is flowchart 2 of a rhyme classifier training method according to an embodiment of the present invention;

Figure 9 is structural diagram 1 of a voice wake-up device according to an embodiment of the present invention;

Figure 10 is structural diagram 2 of a voice wake-up device according to an embodiment of the present invention;

Figure 11 is structural diagram 1 of a rhyme classifier training device according to an embodiment of the present invention;

Figure 12 is structural diagram 2 of a rhyme classifier training device according to an embodiment of the present invention;

Figure 13 is flowchart 3 of a voice wake-up method according to an embodiment of the present invention;

Figure 14 is schematic structural diagram 1 of an electronic device according to an embodiment of the present invention;

Figure 15 is schematic structural diagram 2 of an electronic device according to an embodiment of the present invention.

Detailed description

Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

Figure 1 shows the basic flow of voice wake-up. After receiving a voice signal, the voice device first performs signal processing (mainly noise reduction and echo cancellation) and feature extraction on it, converting the original input audio signal into features (that is, the frequency spectrum of the voice) that the wake-up engine on the device (terminal) can recognize. The features are then fed into the wake-up engine, which compares them against the wake-up word. When the wake-up word is hit, the device goes on to instruct the server to execute subsequent instructions, such as playing songs or crosstalk performances.

In the basic flow of voice wake-up shown in FIG. 1, the on-device wake-up engine can be regarded as the core part that performs wake-up. As shown in Figure 2, the on-device wake-up engine mainly includes two parts: a classifier and a post-processing part.

First, the classifier converts continuous speech features into different classes. This computation is usually the most expensive part of all wake-up work, and the number of classes output by the last layer of the neural network directly determines the computation scale of the whole network. The traditional Hidden Markov Model-Deep Neural Network (HMM-DNN) approach models the probability density functions (PDFs) of phones and needs at least 6,000 to 8,000 classes to be usable in production; classification by pinyin also needs from more than 400 to more than 1,200 classes.

Second, post-processing. Wake-up word detection includes a post-processing part. The traditional method detects the entire word: after the classifier output is smoothed, a dynamic time warping (DTW) algorithm can be used to determine whether the speech is the same as the wake-up word; automatic speech recognition (ASR) technology can also be used to determine whether the speech hits the wake-up word.

With the above classifier in the on-device wake-up engine, the large number of classes makes the classification network huge, so the device must be provisioned with high computing performance.

The embodiments of the present invention address the defect in the prior art that a huge classification network requires substantial on-device computing resources for voice wake-up to be accurate and fast. The core idea is to split the core part of voice wake-up into two wake-up word recognition passes. The first pass is completed on the terminal: it classifies and recognizes only the pinyin rhyme part of the wake-up word, completing a preliminary recognition of the speech signal to be recognized. The full voice signal corresponding to the preliminarily selected rhyme signals, which are the same as the rhyme signals of the wake-up word, is then sent to the cloud, and the cloud recognizes the entire voice signal again to determine whether it hits the wake-up word.

FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention. It involves two entities that perform voice wake-up: the device side (a terminal that can receive and recognize voice, such as a smart speaker) and the cloud side (where a server is provided).

On the device side, the speech signal to be recognized first undergoes the first wake-up word recognition. This pass uses a pre-trained classifier to perform rhyme classification only on the rhyme signals of the speech signal. The recognized rhyme signal sequence is then compared, in post-processing, with the rhyme part of the wake-up word to determine whether the voice signal hits the rhyme part of the wake-up word, and the full voice signal that hits the rhyme part of the wake-up word is transmitted to the cloud.

In the cloud, the voice signals to be recognized are the full voice signals whose rhyme signals are the same as the rhyme part of the wake-up word. A second wake-up word recognition (a second check) is performed on these voice signals. This pass recognizes the entire voice signal, for example by using ASR technology to determine whether the voice signal hits the wake-up word.

Based on the above voice wake-up solution idea, FIG. 4 is a structural diagram of a voice wake-up system provided by an embodiment of the present invention. As shown in FIG. 4, the system includes a terminal 410 and a server 420, where:

Terminal 410 includes:

A signal acquisition module, for acquiring a first voice signal, the first voice signal is, for example, a Chinese voice signal;

The signal recognition module is used for recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal;

The signal comparison module is used to compare the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence having the same content as the second rhyme signal sequence;

The server 420 includes:

The speech recognition module is used to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a speech signal corresponding to the wake word.

The technical solutions of the present application are further described below through multiple embodiments.

Example 1

Based on the above voice wake-up solution idea, FIG. 5 is flowchart 1 of a voice wake-up method according to an embodiment of the present invention. The method may be executed by the modules deployed in the terminal 410 and the server 420 in FIG. 4: steps S510 to S530 can be executed on the device side (terminal), and step S540 can be executed in the cloud (server). As shown in FIG. 5, the voice wake-up method includes the following steps:

S510. Acquire a first voice signal, where the first voice signal is, for example, a Chinese voice signal.

The first voice signal may be a voice signal received through the voice device; the wake-up word is recognized from this voice signal in order to wake up the target device.

S520: Identify the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal.

In this step, each pinyin syllable is split into its initial part and its rhyme part, for example tian -> t, ian; mao -> m, ao. In everyday spoken Chinese, initial parts (the "voice part") such as t and m are often plosives. In the characteristic spectrum of the speech signal, the initial part appears as a short-lived peak or trough, and essentially all sustained sound lies in the pinyin rhyme part (the "rhyme part"). Traditional triphone modeling often has to combine the preceding and following phones to reach good recognition accuracy. In this solution, because on-device computation is the priority, recognition of the initial part is dropped, and only the rhyme signals included in the first voice signal are recognized to obtain the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence comprises a time sequence and the rhyme signal located at each time point in that time sequence.
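The split illustrated above (tian -> t, ian; mao -> m, ao) can be sketched on pinyin text as follows. This is only an illustration of the initial/rhyme decomposition, not the acoustic recognizer of the embodiment; the names `INITIALS`, `split_pinyin` and `rhyme_sequence` are hypothetical, tones are ignored, and treating "y" and "w" as initials is a simplifying assumption.

```python
# Pinyin initials, multi-letter ones first so "zh", "ch", "sh" match before "z", "c", "s".
# Treating "y" and "w" as initials is a simplifying assumption of this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable):
    """Split a toneless pinyin syllable into (initial, rhyme)."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "ao" or "er"


def rhyme_sequence(syllables):
    """Keep only the rhyme part of each syllable, in order."""
    return [split_pinyin(s)[1] for s in syllables]
```

For instance, `rhyme_sequence(["tian", "mao"])` keeps only `"ian"` and `"ao"`, which is the information the on-device step works with.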

S530: Compare the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence having the same content as the second rhyme signal sequence.

The traditional wake-up word recognition method detects the entire word. To reduce the amount of computation on the device, this solution recognizes only the rhyme part of each word on the device. That is, the first rhyme signal sequence is compared with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence having the same content as the second rhyme signal sequence.

For example, suppose the preset wake-up word is "hello" (nǐ hǎo), so that the corresponding second rhyme signal sequence is "ǐ, ǎo". Any signal sequence within the first rhyme signal sequence that reads "ǐ, ǎo" can then serve as a third rhyme signal sequence.
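As a rough sketch of this comparison, the function below scans a time-stamped rhyme sequence for runs equal to the wake-up word's rhyme sequence and returns the time span of each hit. The function name, the `(time, rhyme)` pair layout and the tone-digit notation ("i3" for "ǐ", "ao3" for "ǎo") are assumptions made for illustration; exact matching is used here in place of whatever tolerance a real post-processor would apply.

```python
def find_rhyme_matches(first_seq, wake_seq):
    """Return (start_time, end_time) for every run in first_seq equal to wake_seq.

    first_seq: list of (time_in_seconds, rhyme) pairs in time order.
    wake_seq:  list of rhymes of the preset wake-up word.
    """
    n = len(wake_seq)
    matches = []
    if n == 0:
        return matches
    rhymes = [rhyme for _, rhyme in first_seq]
    for i in range(len(rhymes) - n + 1):
        if rhymes[i:i + n] == wake_seq:
            # Span from the first matched rhyme to the last matched rhyme.
            matches.append((first_seq[i][0], first_seq[i + n - 1][0]))
    return matches


# Hypothetical first rhyme signal sequence; tone digits stand in for tone marks.
first = [(0.2, "a1"), (0.5, "i3"), (0.8, "ao3"), (1.3, "ian1"), (1.6, "ao1")]
print(find_rhyme_matches(first, ["i3", "ao3"]))  # [(0.5, 0.8)]
```

The returned time span is what identifies which portion of the original audio would be forwarded for the second check.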

S540. Perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.

In actual application scenarios, only the rhyme part is verified on the device. Syllables such as "good" (hǎo), "old" (lǎo) and "examination" (kǎo) share the same rhyme part, so they all pass the classification and may be taken as a preliminary wake-up word result of rhyme recognition. A second verification is therefore needed in the cloud: automatic speech recognition (ASR) processing is performed on the full spelling speech signal (which includes the initial-part signal) corresponding to the third rhyme signal sequence in the first voice signal, to determine whether the full spelling speech signal is the voice signal corresponding to the wake-up word.

This second verification filters out voice signals whose initial part differs from that of the wake-up word. The advantage is that most non-wake-up words are filtered out on the device, and the cloud only needs to perform the final verification to identify the real wake-up word. Computation is thus balanced between the device and the server, which gives high accuracy without the high latency that a large on-device model would cause.

In the voice wake-up method provided by the present invention, after the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are recognized first to obtain the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence having the same content as the second rhyme signal sequence. Finally, automatic speech recognition processing is performed on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word. In this solution, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the portion of the speech signal whose rhyme part is the same as that of the wake-up word; that portion is then processed by automatic speech recognition to determine whether it contains the wake-up word. This achieves fast and accurate recognition of the wake-up word and improves the speed at which the device is woken up.
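The division of labor between the device and the cloud can be sketched as a two-pass filter. Everything in this block is hypothetical: the function names, the tone-digit pinyin strings, and the use of plain string comparison as a stand-in for server-side ASR. In a real system the device would hold only audio and rhyme labels; the full pinyin in each pair below merely stands in for the audio segment that would be sent to the server.

```python
WAKE_WORD = ["ni3", "hao3"]   # full pinyin of the wake-up word (assumed notation)
WAKE_RHYMES = ["i3", "ao3"]   # its rhyme sequence, the only thing checked on the device


def device_first_pass(rhymes):
    """Cheap on-device check: do the recognized rhymes equal the wake-up word's rhymes?"""
    return rhymes == WAKE_RHYMES


def cloud_second_pass(full_pinyin):
    """Stand-in for server-side ASR: compare full syllables, initials included."""
    return full_pinyin == WAKE_WORD


def wake(candidate):
    """candidate: list of (full_pinyin, rhyme) pairs for one utterance segment."""
    rhymes = [rhyme for _, rhyme in candidate]
    if not device_first_pass(rhymes):
        return "rejected on device"   # never leaves the terminal
    full = [pinyin for pinyin, _ in candidate]
    return "wake" if cloud_second_pass(full) else "rejected in cloud"
```

With this sketch, a segment such as `[("li3", "i3"), ("lao3", "ao3")]` passes the rhyme check but is rejected by the second pass, while most unrelated speech is discarded before any data is sent.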

Example 2

FIG. 6 is flowchart 2 of a voice wake-up method according to an embodiment of the present invention. On the basis of the method shown in the previous embodiment, a preprocessing step is added, and steps S520 and S530 are refined. As shown in FIG. 6, the voice wake-up method includes the following steps:

S610: Acquire a first voice signal, where the first voice signal is, for example, a Chinese voice signal.

The content of step S610 is the same as that of step S510.

S620: Perform denoising pre-processing on the first voice signal.

After the first voice signal is acquired, pre-processing such as noise reduction and echo cancellation may be applied to it so as to retain as much of the effective signal in the first voice signal as possible.

S630: Obtain the characteristic spectrum of the preprocessed first voice signal.

Here, the characteristic spectrum is the spectrum signal, meeting certain feature requirements, into which the voice signal to be processed must be converted before classification recognition or classification training.

For example, when the first voice signal is classified and recognized, the first voice signal to be recognized is converted into a spectrum signal, and the audio is cut at a fixed time length into per-frame spectrum signals of about 20 ms each, which serve as the characteristic spectrum for the subsequent classification recognition.
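A minimal sketch of this framing step, assuming non-overlapping 20 ms frames and a plain magnitude spectrum per frame, is given below. The source does not specify overlap, windowing or the exact feature type, so those choices, along with the function names, are assumptions. A naive DFT is used only to keep the example dependency-free; a real front end would use an FFT.

```python
import cmath
import math


def split_frames(samples, sample_rate, frame_ms=20):
    """Cut samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    count = len(samples) // frame_len  # a trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2); an FFT would be used in practice)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def feature_spectrum(samples, sample_rate, frame_ms=20):
    """One magnitude spectrum per 20 ms frame, in time order."""
    return [magnitude_spectrum(f) for f in split_frames(samples, sample_rate, frame_ms)]
```

At a 16 kHz sample rate each frame holds 320 samples, so adjacent spectrum bins are 50 Hz apart.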

S640: Perform classification calculation on the characteristic spectrum of the first speech signal by using a rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal.

The rhyme classifier may be a speech classification model generated in advance; this model classifies only the rhyme signals in the voice signal and outputs the sequence values of the corresponding rhyme signals.

Steps S630 to S640 are refinements of the above step S520.

Further, the rhyme classifier training method shown in FIG. 7 may be used to train and generate the above rhyme classifier. The method includes:

S710. Acquire a characteristic spectrum of a voice signal used for model training.

S720, annotate the pinyin rhyme signal in the characteristic spectrum.

In general, the same rhyme part is affected by the initial part during pronunciation, so its appearance in the characteristic spectrum is not always exactly the same. Supervised learning makes it possible to quickly and accurately pin down the features of the rhyme signals corresponding to different rhyme parts.

S730. Using the annotated pinyin rhyme signals as training samples, train and generate the rhyme classifier with a joint model that combines a neural network algorithm with connectionist temporal classification.

The training process mainly involves two tasks: one is how to accurately classify the characteristic spectrum signals of different rhyme parts; the other is how to place the classified rhyme parts at the correct positions in the speech signal.

To solve these two problems, a neural network algorithm can be used to accurately classify the characteristic spectrum signals of different rhyme parts, combined with the connectionist temporal classification (CTC) algorithm to locate the correct position of each classified rhyme part in the voice signal. The two algorithms are modeled jointly to generate the rhyme classifier from the training samples.
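To illustrate how a CTC-style output ties rhyme labels to positions in time, the sketch below performs greedy CTC decoding on per-frame labels: repeated labels are collapsed, blanks are dropped, and the frame index of each surviving label gives its start time. This covers only the decoding side, not training; the blank symbol, the function name and the 20 ms frame length are assumptions for the example.

```python
BLANK = "-"  # CTC blank symbol (the symbol chosen here is arbitrary)


def ctc_greedy_decode(frame_labels, frame_ms=20):
    """Collapse per-frame labels into (start_time_in_seconds, rhyme) pairs.

    frame_labels: the most likely label for each frame, as a CTC-trained
    network might emit, e.g. ["-", "i", "i", "-", "ao", "ao"].
    """
    decoded = []
    previous = BLANK
    for index, label in enumerate(frame_labels):
        # Emit a label only when it starts a new run and is not blank.
        if label != BLANK and label != previous:
            decoded.append((index * frame_ms / 1000, label))
        previous = label
    return decoded
```

The output has the same shape as the first rhyme signal sequence described earlier: a rhyme label attached to each time point at which a rhyme begins.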

Further, the rhyme classifier training method shown in FIG. 8 may also be used to train and generate the above rhyme classifier. The method includes:

S810. Acquire a characteristic spectrum of a voice signal used for model training.

S820, annotate the pinyin rhyme signal in the characteristic spectrum.

S830. Using the annotated pinyin rhyme signals as training samples, train and generate the rhyme classifier with a joint hidden Markov model and deep neural network model.

To solve the two problems described above, a hidden Markov model and a deep neural network (HMM-DNN) can also be modeled jointly to generate the rhyme classifier from the training samples.

Unlike the prior art, the classifier in this solution classifies only the pinyin rhyme part.

S650. Use a dynamic time warping algorithm to compare, in time order, the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence having the same content as the second rhyme signal sequence.

When the first rhyme signal sequence is compared with the second rhyme signal sequence of the preset wake-up word, the dynamic time warping (DTW) algorithm can be used to align the two signal sequences; the comparison is then performed according to the resulting timing correspondence to extract, from the first rhyme signal sequence, the third rhyme signal sequence having the same content as the second rhyme signal sequence.
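A minimal sketch of DTW over rhyme labels follows, assuming a 0/1 local cost (0 when two labels match, 1 otherwise). The cost definition and the function name are illustrative choices, not taken from the source. Because DTW lets one element of a sequence align with several consecutive elements of the other, a rhyme held over many frames still aligns with a single rhyme of the wake-up word at zero cost.

```python
def dtw_distance(seq_a, seq_b):
    """Classic DTW with a 0/1 cost: 0 when two rhyme labels match, 1 otherwise."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] = best alignment cost of seq_a[:i] against seq_b[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            cost[i][j] = step + min(cost[i - 1][j],      # seq_a advances alone
                                    cost[i][j - 1],      # seq_b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]
```

Under these assumptions, a distance of 0 between a candidate segment and the wake-up word's rhyme sequence means the two have the same content once differences in duration are warped away.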

S660: Perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.

Step S660 is the same as step S540.

The voice wake-up method provided by the present invention is expanded on the basis of the first embodiment:

First, after the first voice signal is acquired, it is preprocessed to retain as much of the effective signal in the first voice signal as possible.

Second, the pinyin rhyme signals contained in the first voice signal are recognized by a pre-trained rhyme classifier to obtain the first rhyme signal sequence corresponding to the first voice signal, which enables rapid recognition. The rhyme classifier is trained and modeled either with a joint model combining a neural network algorithm and connectionist temporal classification, or with a joint hidden Markov model and deep neural network model, to ensure the accuracy of the trained rhyme classifier.

Finally, a dynamic time warping algorithm is used to compare the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word, so that the third rhyme signal sequence is obtained quickly and accurately.

Example 3

FIG. 9 is structural diagram 1 of a voice wake-up device according to an embodiment of the present invention. The voice wake-up device may be provided in the voice wake-up system shown in FIG. 4 and is used to perform the method steps shown in FIG. 5. The device includes:

The signal obtaining module 910 is used to obtain a first voice signal;

The signal recognition module 920 is used for recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal;

The signal comparison module 930 is configured to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;

The speech recognition module 940 is configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a speech signal corresponding to the wake word.
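The division into modules 910 to 940 can be pictured with the following minimal sketch. It is not the device's actual implementation: the rhyme recognizer and the speech recognizer are injected as placeholder callables, each recognized rhyme is assumed to carry its sample interval, and the comparison uses a plain exact match on labels. All class, method, and parameter names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A recognized rhyme: (label, first_sample, end_sample) within the signal.
Rhyme = Tuple[str, int, int]


class VoiceWakeUpDevice:
    def __init__(self,
                 recognize_rhymes: Callable[[Sequence[float]], List[Rhyme]],
                 run_asr: Callable[[Sequence[float]], str],
                 wake_word: str,
                 wake_rhymes: List[str]) -> None:
        self._recognize_rhymes = recognize_rhymes  # stands in for module 920
        self._run_asr = run_asr                    # stands in for module 940
        self.wake_word = wake_word
        self.wake_rhymes = wake_rhymes             # second rhyme sequence

    def obtain_signal(self, source: Sequence[float]) -> List[float]:
        """Module 910: obtain the first voice signal."""
        return list(source)

    def compare(self, rhymes: List[Rhyme]) -> Optional[Tuple[int, int]]:
        """Module 930: locate the wake-up word's rhymes, return a sample span."""
        labels = [r[0] for r in rhymes]
        k = len(self.wake_rhymes)
        for i in range(len(labels) - k + 1):
            if labels[i:i + k] == self.wake_rhymes:
                return rhymes[i][1], rhymes[i + k - 1][2]
        return None

    def process(self, source: Sequence[float]) -> bool:
        signal = self.obtain_signal(source)
        span = self.compare(self._recognize_rhymes(signal))
        if span is None:
            return False  # no rhyme match, so recognition is skipped
        start, end = span
        return self._run_asr(signal[start:end]) == self.wake_word
```

In this arrangement the comparatively expensive speech recognition step runs only when the rhyme comparison has already found a candidate segment.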

Further, as shown in FIG. 10, in the above voice wake-up device, the signal recognition module 920 may include:

The feature obtaining unit 101 is used to obtain a feature spectrum of the first voice signal;

The signal recognition unit 102 is configured to perform classification calculation on the feature spectrum of the first speech signal by using a rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal.

Further, the voice wake-up device shown in FIG. 10 may further include:

The pre-processing module 103 is used to perform pre-processing for denoising the first speech signal.

Further, the above-mentioned signal comparison module 930 may be specifically configured to:

use a dynamic time warping algorithm to compare, in time order, the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence.

The voice wake-up device shown in FIG. 10 can be used to perform the method steps shown in FIG. 6.

Further, as shown in FIG. 11, the above voice wake-up device may further include:

The first spectrum acquisition module 111 is used to acquire the characteristic spectrum of the speech signal used for model training;

The first signal labeling module 112 is used to label the pinyin rhyme signal in the characteristic spectrum;

The first training module 113 is configured to use the labeled pinyin rhyme signals as training samples, and to train and generate the rhyme classifier using a joint model algorithm that combines a neural network with connectionist temporal classification (CTC).

Further, as shown in FIG. 12, the foregoing voice wake-up device may further include:

The second spectrum acquisition module 121 is used to acquire the characteristic spectrum of the speech signal used for model training;

The second signal labeling module 122 is used to label the pinyin rhyme signal in the feature spectrum;

The second training module 123 is configured to use the marked Pinyin rhyme signal as a training sample, and use a hidden Markov model and a deep neural network joint model algorithm to train and generate the rhyme classifier.

The devices shown in FIGS. 11 and 12 can be used to correspondingly execute the method steps shown in FIGS. 7 and 8.

The voice wake-up device provided by the present invention, after acquiring the first voice signal to be recognized, first recognizes the pinyin rhyme signal included in the first voice signal to obtain the first rhyme signal sequence corresponding to the first voice signal; then compares the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and finally performs automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a speech signal corresponding to the wake-up word, and thereby whether the first voice signal includes the wake-up word. In this solution, the rhyme signal in the speech signal to be recognized is first compared with the rhyme part of the wake-up word to extract the portion of the speech signal whose rhyme part matches that of the wake-up word; automatic speech recognition is then applied only to that portion to determine whether it contains the wake-up word. This achieves fast and accurate recognition of the wake-up word and improves the speed at which the device is woken up.

Further, after the first speech signal is acquired, it is pre-processed so that the proportion of effective signal in the first speech signal is preserved to the greatest extent.

Further, the pinyin rhyme signal included in the first speech signal is recognized by a pre-trained rhyme classifier to obtain the first rhyme signal sequence corresponding to the first speech signal, which enables rapid recognition. The rhyme classifier is trained and modeled either with a joint model algorithm that combines a neural network with connectionist temporal classification (CTC), or with a joint model algorithm that combines a hidden Markov model with a deep neural network, to ensure the accuracy of the trained rhyme classifier.

Further, a dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to quickly and accurately obtain the third rhyme signal sequence.

Example 4

Based on the above voice wake-up approach, FIG. 13 is a third flowchart of the voice wake-up method according to an embodiment of the present invention. The method may be executed by modules deployed in the terminal 410 and the server 420 in FIG. 4: steps S131 to S133 can be executed on the terminal side, and step S134 can be executed in the cloud (on the server). As shown in FIG. 13, the voice wake-up method includes the following steps:

S131. Acquire a first voice signal.

In this step, the language type of the first voice signal is not limited; for example, it may be Chinese, English, Japanese, and so on. The first voice signal may be a voice signal received through a voice device, on which wake-up word recognition is performed in order to wake up the target device.

S132: Identify the vowel signal included in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal.

By phonological category, the sounds of natural speech can be divided into vowels and consonants. In Chinese, for example, vowels correspond to the rhymes (finals) in Pinyin and consonants correspond to the initials in Pinyin. English has 5 vowel letters (a, e, i, o, u) and 21 consonant letters. Japanese has 5 vowels, written with the five kana "あ · い · う · え · お", whose pronunciation is phonologically close to [a], [i], [ɯ], [e], [o]; its consonants comprise the unvoiced consonants of the "か · さ · た · な · は · ま · や · ら · わ" rows, the voiced consonants of the "が · ざ · だ · ば" rows, and the semi-voiced consonants of the "ぱ" row. The first vowel signal sequence corresponding to the first voice signal can be obtained by identifying the vowel signal contained in a first voice signal of any language type. For example, when the first voice signal is a Chinese speech signal, the first vowel signal sequence corresponding to the first voice signal may be the first rhyme signal sequence in the method shown in FIG. 5.
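For the Chinese case, the correspondence between a wake-up word and its rhyme sequence can be illustrated at the text level. The sketch below splits toneless Pinyin syllables into an initial and a rhyme by longest-prefix matching against a list of initials. Treating "y" and "w" as initials is a simplifying assumption, and the sample phrase "ni hao xiao ming" is a hypothetical wake-up word, not one named in this specification.

```python
from typing import List, Tuple

# Two-letter initials come first so that "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Return (initial, rhyme); the initial is empty for syllables like 'an'."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def rhyme_sequence(pinyin: str) -> List[str]:
    """Rhyme parts of a space-separated, toneless Pinyin phrase."""
    return [split_syllable(s)[1] for s in pinyin.split()]


second_rhyme_sequence = rhyme_sequence("ni hao xiao ming")
```

A sequence obtained in this way could serve as the preset second rhyme sequence against which the recognized first sequence is compared.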

S133. Compare the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word to extract, from the first vowel signal sequence, a third vowel signal sequence with the same content as the second vowel signal sequence.

For example, when the processing object is a Chinese speech signal, the content of step S530 may be performed: the first rhyme signal sequence is compared with the second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence.

S134: Perform automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.

Here, the full voice signal corresponding to the third vowel signal sequence in the first voice signal consists of all voice signals in the first voice signal that fall within the interval spanned by the third vowel signal sequence. When the first voice signal is a Chinese voice signal, the full voice signal is the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal.
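The interval relationship described above can be made concrete with a small sketch that maps the frame range covered by the matched vowel sequence back to a sample range of the first voice signal. The framing parameters (a 16 kHz sampling rate, a 25 ms frame, and a 10 ms hop) and the function name `segment_bounds` are assumptions chosen for illustration only.

```python
from typing import Tuple


def segment_bounds(first_frame: int, last_frame: int,
                   hop_samples: int, frame_samples: int,
                   total_samples: int) -> Tuple[int, int]:
    """Sample range [start, end) covering frames first_frame..last_frame."""
    start = first_frame * hop_samples
    # The last frame extends frame_samples past its own start; clamp to the signal.
    end = min(last_frame * hop_samples + frame_samples, total_samples)
    return start, end


# Assumed framing: 16 kHz audio, 25 ms frames (400 samples), 10 ms hop (160 samples).
start, end = segment_bounds(30, 95, hop_samples=160, frame_samples=400,
                            total_samples=32000)
```

Everything between `start` and `end` is passed on, including the consonant portions that lie between the matched vowels, which is why the signal handed to speech recognition is the full one rather than the vowels alone.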

Further, according to different language types to which the first voice signal belongs, the vowel signal included in the first voice signal may specifically be a voice signal corresponding to a vowel in a single syllable included in the language type to which the first voice signal belongs.

For example, when the first speech signal is a Chinese speech signal, the vowel signal included in the first speech signal is the speech signal corresponding to the rhyme part of a single Chinese character (one syllable).

Example 5

An embodiment of the present invention provides a voice wake-up device. The voice wake-up device may include all the modules shown in FIG. 9 for performing the method steps shown in FIG. 13, which include:

The signal obtaining module 910 is used to obtain a first voice signal;

The signal recognition module 920 is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;

The signal comparison module 930 is used to compare the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word to extract, from the first vowel signal sequence, a third vowel signal sequence with the same content as the second vowel signal sequence;

The voice recognition module 940 is configured to perform automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal, and determine whether the full voice signal is a voice signal corresponding to the wake-up word.

Further, the vowel signal included in the first voice signal may be a voice signal corresponding to a vowel in a monosyllable included in the language type to which the first voice signal belongs.

For example, the language type to which the first voice signal belongs may include: Chinese, English, Japanese, and so on. When the first voice signal is a Chinese voice signal, the voice wake-up device in this embodiment may perform the method steps shown in FIG. 5.

Example 6

This embodiment provides a voice wake-up system, including:

The terminal is used to: obtain a first voice signal, for example a Chinese voice signal; identify the pinyin rhyme signal included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and send the full spelling speech signal corresponding to the third rhyme signal sequence to the server;

The server is configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a speech signal corresponding to the wake word.

Correspondingly, based on the above voice wake-up system, this embodiment also provides a voice wake-up method, described in terms of the execution flow on the terminal side and on the server side. The method includes:

The terminal acquires a first voice signal, for example a Chinese voice signal; recognizes the pinyin rhyme signal included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and sends the full spelling speech signal corresponding to the third rhyme signal sequence to the server;

The server performs automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determines whether the full spelling speech signal is a speech signal corresponding to the wake word.

The entire wake-up process is thus split into two parts. In the first part, the terminal performs an initial recognition of the wake-up word by recognizing the rhyme signal in the first voice signal. In the second part, the server performs automatic speech recognition on the full spelling speech signal corresponding to the rhyme signal extracted by the initial recognition, thereby completing the determination of whether the speech signal hits the wake-up word. This approach balances the computation of the entire wake-up process between the terminal and the server, reduces the computational load on the terminal, and improves the execution efficiency of the entire voice wake-up process.
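The two-part split can be sketched as two functions, one per side. This is an illustrative outline under stated assumptions: the rhyme matcher and the speech recognizer are passed in as placeholder callables, the transport between terminal and server is reduced to a direct function call, and all names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

SpanFinder = Callable[[Sequence[float]], Optional[Tuple[int, int]]]
Recognizer = Callable[[Sequence[float]], str]


def terminal_side(signal: Sequence[float],
                  find_rhyme_span: SpanFinder) -> Optional[List[float]]:
    """First part: rhyme-based screening; returns the segment to upload, if any."""
    span = find_rhyme_span(signal)
    if span is None:
        return None  # nothing is sent to the server
    start, end = span
    return list(signal[start:end])


def server_side(segment: Sequence[float], run_asr: Recognizer,
                wake_word: str) -> bool:
    """Second part: full speech recognition on the uploaded segment only."""
    return run_asr(segment) == wake_word


def wake_up(signal: Sequence[float], find_rhyme_span: SpanFinder,
            run_asr: Recognizer, wake_word: str) -> bool:
    segment = terminal_side(signal, find_rhyme_span)
    if segment is None:
        return False
    return server_side(segment, run_asr, wake_word)
```

Under this arrangement, audio that fails the rhyme comparison never leaves the terminal, and the server processes only the short candidate segment.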

Example 7

The foregoing Embodiment 3 describes the overall architecture of a voice wake-up device. The functions of the device can be implemented by an electronic device. FIG. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which specifically includes a memory 141 and a processor 142.

The memory 141 is used to store programs.

In addition to the above-mentioned programs, the memory 141 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.

The memory 141 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.

The processor 142, coupled to the memory 141, is used to execute the program in the memory 141 for:

Get the first voice signal;

Recognize the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;

Comparing the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence;

Automatic speech recognition processing is performed on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a speech signal corresponding to the wake-up word.

The above specific processing operations have been described in detail in the previous embodiments, and will not be repeated here.

Further, as shown in FIG. 14, the electronic device may further include: a communication component 143, a power component 144, an audio component 145, a display 146, and other components. FIG. 14 shows only some components schematically, which does not mean that the electronic device includes only the components shown in FIG. 14.

The communication component 143 is configured to facilitate wired or wireless communication between the electronic device and other devices. Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 143 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 143 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component 144 provides power for various components of the electronic device. The power component 144 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices.

The audio component 145 is configured to output and / or input audio signals. For example, the audio component 145 includes a microphone (MIC). When the electronic device is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signal may be further stored in the memory 141 or transmitted via the communication component 143. In some embodiments, the audio component 145 further includes a speaker for outputting audio signals.

The display 146 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.

Example 8

The foregoing Embodiment 5 describes the overall architecture of a voice wake-up device. The functions of the device can be implemented by an electronic device. FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which specifically includes a memory 151 and a processor 152.

The memory 151 is used to store programs.

In addition to the above programs, the memory 151 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.

The memory 151 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.

The processor 152, coupled to the memory 151, is used to execute the program in the memory 151 for:

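Get the first voice signal;

Identify the vowel signal contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;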

Comparing the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word to extract a third vowel signal sequence with the same content as the second vowel signal sequence from the first vowel signal sequence;

Automatic speech recognition processing is performed on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.

Further, as shown in FIG. 15, the electronic device may further include: a communication component 153, a power component 154, an audio component 155, a display 156, and other components. FIG. 15 only schematically shows some components, which does not mean that the electronic device includes only the components shown in FIG. 15.

The communication component 153 is configured to facilitate wired or wireless communication between the electronic device and other devices. Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 153 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 153 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

The power supply component 154 provides power for various components of the electronic device. The power supply component 154 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic devices.

The audio component 155 is configured to output and / or input audio signals. For example, the audio component 155 includes a microphone (MIC). When the electronic device is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signal may be further stored in the memory 151 or transmitted via the communication component 153. In some embodiments, the audio component 155 further includes a speaker for outputting audio signals.

The display 156 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.

Those of ordinary skill in the art may understand that all or part of the steps of the foregoing method embodiments may be completed by a program instructing relevant hardware. The aforementioned program may be stored in a computer-readable storage medium. When the program is executed, the steps including the foregoing method embodiments are executed; and the foregoing storage medium includes various media that can store program codes, such as ROM, RAM, magnetic disk, or optical disk.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

A voice wake-up method, including:

Get the first voice signal;

Recognizing the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;

Comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;

Performing automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.
The method according to claim 1, wherein the recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal includes:

Acquiring the characteristic spectrum of the first speech signal;

The characteristic spectrum of the first speech signal is classified by a rhyme classifier to obtain the first rhyme signal sequence corresponding to the first speech signal.
The method of claim 2, wherein the method further comprises:

Obtain the characteristic spectrum of the speech signal used for model training;

Annotate the pinyin rhyme signal in the characteristic spectrum;

Using the labeled pinyin rhyme signals as training samples, a joint model algorithm that combines a neural network with connectionist temporal classification (CTC) is used to train and generate the rhyme classifier.
The method of claim 2, wherein the method further comprises:

Obtain the characteristic spectrum of the speech signal used for model training;

Annotate the pinyin rhyme signal in the characteristic spectrum;

Taking the marked pinyin rhyme signals as training samples, a hidden Markov model and deep neural network combined model algorithm are used to train and generate rhyme classifiers.
The method according to claim 1, wherein before recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal, the method further comprises:

Performing denoising pre-processing on the first speech signal.
The method according to claim 1, wherein the comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence, includes:

Using a dynamic time warping algorithm to compare, in time order, the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence.
The method according to claim 1, wherein the first voice signal is a Chinese voice signal.
A voice wake-up method, including:

Get the first voice signal;

Identify the vowel signal contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

Comparing the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word, to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;

Performing automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.
The method according to claim 8, wherein the vowel signal included in the first speech signal is a speech signal corresponding to a vowel in a single syllable included in the language type to which the first speech signal belongs.
A voice wake-up device, including:

The signal acquisition module is used to acquire the first voice signal;

The signal recognition module is used to recognize the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;

The signal comparison module is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;

The speech recognition module is used to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.
A voice wake-up device, including:

The signal acquisition module is used to acquire the first voice signal;

The signal recognition module is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;

The signal comparison module is used to compare the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word, to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;

A voice recognition module, configured to perform automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal, and determine whether the full voice signal is a voice signal corresponding to the wake-up word.
A voice wake-up system, including:

A terminal for: acquiring a first speech signal; identifying the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal; comparing the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and sending the full spelling speech signal corresponding to the third rhyme signal sequence to the server;

The server is configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a voice signal corresponding to the wake-up word.
A voice wake-up method, including:

The terminal acquires the first voice signal; recognizes the pinyin rhyme signal included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and sends the full spelling speech signal corresponding to the third rhyme signal sequence to the server;

The server performs automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a speech signal corresponding to the wake-up word.
An electronic device, including:

Memory for storing programs;

A processor, coupled to the memory, is used to execute the program for:

Get the first voice signal;

Recognizing the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;

Comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;

Automatic speech recognition processing is performed on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full spelling speech signal is a speech signal corresponding to the wake-up word.
An electronic device, including:

Memory for storing programs;

A processor, coupled to the memory, is used to execute the program for:

Get the first voice signal;

Identify the vowel signal contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;

Comparing the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word, to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;

Performing automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.