WO2020073839A1 - Voice wake-up method, apparatus and system, and electronic device - Google Patents

Publication number
WO2020073839A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
rhyme
signal sequence
voice
wake
Prior art date
Application number
PCT/CN2019/108828
Other languages
French (fr)
Chinese (zh)
Inventor
曹元斌
张智超
风翮
王刚
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020073839A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Definitions

  • This specification relates to the field of computer technology, and in particular to a voice wake-up method, device, system, and electronic equipment.
  • As the basic interaction method of intelligent devices, voice recognition technology plays an increasingly important role.
  • Voice recognition technology involves many aspects, including waking the device through voice commands, controlling the operation of the device, man-machine dialogue with the device, and voice command control of multiple devices.
  • Efficient and accurate voice recognition and a fast, convenient wake-up mode are important development directions for smart devices.
  • The main performance bottleneck of custom wake-up is that computing resources on the terminal (the device side) are limited, and the number of categories output by the core classifier over the voice features directly affects the speed and accuracy of wake-up.
  • The traditional pinyin-granularity classification strategy classifies the full pinyin spellings of commonly used Chinese characters: more than 1,200 categories with tones, or more than 400 with tones removed, which can achieve an accuracy of about 80%.
  • Improving on this requires higher on-device computing performance and a great deal of post-processing work.
  • The invention provides a voice wake-up method, device, system and electronic equipment, which can quickly and accurately identify wake-up words and increase the speed at which the equipment is woken up.
  • a voice wake-up method including:
  • another voice wake-up method including:
  • a voice wake-up device including:
  • the signal acquisition module is used to acquire the first voice signal;
  • the signal recognition module is used to recognize the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
  • the signal comparison module is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the speech recognition module is used to perform automatic speech recognition processing on the full-pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal, to determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
  • another voice wake-up device including:
  • the signal acquisition module is used to acquire the first voice signal;
  • the signal recognition module is used to recognize the vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
  • the signal comparison module is used to compare the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;
  • the voice recognition module is used to perform automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
  • a voice wake-up system including:
  • the server is configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal corresponds to the wake-up word voice signal.
  • a voice wake-up method including:
  • the terminal acquires the first voice signal; recognizes the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; and compares the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the full-pinyin speech signal corresponding to the third rhyme signal sequence is sent to the server;
  • the server performs automatic speech recognition processing on the full-pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
  • an electronic device including:
  • a processor coupled to the memory, is used to execute the program for:
  • another electronic device including:
  • a processor coupled to the memory, is used to execute the program for:
  • The invention provides a voice wake-up method, device, system and electronic equipment. After the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are recognized first to obtain the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence. Finally, automatic speech recognition processing is performed on the full-pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word.
  • In this scheme, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the portion of the speech signal whose rhyme part is the same as that of the wake-up word. Only that portion is then processed through automatic speech recognition to determine whether it contains the wake-up word, which achieves fast and accurate recognition of wake-up words and increases the speed at which the device is woken up.
  • FIG. 1 is a logic diagram of the basic flow of voice wake-up;
  • FIG. 2 is a schematic diagram of the processing logic of the on-device wake-up engine in the basic flow of voice wake-up;
  • FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention;
  • FIG. 4 is a structural diagram of a voice wake-up system according to an embodiment of the present invention;
  • FIG. 5 is flowchart 1 of a voice wake-up method according to an embodiment of the present invention;
  • FIG. 6 is flowchart 2 of a voice wake-up method according to an embodiment of the present invention;
  • FIG. 7 is flowchart 1 of a rhyme classifier training method according to an embodiment of the present invention;
  • FIG. 8 is flowchart 2 of a rhyme classifier training method according to an embodiment of the present invention;
  • FIG. 9 is structural diagram 1 of a voice wake-up device according to an embodiment of the present invention;
  • FIG. 10 is structural diagram 2 of a voice wake-up device according to an embodiment of the present invention;
  • FIG. 11 is structural diagram 1 of a rhyme classifier training device according to an embodiment of the present invention;
  • FIG. 12 is structural diagram 2 of a rhyme classifier training device according to an embodiment of the present invention;
  • FIG. 13 is flowchart 3 of a voice wake-up method according to an embodiment of the present invention;
  • FIG. 14 is schematic structural diagram 1 of an electronic device according to an embodiment of the present invention;
  • FIG. 15 is schematic structural diagram 2 of an electronic device according to an embodiment of the present invention.
  • After receiving the voice signal, the voice device first performs signal processing (mainly noise reduction and echo cancellation) and feature extraction on it, converting the original input audio into features (that is, the frequency spectrum of the voice) that the wake-up engine on the device side (terminal) can recognize. The features are then fed into the wake-up engine for comparison and wake-up word recognition. When the wake-up word is hit, the server is instructed to execute subsequent instructions, such as playing songs or crosstalk.
  • The on-device wake-up engine can be considered the core part of performing wake-up.
  • The on-device wake-up engine mainly includes two parts: a classifier and a post-processing part.
  • The classifier converts continuous speech features into different categories. This calculation is often the most expensive part of all wake-up work, and the number of categories output by the last layer of the neural network usually determines the computational scale of the entire network.
  • The traditional Hidden Markov Model-Deep Neural Network (HMM-DNN) approach models the probability density function (PDF) of each phone. To be usable in production it requires at least 6,000 to 8,000 categories; classifying by pinyin still requires more than 400 to more than 1,200 categories.
  • The detection of wake-up words also has a post-processing part.
  • The traditional method detects the entire word. After smoothing the output of the classifier, a Dynamic Time Warping (DTW) algorithm can be used to recognize whether the speech is the same as the wake-up word; Automatic Speech Recognition (ASR) technology can also be used to recognize whether the speech hits the wake-up word.
  • In both cases the classification network is huge, and high computing performance must be provisioned on the device.
  • The embodiment of the present invention addresses this defect of the prior art, in which the huge classification network means that more computing resources must be provisioned on the device to perform voice wake-up accurately and quickly.
  • The core idea is to split the core part of voice wake-up into two wake-up word recognition passes.
  • The first wake-up word recognition pass is completed on the terminal. This pass only classifies and recognizes the pinyin rhyme part of the wake-up word, completing a preliminary recognition of the speech signal to be recognized. The full speech signal corresponding to the rhyme signals that were preliminarily selected as matching the rhyme signals of the wake-up word is then sent to the cloud, and the cloud recognizes the entire speech signal again to determine whether it hits the wake-up word.
  • FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention. It involves two entities that perform voice wake-up: a device side (a terminal that can receive and recognize voice, such as a smart speaker) and a cloud side (on which a server is provided).
  • On the device side, the speech signal to be recognized first undergoes the first wake-up word recognition pass. This pass only performs rhyme classification on the rhyme signals of the speech signal through a pre-trained classifier. Post-processing then compares the recognized rhyme signal sequence with the rhyme part of the wake-up word to determine whether the rhyme part of the wake-up word is hit, and the full voice signal that hits the rhyme part of the wake-up word is transmitted to the cloud.
  • On the cloud side, the voice signal to be recognized is the full voice signal whose rhyme signals match the rhyme part of the wake-up word. This pass recognizes the entire voice signal; for example, ASR technology is used to identify whether the voice signal hits the wake-up word.
  • FIG. 4 is a structural diagram of a voice wake-up system provided by an embodiment of the present invention. As shown in FIG. 4, the system includes a terminal 410 and a server 420, where:
  • Terminal 410 includes:
  • a signal acquisition module for acquiring a first voice signal
  • the first voice signal is, for example, a Chinese voice signal
  • the signal recognition module is used for recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal;
  • the signal comparison module is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the server 420 includes:
  • the speech recognition module is used to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a speech signal corresponding to the wake word.
  • FIG. 5 is flowchart 1 of the voice wake-up method according to an embodiment of the present invention.
  • The method may be executed by the modules deployed in the terminal 410 and the server 420 in FIG. 4.
  • Steps S510 to S530 can be executed on the device side (terminal),
  • and step S540 can be executed in the cloud (server).
  • The voice wake-up method includes the following steps:
  • The first voice signal may be a voice signal received through the voice device; wake-up word recognition is performed on this voice signal in order to wake up the target device.
  • S520 Identify the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • In pinyin, each syllable can be separated into an initial and a rhyme part (final), for example tian -> t, ian; mao -> m, ao.
  • The initial part of pinyin (referred to as the "initial") is a short-lived peak or trough, and essentially all sustained sound lies in the pinyin rhyme part (referred to as the "rhyme part").
  • In traditional triphone modeling, it is often necessary to combine the preceding and following phones to achieve good recognition accuracy.
  • Here, the rhyme signals included in the first speech signal are identified to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • The first rhyme signal sequence includes a time sequence and the rhyme signal located at each time point in the time sequence.
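  • A minimal sketch of the initial/rhyme split described above, working on pinyin text rather than on audio. The function names and the `INITIALS` table are illustrative assumptions and are not taken from the patent; treating "y" and "w" as initials is a simplification.

```python
# Hypothetical table of pinyin initials. Two-letter initials come first so that
# "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable: str) -> tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, rhyme part)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    # Zero-initial syllables such as "an" consist of a rhyme part only.
    return "", syllable


def rhyme_sequence(syllables):
    """Keep only the rhyme part of each syllable, in order."""
    return [split_pinyin(s)[1] for s in syllables]
```

  • With the examples from the text, `split_pinyin("tian")` gives `("t", "ian")` and `rhyme_sequence(["tian", "mao"])` gives `["ian", "ao"]`.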
  • The traditional wake-up word recognition method detects the entire word. To reduce the amount of on-device computation, this scheme only recognizes the rhyme part of each word on the device: the first rhyme signal sequence above is compared with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence.
  • The check in this step filters out the parts of the voice signal whose rhyme parts differ from those of the wake-up word.
  • The advantage is that most non-wake-up words are filtered out on the device, and the cloud only needs to perform the final verification to recognize the real wake-up word. Computation is thus balanced between the device and the server, which gives high accuracy without the high latency that a large on-device model would cause.
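  • The division of work between device and cloud can be sketched as follows. This is a toy model under stated assumptions: syllables are already given as (initial, rhyme) pairs instead of being recognized from audio, the device-side comparison is an exact contiguous match, and the cloud-side ASR is replaced by a plain equality check. All names are hypothetical.

```python
# Wake-up word "tian mao" as (initial, rhyme) pairs; illustrative only.
WAKE_WORD = [("t", "ian"), ("m", "ao")]


def candidate_spans(utterance, wake_word=WAKE_WORD):
    """Device side: find every span whose rhyme parts equal the wake word's."""
    first_rhymes = [rhyme for _, rhyme in utterance]
    wake_rhymes = [rhyme for _, rhyme in wake_word]
    m = len(wake_rhymes)
    return [(start, start + m)
            for start in range(len(first_rhymes) - m + 1)
            if first_rhymes[start:start + m] == wake_rhymes]


def server_verify(candidate, wake_word=WAKE_WORD):
    """Cloud side: stand-in for full ASR, comparing complete syllables."""
    return list(candidate) == list(wake_word)


def wake(utterance):
    """Return True only if a rhyme-matched span also passes full verification."""
    return any(server_verify(utterance[start:end])
               for start, end in candidate_spans(utterance))
```

  • An utterance such as "dian pao" has the same rhyme parts as "tian mao", so it passes the device-side filter and is rejected only by the cloud-side check, while an utterance with different rhyme parts never leaves the device.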
  • In this voice wake-up method, after the first voice signal to be recognized is acquired, the pinyin rhyme signals included in the first voice signal are recognized first to obtain the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence. Finally, automatic speech recognition processing is performed on the full-pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word.
  • In this scheme, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the portion of the speech signal whose rhyme part is the same as that of the wake-up word. Only that portion is then processed through automatic speech recognition to determine whether it contains the wake-up word, which achieves fast and accurate recognition of wake-up words and increases the speed at which the device is woken up.
  • FIG. 6 is flowchart 2 of a voice wake-up method according to an embodiment of the present invention.
  • In this method a preprocessing step is added, and steps S520 and S530 are refined.
  • The voice wake-up method includes the following steps:
  • S610 Acquire a first voice signal, where the first voice signal is, for example, a Chinese voice signal.
  • The content of step S610 is the same as that of step S510.
  • S620 Perform denoising pre-processing on the first speech signal.
  • The first speech signal may be subjected to pre-processing such as noise reduction and echo cancellation, so as to preserve as much of the effective signal in the first speech signal as possible.
  • The feature spectrum is the spectrum signal, meeting certain feature requirements, into which the voice signal to be processed must be converted for classification recognition or classification training.
  • For example, the audio is cut at a fixed time length into frame spectrum signals of about 20 ms, which serve as the feature spectrum for subsequent classification recognition.
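  • The fixed-length framing mentioned above can be sketched in plain Python. The 20 ms frame length comes from the text; the 16 kHz sample rate, the non-overlapping frames, and the use of a naive discrete Fourier transform in place of an optimized FFT are assumptions made only for illustration.

```python
import cmath


def split_frames(samples, sample_rate, frame_ms=20):
    """Cut the samples into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def magnitude_spectrum(frame):
    """Magnitude of the DFT bins 0..n/2 of one frame (naive O(n^2) version)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def feature_spectrum(samples, sample_rate=16000):
    """One magnitude spectrum per 20 ms frame."""
    return [magnitude_spectrum(f) for f in split_frames(samples, sample_rate)]
```

  • At the assumed 16 kHz rate, one second of audio yields 50 frames of 320 samples each. A production front end would typically add overlapping windows and further feature processing, which the text does not specify.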
  • S640 Perform classification calculation on the characteristic spectrum of the first speech signal by using a rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • the rhyme classifier may be a speech classification model generated in advance, but the speech classification model only classifies the rhyme signal in the speech signal and outputs the sequence value of the corresponding rhyme signal.
  • Steps S630 to S640 are refinements of the above step S520.
  • The rhyme classifier training method shown in FIG. 7 may be adopted to train and generate the above-mentioned rhyme classifier.
  • The method includes:
  • The labeled pinyin rhyme signals are used as training samples, and a joint model combining a neural network algorithm with connectionist temporal classification is used to train and generate the rhyme classifier.
  • The training process mainly involves two tasks: accurately classifying the feature spectrum signals of different rhyme parts, and placing the classified rhyme parts at the correct positions in the speech signal.
  • A neural network algorithm can be used to accurately classify the feature spectrum signals of different rhyme parts, combined with the Connectionist Temporal Classification (CTC) algorithm to locate the correct position in the voice signal of each classified rhyme part.
  • These two model algorithms are used for joint modeling to generate a rhyme classifier based on training samples.
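  • How a CTC-style output yields both the rhyme labels and their positions can be sketched with the standard greedy decoding rule: merge consecutive repeated frame labels, then drop the blank label. The per-frame labels below are hand-written stand-ins for the output of a trained classifier, and the blank symbol `"_"` and function names are assumptions.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol for the CTC blank label


def ctc_segments(frame_labels):
    """Return (label, start_frame, end_frame) for each non-blank run."""
    segments = []
    index = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != BLANK:
            segments.append((label, index, index + length))
        index += length
    return segments


def ctc_collapse(frame_labels):
    """Greedy CTC decoding: merged repeats with blanks removed."""
    return [label for label, _, _ in ctc_segments(frame_labels)]
```

  • For the frame labels `["_", "ian", "ian", "_", "ao", "ao", "ao", "_"]`, `ctc_segments` returns `[("ian", 1, 3), ("ao", 4, 7)]`, which is one possible form of a rhyme signal sequence with a time sequence attached.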
  • The rhyme classifier training method shown in FIG. 8 can also be used to train and generate the above-mentioned rhyme classifier.
  • The method includes:
  • The labeled pinyin rhyme signals are used as training samples, and a joint hidden Markov model and deep neural network (HMM-DNN) model is used to train and generate the rhyme classifier.
  • the training process mainly includes two processing links, one is how to accurately classify the characteristic spectrum signals of different rhyme parts; the other is how to place the classified rhyme parts into the correct position in the speech signal.
  • the classifier in this solution is a classifier for the rhyme part that classifies the pinyin rhyme part.
  • A dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word according to the time sequence, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence.
  • The Dynamic Time Warping (DTW) algorithm can be used to align the two signal sequences being compared; the comparison is then performed according to the timing correspondence to extract from the first rhyme signal sequence the third rhyme signal sequence with the same content as the second rhyme signal sequence.
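  • A minimal sketch of a DTW-based comparison, assuming the rhyme signal sequences are lists of discrete labels and that two labels cost 0 when equal and 1 otherwise. The search over candidate windows, the `max_stretch` parameter and the zero-distance acceptance rule are illustrative choices, not details given in the text.

```python
def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two label sequences (0/1 local cost)."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            cost[i][j] = local + min(cost[i - 1][j],      # seq_a stretched
                                     cost[i][j - 1],      # seq_b stretched
                                     cost[i - 1][j - 1])  # one-to-one step
    return cost[rows][cols]


def find_wake_span(first_seq, wake_seq, max_stretch=2):
    """Return the first (start, end) window that aligns to wake_seq at zero cost."""
    n, m = len(first_seq), len(wake_seq)
    for start in range(n):
        longest = min(n, start + m * max_stretch)
        for end in range(start + m, longest + 1):
            if dtw_distance(first_seq[start:end], wake_seq) == 0.0:
                return start, end
    return None
```

  • Because DTW lets one label align with several repeated labels, a drawn-out sequence such as `["ian", "ian", "ao"]` still matches the wake-up word rhymes `["ian", "ao"]` at zero cost.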
  • S660 Perform automatic speech recognition processing on the full-pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
  • Step S660 is the same as step S540.
  • The first speech signal is preprocessed to preserve as much of the effective signal in the first speech signal as possible.
  • The pinyin rhyme signals contained in the first speech signal are recognized by a pre-trained rhyme classifier to obtain the first rhyme signal sequence corresponding to the first speech signal, which enables rapid recognition.
  • Training and modeling use either a joint neural network and connectionist temporal classification model or a joint hidden Markov model and deep neural network model, to ensure the accuracy of the trained rhyme classifier.
  • A dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so that the third rhyme signal sequence is obtained quickly and accurately.
  • FIG. 9 is structural diagram 1 of a voice wake-up device according to an embodiment of the present invention.
  • The voice wake-up device may be installed in the voice wake-up system shown in FIG. 4 to perform the method steps shown in FIG. 5. It includes:
  • the signal obtaining module 910 is used to obtain a first voice signal
  • the signal recognition module 920 is used for recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal;
  • the signal comparison module 930 is configured to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
  • the speech recognition module 940 is configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a speech signal corresponding to the wake word.
  • the signal recognition module 920 may include:
  • the feature obtaining unit 101 is used to obtain a feature spectrum of the first voice signal
  • the signal recognition unit 102 is configured to perform classification calculation on the feature spectrum of the first speech signal by using a rhyme classifier to obtain a first rhyme signal sequence corresponding to the first speech signal.
  • the voice wake-up device shown in FIG. 10 may further include:
  • the pre-processing module 103 is used to perform pre-processing for denoising the first speech signal.
  • the above-mentioned signal comparison module 930 may specifically be configured to
  • use a dynamic time warping algorithm to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word according to the time sequence, so as to extract from the first rhyme signal sequence the third rhyme signal sequence with the same content as the second rhyme signal sequence.
  • the voice wake-up device shown in FIG. 10 can be used to perform the method steps shown in FIG. 6.
  • the above voice wake-up device may further include:
  • the first spectrum acquisition module 111 is used to acquire the characteristic spectrum of the speech signal used for model training
  • the first signal labeling module 112 is used to label the pinyin rhyme signal in the characteristic spectrum
  • the first training module 113 is configured to use the labeled pinyin rhyme signals as training samples, and to use a joint model combining a neural network algorithm with connectionist temporal classification to train and generate a rhyme classifier.
  • the foregoing voice wake-up device may further include:
  • the second spectrum acquisition module 121 is used to acquire the characteristic spectrum of the speech signal used for model training
  • the second signal labeling module 122 is used to label the pinyin rhyme signal in the feature spectrum
  • the second training module 123 is configured to use the marked Pinyin rhyme signal as a training sample, and use a hidden Markov model and a deep neural network joint model algorithm to train and generate the rhyme classifier.
  • The devices shown in FIGS. 11 and 12 can be used to execute the method steps shown in FIGS. 7 and 8, respectively.
  • After acquiring the first voice signal to be recognized, the voice wake-up device first recognizes the pinyin rhyme signals included in the first voice signal to obtain the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with the preset second rhyme signal sequence of the wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence. Finally, automatic speech recognition processing is performed on the full-pinyin speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word, and thus whether the first voice signal contains the wake-up word.
  • In this scheme, the rhyme signals in the speech signal to be recognized are first compared with the rhyme part of the wake-up word to extract the portion of the speech signal whose rhyme part is the same as that of the wake-up word. Only that portion is then processed through automatic speech recognition to determine whether it contains the wake-up word, which achieves fast and accurate recognition of wake-up words and increases the speed at which the device is woken up.
  • The first speech signal is preprocessed to preserve as much of the effective signal in the first speech signal as possible.
  • The pinyin rhyme signals included in the first speech signal are recognized by a pre-trained rhyme classifier to obtain the first rhyme signal sequence corresponding to the first speech signal, which enables rapid recognition.
  • Training and modeling use either a joint neural network and connectionist temporal classification model or a joint hidden Markov model and deep neural network model, to ensure the accuracy of the trained rhyme classifier.
  • A dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word, so that the third rhyme signal sequence is obtained quickly and accurately.
  • FIG. 13 is flowchart 3 of the voice wake-up method according to an embodiment of the present invention.
  • The method may be executed by the modules deployed in the terminal 410 and the server 420 in FIG. 4. Steps S131 to S133 can be executed on the device side (terminal), and step S134 can be executed in the cloud (server).
  • the voice wake-up method includes the following steps:
  • the language type of the first voice signal is not limited, for example, it may be Chinese, English, Japanese, and so on.
  • the first voice signal may be a voice signal received through a voice device, and a wake-up word recognition is performed on the voice signal to further wake up the target device.
  • S132 Identify the vowel signal included in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal.
  • Phonologically, the sounds of natural speech are divided into vowels and consonants.
  • for example, in Chinese, vowels correspond to the rhymes (finals) in pinyin and consonants correspond to the initials in pinyin; English has five vowel letters (a, e, i, o, u) and 21 consonant letters; Japanese has five vowels, represented by the five kana あ, い, う, え, お.
  • the first vowel signal sequence corresponding to the first voice signal can be obtained by identifying the vowel signals contained in the first voice signal of any language type. For example, when the first voice signal is a Chinese voice signal, the first vowel signal sequence corresponding to the first voice signal may be the first rhyme signal sequence in the method shown in FIG. 5.
  • step S530 may be performed: the first rhyme signal sequence is compared with the second rhyme signal sequence of the wake-up word, so as to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence.
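For the Chinese case, the relation between a syllable and its rhyme can be illustrated by splitting a toneless pinyin syllable into its initial and its rhyme (final). This is a simplified sketch: the function name is hypothetical, and treating y and w as initials is an assumption made to keep the example short.

```python
# Two-letter initials are listed first so that "zh" is tried before "z".
INITIALS = (
    "zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
)


def split_pinyin(syllable):
    """Split a toneless pinyin syllable into (initial, rhyme)."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an" or "er"


# The rhyme sequence of a phrase keeps only the second element of each pair.
syllables = ["tian", "mao", "zhong", "an"]
rhymes = [split_pinyin(s)[1] for s in syllables]
print(rhymes)
```

Because many syllables share a rhyme, a rhyme-level label set is much smaller than a label set of full syllables, which is the property the rhyme classifier relies on.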
  • S134 Perform automatic speech recognition processing on the full voice signal that corresponds to the third vowel signal sequence in the first voice signal, to determine whether the full voice signal is the voice signal corresponding to the wake-up word.
  • the full voice signal corresponding to the third vowel signal sequence in the first voice signal consists of all voice signals in the first voice signal that fall within the interval spanned by the third vowel signal sequence.
  • for example, for a Chinese voice signal, the full voice signal is the full-spelling speech signal in the first voice signal that corresponds to the third rhyme signal sequence.
  • the vowel signal included in the first voice signal may specifically be a voice signal corresponding to a vowel in a single syllable included in the language type to which the first voice signal belongs.
  • for example, in Chinese, the vowel signal included in the first voice signal is the speech signal corresponding to the rhyme part of a single character.
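The step from the third vowel signal sequence to the full voice signal handed to automatic speech recognition can be sketched as an interval lookup: each recognized vowel keeps its position in the input, and the full signal is everything between the first and last matched vowel. The `VowelSegment` type, the sample-index representation, and the optional `lead_in` padding (to keep a consonant that precedes the first vowel) are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class VowelSegment:
    label: str   # recognized vowel (or rhyme) label
    start: int   # first sample index of the vowel, inclusive
    end: int     # last sample index of the vowel, exclusive


def full_signal_for_match(samples, segments, first, last, lead_in=0):
    """Return all samples spanned by segments[first..last], inclusive.

    `lead_in` optionally extends the span backwards by that many samples.
    """
    begin = max(0, segments[first].start - lead_in)
    finish = segments[last].end
    return samples[begin:finish]


# Hypothetical input: 100 samples with four recognized vowels.
samples = list(range(100))
segments = [
    VowelSegment("a", 0, 10),
    VowelSegment("i", 20, 30),
    VowelSegment("ao", 40, 55),
    VowelSegment("u", 70, 80),
]
# Suppose the comparison step matched segments 1..2 (labels "i", "ao").
candidate = full_signal_for_match(samples, segments, first=1, last=2)
print(len(candidate))
```

The slice includes the non-vowel samples between the matched vowels, so the recognizer receives complete syllables rather than isolated vowels.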
  • the voice wake-up device may include all the modules shown in FIG. 9 for performing the method steps shown in FIG. 13, which include:
  • the signal obtaining module 910 is used to obtain a first voice signal
  • the signal recognition module 920 is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;
  • the signal comparison module 930 is used to compare the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;
  • the voice recognition module 940 is configured to perform automatic speech recognition processing on the full voice signal corresponding to the third vowel signal sequence in the first voice signal, and determine whether the full voice signal is a voice signal corresponding to the wake-up word.
  • the vowel signal included in the first voice signal may be a voice signal corresponding to a vowel in a monosyllable included in the language type to which the first voice signal belongs.
  • the language type to which the first voice signal belongs may include: Chinese, English, Japanese, and so on.
  • the voice wake-up device in this embodiment may perform the method steps shown in FIG. 5.
  • This embodiment provides a voice wake-up system, including:
  • the terminal is used to: obtain a first voice signal, for example, a Chinese voice signal; identify the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and send the full-spelling speech signal corresponding to the third rhyme signal sequence to the server;
  • the server is configured to perform automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determine whether the full spelling speech signal is a speech signal corresponding to the wake word.
  • this embodiment also provides a voice wake-up method, that is, the voice wake-up method is described from the execution flow on both sides of the terminal and the server.
  • the method includes:
  • the terminal acquires a first voice signal, for example, a Chinese voice signal; recognizes the pinyin rhyme signals included in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word to extract, from the first rhyme signal sequence, a third rhyme signal sequence with the same content as the second rhyme signal sequence; and sends the full-spelling speech signal corresponding to the third rhyme signal sequence to the server;
  • the server performs automatic speech recognition processing on the full spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal, and determines whether the full spelling speech signal is a speech signal corresponding to the wake word.
  • the first part performs an initial recognition of the wake-up word on the terminal side by recognizing the rhyme signals in the first voice signal; the second part, on the server side, performs automatic speech recognition on the full-spelling speech signal corresponding to the rhyme signals singled out by the initial recognition, thereby completing the determination of whether the speech signal hits the wake-up word.
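The two-part split between terminal and server can be sketched as a cheap rhyme-level gate followed by a full recognition call that runs only when the gate passes. Everything named here is hypothetical: `find_rhyme_span` uses exact contiguous matching as a stand-in for the comparison step, per-syllable strings stand in for audio, and `recognize` is a placeholder for the server's automatic speech recognition.

```python
def find_rhyme_span(rhymes, wake_rhymes):
    """Terminal side: return (start, end) of an exact rhyme match, or None."""
    n = len(wake_rhymes)
    if n == 0:
        return None
    for i in range(len(rhymes) - n + 1):
        if list(rhymes[i:i + n]) == list(wake_rhymes):
            return i, i + n
    return None


def wake_up(syllable_audio, rhymes, wake_rhymes, wake_word, recognize):
    """Run the terminal gate first; call the server-side recognizer only on a hit."""
    span = find_rhyme_span(rhymes, wake_rhymes)
    if span is None:
        return False  # nothing is sent to the server
    candidate = syllable_audio[span[0]:span[1]]
    return recognize(candidate) == wake_word  # server-side full recognition


# Stand-in recognizer that records how often the "server" is called.
calls = []


def fake_recognize(chunks):
    calls.append(chunks)
    return "".join(chunks)


hit = wake_up(["wo", "shuo", "ni", "hao"], ["o", "uo", "i", "ao"],
              ["i", "ao"], "nihao", fake_recognize)
near_miss = wake_up(["li", "bao"], ["i", "ao"], ["i", "ao"], "nihao", fake_recognize)
no_match = wake_up(["wo", "men"], ["o", "en"], ["i", "ao"], "nihao", fake_recognize)
print(hit, near_miss, no_match, len(calls))
```

The near-miss case shows why the second stage is needed: different syllables can share the same rhymes, so a rhyme match alone does not confirm the wake-up word.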
  • Embodiment 3 describes the overall architecture of a voice wake-up device.
  • the functions of the device can be implemented by means of an electronic device.
  • as shown in FIG. 14, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention, the electronic device specifically includes a memory 141 and a processor 142.
  • the memory 141 is used to store programs.
  • the memory 141 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 141 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the processor 142 coupled to the memory 141, is used to execute the program in the memory 141 for:
  • Automatic speech recognition processing is performed on the full-spelling speech signal corresponding to the third rhyme signal sequence in the first voice signal to determine whether the full-spelling speech signal is a speech signal corresponding to the wake-up word.
  • the electronic device may further include: a communication component 143, a power component 144, an audio component 145, a display 146, and other components. Only some components are schematically shown in FIG. 14, which does not mean that the electronic device includes only the components shown in FIG. 14.
  • the communication component 143 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 143 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 143 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 144 provides power for various components of the electronic device.
  • the power component 144 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices.
  • the audio component 145 is configured to output and / or input audio signals.
  • the audio component 145 includes a microphone (MIC).
  • when the electronic device is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 141 or transmitted via the communication component 143.
  • the audio component 145 further includes a speaker for outputting audio signals.
  • the display 146 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
  • Embodiment 5 describes the overall architecture of a voice wake-up device.
  • the functions of the device can be implemented by means of an electronic device.
  • as shown in FIG. 15, which is a schematic structural diagram of an electronic device according to an embodiment of the present invention, the electronic device specifically includes a memory 151 and a processor 152.
  • the memory 151 is used to store programs.
  • the memory 151 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 151 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the processor 152 coupled to the memory 151, is used to execute the program in the memory 151 for:
  • Automatic speech recognition processing is performed on the full voice signal corresponding to the third vowel signal sequence in the first voice signal to determine whether the full voice signal is a voice signal corresponding to the wake-up word.
  • the electronic device may further include: a communication component 153, a power component 154, an audio component 155, a display 156, and other components.
  • FIG. 15 only schematically shows some components, which does not mean that the electronic device includes only the components shown in FIG. 15.
  • the communication component 153 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • Electronic devices can access wireless networks based on communication standards, such as WiFi, 2G or 3G, or a combination thereof.
  • the communication component 153 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 153 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the power supply component 154 provides power for various components of the electronic device.
  • the power supply component 154 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic devices.
  • the audio component 155 is configured to output and / or input audio signals.
  • the audio component 155 includes a microphone (MIC).
  • when the electronic device is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 151 or transmitted via the communication component 153.
  • the audio component 155 further includes a speaker for outputting audio signals.
  • the display 156 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
  • all or part of the steps of the foregoing method embodiments may be completed by a program instructing relevant hardware.
  • the aforementioned program may be stored in a computer-readable storage medium.
  • when the program is executed, it performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A voice wake-up method, apparatus and system, and an electronic device. The method comprises: obtaining a first voice signal (S510); recognizing the pinyin rhyme category signals comprised in the first voice signal to obtain a first rhyme category signal sequence corresponding to the first voice signal (S520); comparing the first rhyme category signal sequence with a second rhyme category signal sequence of a preset wake-up word to extract from the first rhyme category signal sequence a third rhyme category signal sequence having the same content as the second rhyme category signal sequence (S530); and performing automatic voice recognition processing on the complete spelling voice signal in the first voice signal that corresponds to the third rhyme category signal sequence, to determine whether the complete spelling voice signal is a voice signal corresponding to the wake-up word (S540). The method can recognize the wake-up word quickly and accurately and improve the wake-up speed of a device.

Description

Voice wake-up method, apparatus, system, and electronic device
This application claims priority to Chinese Patent Application No. 201811186019.4, filed on October 11, 2018 and entitled "Voice wake-up method, apparatus, system, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
This specification relates to the field of computer technology, and in particular to a voice wake-up method, apparatus, system, and electronic device.
Background
With the ever deeper development of artificial-intelligence applications, speech recognition technology, as the basic mode of interaction with smart devices, plays an increasingly important role. Speech recognition technology covers many aspects, including waking up a device by voice command, controlling the operation of a device, conducting human-machine dialogue with a device, and controlling multiple devices by voice command. Efficient and accurate speech recognition and a fast, convenient wake-up mode are important directions of development for smart devices.
At present, the main performance bottleneck of custom wake-up is that computing resources on the terminal device are limited, and the number of categories into which the core classifier divides speech features directly affects the speed and accuracy of wake-up. The traditional pinyin-granularity classification strategy classifies by the full pinyin spelling of commonly used Chinese characters: more than 1,200 categories with tones, or more than 400 with tones removed, which can achieve an accuracy of about 80%. Achieving higher accuracy, however, requires improving on-device computing performance and a great deal of additional post-processing work.
Summary of the Invention
The present invention provides a voice wake-up method, apparatus, system, and electronic device that can recognize a wake-up word quickly and accurately and increase the speed at which a device is woken up.
To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a voice wake-up method is provided, including:
obtaining a first voice signal;
recognizing the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
performing automatic speech recognition processing on the full-spelling speech signal that corresponds to the third rhyme signal sequence in the first voice signal, and determining whether the full-spelling speech signal is the voice signal corresponding to the wake-up word.
In a second aspect, another voice wake-up method is provided, including:
obtaining a first voice signal;
recognizing the vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
comparing the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;
performing automatic speech recognition processing on the full voice signal that corresponds to the third vowel signal sequence in the first voice signal, and determining whether the full voice signal is the voice signal corresponding to the wake-up word.
In a third aspect, a voice wake-up apparatus is provided, including:
a signal acquisition module, configured to obtain a first voice signal;
a signal recognition module, configured to recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
a signal comparison module, configured to compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
a speech recognition module, configured to perform automatic speech recognition processing on the full-spelling speech signal that corresponds to the third rhyme signal sequence in the first voice signal, and determine whether the full-spelling speech signal is the voice signal corresponding to the wake-up word.
In a fourth aspect, another voice wake-up apparatus is provided, including:
a signal acquisition module, configured to obtain a first voice signal;
a signal recognition module, configured to recognize the vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
a signal comparison module, configured to compare the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;
a speech recognition module, configured to perform automatic speech recognition processing on the full voice signal that corresponds to the third vowel signal sequence in the first voice signal, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
In a fifth aspect, a voice wake-up system is provided, including:
a terminal, configured to: obtain a first voice signal; recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence; and send the full-spelling speech signal corresponding to the third rhyme signal sequence to a server;
the server, configured to perform automatic speech recognition processing on the full-spelling speech signal that corresponds to the third rhyme signal sequence in the first voice signal, and determine whether the full-spelling speech signal is the voice signal corresponding to the wake-up word.
In a sixth aspect, a voice wake-up method is provided, including:
a terminal obtains a first voice signal; recognizes the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence; and sends the full-spelling speech signal corresponding to the third rhyme signal sequence to a server;
the server performs automatic speech recognition processing on the full-spelling speech signal that corresponds to the third rhyme signal sequence in the first voice signal, and determines whether the full-spelling speech signal is the voice signal corresponding to the wake-up word.
In a seventh aspect, an electronic device is provided, including:
a memory, configured to store a program;
a processor, coupled to the memory and configured to execute the program so as to:
obtain a first voice signal;
recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence;
perform automatic speech recognition processing on the full-spelling speech signal that corresponds to the third rhyme signal sequence in the first voice signal, and determine whether the full-spelling speech signal is the voice signal corresponding to the wake-up word.
In an eighth aspect, another electronic device is provided, including:
a memory, configured to store a program;
a processor, coupled to the memory and configured to execute the program so as to:
obtain a first voice signal;
recognize the vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
compare the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence with the same content as the second vowel signal sequence;
perform automatic speech recognition processing on the full voice signal that corresponds to the third vowel signal sequence in the first voice signal, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
The present invention provides a voice wake-up method, apparatus, system, and electronic device. After the first voice signal to be recognized is obtained, the pinyin rhyme signals contained in the first voice signal are first recognized to obtain a first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with the second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence with the same content as the second rhyme signal sequence. Finally, automatic speech recognition processing is performed on the full-spelling speech signal that corresponds to the third rhyme signal sequence in the first voice signal, to determine whether the full-spelling speech signal is the voice signal corresponding to the wake-up word and thereby whether the first voice signal contains the wake-up word. In this solution, the rhyme signals in the speech signal to be recognized are first compared with the rhyme parts of the wake-up word to extract the portion of the speech signal whose rhyme signals match those of the wake-up word; that portion is then processed as a whole through automatic speech recognition to determine whether it contains the wake-up word. The wake-up word is thus recognized quickly and accurately, and the device is woken up faster.
The above is only an overview of the technical solutions of this application. So that the technical means of this application can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objectives, features, and advantages of this application are more readily apparent, specific embodiments of this application are set out below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting this application. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
FIG. 1 is a logic diagram of the basic voice wake-up flow;
FIG. 2 is a schematic diagram of the processing logic of the on-device wake-up engine in the basic voice wake-up flow;
FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention;
FIG. 4 is a structural diagram of a voice wake-up system according to an embodiment of the present invention;
FIG. 5 is a first flowchart of a voice wake-up method according to an embodiment of the present invention;
FIG. 6 is a second flowchart of a voice wake-up method according to an embodiment of the present invention;
FIG. 7 is a first flowchart of a rhyme classification training method according to an embodiment of the present invention;
FIG. 8 is a second flowchart of a rhyme classification training method according to an embodiment of the present invention;
FIG. 9 is a first structural diagram of a voice wake-up apparatus according to an embodiment of the present invention;
FIG. 10 is a second structural diagram of a voice wake-up apparatus according to an embodiment of the present invention;
FIG. 11 is a first structural diagram of a rhyme classification training apparatus according to an embodiment of the present invention;
FIG. 12 is a second structural diagram of a rhyme classification training apparatus according to an embodiment of the present invention;
FIG. 13 is a third flowchart of a voice wake-up method according to an embodiment of the present invention;
FIG. 14 is a first schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 15 is a second schematic structural diagram of an electronic device according to an embodiment of the present invention.
DETAILED DESCRIPTION
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
FIG. 1 shows the basic flow of voice wake-up. After receiving a voice signal, the voice device first performs signal processing (mainly noise reduction and echo cancellation) and feature extraction, converting the raw input audio signal into features (that is, the spectral signal of the speech) that the on-device (terminal) wake-up engine can recognize. The features are then fed into the wake-up engine, which compares them against the wake-up word. When the wake-up word is hit, the server is instructed to execute subsequent commands, such as playing songs or crosstalk (xiangsheng) performances.
In the basic flow of voice wake-up shown in FIG. 1, the on-device wake-up engine can be regarded as the core of the wake-up operation. As shown in FIG. 2, the on-device wake-up engine consists of two main parts: a classifier and a post-processing part.
1. Classifier. The classifier converts continuous speech features into different classes. This computation is usually the most resource-intensive part of the whole wake-up task, and the number of classes output by the last layer of the neural network directly determines the computational scale of the entire network. The conventional hidden Markov model-deep neural network (Hidden Markov Model-Deep Neural Network, HMM-DNN) approach models the probability density functions (Probability Density Function, PDF) of phones, and needs at least 6000 to 8000 classes to be usable in production. Classifying by pinyin syllable still requires 1200~400 [sic] or more classes.
2. Post-processing. Every wake-up word detector has a post-processing part. The conventional approach detects the whole word: after the speech output by the classifier is smoothed, a dynamic time warping (Dynamic Time Warping, DTW) algorithm can be used to decide whether the speech matches the wake-up word. Alternatively, automatic speech recognition (Automatic Speech Recognition, ASR) can be used to decide whether the speech hits the wake-up word.
With the classifier of the on-device wake-up engine described above, the large number of classes makes the classification network very large, so the device must be provisioned with considerable computing power.
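To make the link between class count and network size concrete, the following minimal sketch counts the parameters of a fully connected output layer. The hidden width of 512 and the figure of 40 rhyme-part classes are assumptions chosen for illustration; neither number comes from this specification.

```python
def output_layer_params(hidden_units: int, num_classes: int) -> int:
    """Weights plus biases of a fully connected output layer."""
    return hidden_units * num_classes + num_classes


HIDDEN_UNITS = 512  # assumed width of the last hidden layer (illustrative only)

# 6000 is the lower bound quoted above for phone PDF classes;
# 40 is an assumed, illustrative count of rhyme-part classes.
for label, classes in [("phone PDF classes", 6000), ("rhyme-part classes", 40)]:
    print(label, classes, output_layer_params(HIDDEN_UNITS, classes))
```

Under these assumptions the output layer alone shrinks by roughly two orders of magnitude when only rhyme parts are classified.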
The embodiments of the present invention address the drawback of the prior art that, because the classification network is so large, substantial computing resources must be provisioned on the device for voice wake-up to run accurately and quickly. The core idea is to split the core of voice wake-up into two wake-up word recognition passes. The first pass is completed on the terminal; it classifies and recognizes only the pinyin rhyme parts of the wake-up word, giving a preliminary recognition of the voice signal to be recognized. The complete voice signal corresponding to the rhyme signals that were preliminarily found to match the rhyme signals of the wake-up word is then sent to the cloud, where the voice signal is recognized again as a whole to determine whether it hits the wake-up word.
FIG. 3 is a schematic diagram of the processing logic of a wake-up engine according to an embodiment of the present invention. It involves two entities that carry out voice wake-up: the device side (a terminal capable of receiving and recognizing speech, such as a smart speaker) and the cloud (where a server is deployed).
On the device side, the voice signal to be recognized first undergoes the first wake-up word recognition pass. This pass uses a pre-trained classifier to classify only the rhyme signals of the voice signal. The recognized rhyme signal sequence is then compared, in post-processing, with the rhyme parts of the wake-up word to decide whether the voice signal hits the rhyme parts of the wake-up word, and the complete voice signal that hits them is sent to the cloud.
In the cloud, the voice signals to be recognized are the complete voice signals whose rhyme signals match the rhyme parts of the wake-up word. These undergo the second wake-up word recognition pass (a secondary check), which recognizes the voice signal as a whole, for example using ASR, to determine whether it hits the wake-up word.
Based on the voice wake-up approach described above, FIG. 4 is a structural diagram of a voice wake-up system provided by an embodiment of the present invention. As shown in FIG. 4, the system includes a terminal 410 and a server 420, where:
the terminal 410 includes:
a signal acquisition module, configured to acquire a first voice signal, the first voice signal being, for example, a Chinese voice signal;
a signal recognition module, configured to recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
a signal comparison module, configured to compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence;
and the server 420 includes:
a speech recognition module, configured to perform automatic speech recognition on the full-pinyin voice signal that corresponds to the third rhyme signal sequence in the first voice signal, and to determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
The technical solutions of the present application are further described below through several embodiments.
Embodiment 1
Based on the voice wake-up approach described above, FIG. 5 is a first flowchart of a voice wake-up method according to an embodiment of the present invention. The method may be executed by the modules deployed in the terminal 410 and the server 420 of FIG. 4: steps S510 to S530 may be executed on the device (terminal), and step S540 may be executed in the cloud (server). As shown in FIG. 5, the voice wake-up method includes the following steps:
S510: Acquire a first voice signal, the first voice signal being, for example, a Chinese voice signal.
The first voice signal may be a voice signal received by a voice device; wake-up word recognition is performed on this voice signal in order to wake up the target device.
S520: Recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal.
In this step, the initial part and the rhyme part of each pinyin syllable are separated, for example tian -> t, ian and mao -> m, ao. In everyday spoken Chinese, pinyin initial parts (hereinafter "initials") such as t and m are often plosives. In the feature spectrogram of a voice signal, an initial appears as a brief peak or trough, and essentially all sustained sounds lie in the pinyin rhyme part (hereinafter "rhyme part"). Conventional triphone modeling usually needs the preceding and following phones to reach good recognition accuracy. Because on-device computation cost takes priority here, this solution drops recognition of the initials and recognizes only the rhyme signals contained in the first voice signal, yielding the first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence comprises a time series and the rhyme signal at each time point of that series.
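The separation of a pinyin syllable into its initial and rhyme part can be illustrated on text. The sketch below works on pinyin strings rather than on audio, so it only illustrates the tian -> t, ian decomposition and is not the acoustic classifier of this solution. The function names and the choice to treat y and w as initials are assumptions made for the example.

```python
from typing import List, Tuple

# Pinyin initials, two-letter ones first so that "zh" is tried before "z".
# Treating "y" and "w" as initials is a simplifying assumption.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable: str) -> Tuple[str, str]:
    """Split one pinyin syllable into (initial, rhyme part)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # syllable without an initial, such as "ao"


def rhyme_sequence(syllables: List[str]) -> List[str]:
    """Keep only the rhyme part of each syllable."""
    return [split_pinyin(s)[1] for s in syllables]


print(split_pinyin("tian"))
print(split_pinyin("mao"))
print(rhyme_sequence(["nǐ", "hǎo"]))
```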
S530: Compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence.
Conventional wake-up word recognition detects the whole word. To reduce on-device computation, this solution recognizes only the rhyme part of each character on the device: the first rhyme signal sequence is compared with the second rhyme signal sequence of the preset wake-up word, and a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence is extracted from the first rhyme signal sequence.
For example, if the preset wake-up word is "你好" (nǐ hǎo), the corresponding second rhyme signal sequence is "ǐ, ǎo", and any signal sequence "ǐ, ǎo" found in the first rhyme signal sequence can serve as a third rhyme signal sequence.
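The extraction of the third rhyme signal sequence can be sketched as a search for the wake-up word's rhyme sequence inside the recognized rhyme sequence. The sketch assumes that each sequence has already been reduced to one rhyme label per syllable and uses exact contiguous matching; the function name `find_rhyme_matches` and the sample input are hypothetical.

```python
def find_rhyme_matches(first_seq, wake_seq):
    """Return (start, end) index pairs where wake_seq occurs contiguously in first_seq."""
    first_seq, wake_seq = list(first_seq), list(wake_seq)
    n, m = len(first_seq), len(wake_seq)
    if m == 0:
        return []
    return [(i, i + m) for i in range(n - m + 1)
            if first_seq[i:i + m] == wake_seq]


# Rhyme labels recognized in an utterance (illustrative data).
first = ["īn", "ǐ", "ǎo", "iān"]
# Rhyme sequence of the wake-up word "你好" (nǐ hǎo).
wake = ["ǐ", "ǎo"]
print(find_rhyme_matches(first, wake))
```

Each returned index pair marks the span whose full voice signal would be passed on for the second recognition pass.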
S540: Perform automatic speech recognition on the full-pinyin voice signal that corresponds to the third rhyme signal sequence in the first voice signal, and determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
In practice, because only the rhyme parts are verified on the device, characters such as "好" (hǎo), "老" (lǎo), and "考" (kǎo), which share the same rhyme part, all pass the classification and may all appear in the preliminary rhyme-based wake-up word result. A secondary check is therefore needed in the cloud: automatic speech recognition (ASR) is performed on the full-pinyin voice signal (including the initial signals) that corresponds to the third rhyme signal sequence in the first voice signal, to determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
The secondary check filters out those portions of the voice signal whose initials differ from the initials of the wake-up word. The benefit is that the device filters out the vast majority of non-wake-up words, while the cloud only has to perform the final check to identify genuine wake-up words. This balances computation between the device and the server: accuracy is high, and there is none of the high latency that an oversized on-device model would cause.
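The division of labor between the two checks can be shown with a small sketch. It assumes each candidate has already been transcribed into (initial, rhyme part) pairs, so the cloud-side comparison below merely stands in for a real ASR engine; the function names are hypothetical.

```python
# Wake-up word "你好" as (initial, rhyme part) pairs.
WAKE_WORD = [("n", "ǐ"), ("h", "ǎo")]


def passes_device_check(syllables, wake=WAKE_WORD):
    """First pass (on the device): compare rhyme parts only."""
    return [rhyme for _, rhyme in syllables] == [rhyme for _, rhyme in wake]


def passes_cloud_check(syllables, wake=WAKE_WORD):
    """Second pass (in the cloud): compare full syllables, initials included."""
    return list(syllables) == list(wake)


candidates = {
    "nǐ hǎo": [("n", "ǐ"), ("h", "ǎo")],
    "nǐ lǎo": [("n", "ǐ"), ("l", "ǎo")],
    "nǐ kǎo": [("n", "ǐ"), ("k", "ǎo")],
    "nǐ men": [("n", "ǐ"), ("m", "en")],
}
for text, syllables in candidates.items():
    on_device = passes_device_check(syllables)
    woken = on_device and passes_cloud_check(syllables)
    print(text, "device:", on_device, "wake:", woken)
```

Only candidates that pass the cheap rhyme check are ever sent to the second pass, which is where the initials are finally taken into account.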
With the voice wake-up method provided by the present invention, after a first voice signal to be recognized is acquired, the pinyin rhyme signals contained in the first voice signal are recognized to obtain a first rhyme signal sequence corresponding to the first voice signal. The first rhyme signal sequence is then compared with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence. Finally, automatic speech recognition is performed on the full-pinyin voice signal that corresponds to the third rhyme signal sequence in the first voice signal, to determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word, and thereby whether the first voice signal contains the wake-up word. In this solution, the rhyme signals in the voice signal to be recognized are first compared with the rhyme parts of the wake-up word, the portion of the voice signal whose rhyme signals match the rhyme parts of the wake-up word is extracted, and that portion is then passed as a whole through automatic speech recognition to determine whether it contains the wake-up word. The wake-up word is thus recognized quickly and accurately, and the device wakes up faster.
Embodiment 2
FIG. 6 is a second flowchart of a voice wake-up method according to an embodiment of the present invention. Building on the method of the previous embodiment, it adds a preprocessing stage and refines steps S520 and S530. As shown in FIG. 6, the voice wake-up method includes the following steps:
S610: Acquire a first voice signal, the first voice signal being, for example, a Chinese voice signal.
Step S610 is the same as step S510.
S620: Preprocess the first voice signal to remove noise.
After the first voice signal is acquired, it may be preprocessed, for example by noise reduction and echo cancellation, so that as much of the useful signal in the first voice signal as possible is retained.
S630: Obtain the feature spectrum of the preprocessed first voice signal.
Here, the feature spectrum is the spectral signal, meeting certain feature requirements, into which the voice signal to be processed must be converted for classification recognition or classification training.
For example, when the first voice signal is to be classified, it is converted into a spectral signal and the audio is cut into frames of a fixed duration, for example about 20 ms each. The resulting frame spectra serve as the feature spectrum for the subsequent classification.
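The framing described above can be sketched with the standard library alone. The sketch assumes non-overlapping 20 ms frames and a plain discrete Fourier transform magnitude as the per-frame spectrum; the specification does not state the window overlap or the exact feature type, so both are assumptions, as are the function names.

```python
import cmath
import math


def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a sample list into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def magnitude_spectrum(frame):
    """Magnitude of the DFT bins 0..n/2 of one frame (naive O(n^2) transform)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# 1000 samples at 16 kHz: each 20 ms frame holds 320 samples.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1000)]
frames = split_into_frames(signal, 16000)
print(len(frames), len(frames[0]))
spectrum = magnitude_spectrum(frames[0])
print(len(spectrum))
```

A production front end would normally use an FFT and overlapping windows; the naive transform is used here only to keep the example self-contained.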
S640: Apply a rhyme-part classifier to the feature spectrum of the first voice signal to obtain the first rhyme signal sequence corresponding to the first voice signal.
The rhyme-part classifier may be a pre-trained speech classification model. This model classifies only the rhyme signals in a voice signal and outputs the sequence values of the corresponding rhyme signals.
Steps S630 to S640 refine step S520 above.
Further, the rhyme-part classifier may be trained with the rhyme-part classification training method shown in FIG. 7, which includes:
S710: Obtain the feature spectrum of the voice signals used for model training.
S720: Label the pinyin rhyme signals in the feature spectrum.
In general, because a rhyme part is affected by the adjacent initial signal during pronunciation, instances of the same rhyme part do not look exactly alike in the feature spectrum. Supervised learning makes it possible to pin down, quickly and accurately, the characteristic shapes of the rhyme signals corresponding to the different rhyme parts.
S730: Using the labeled pinyin rhyme signals as training samples, train the rhyme-part classifier with a joint model that combines a neural network algorithm with connectionist temporal classification.
Training involves two main tasks: how to classify the feature spectra of different rhyme parts accurately, and how to place the classified rhyme parts at the correct positions in the voice signal.
To address these two tasks, a neural network algorithm can be used to classify the feature spectra of the different rhyme parts accurately, combined with the connectionist temporal classification (Connectionist Temporal Classification, CTC) algorithm to fix the correct positions of the classified rhyme parts in the voice signal. The two algorithms are modeled jointly, and the rhyme-part classifier is trained on the training samples.
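One way to see how CTC relates per-frame outputs to a rhyme sequence is its collapsing rule: consecutive repeats are merged, then blank labels are dropped. The sketch below shows only this decoding rule applied to greedy per-frame labels; it is not the CTC training procedure, and the blank symbol "-" and the function name are assumptions made for the example.

```python
BLANK = "-"  # assumed symbol for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks."""
    collapsed = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


# One label per 20 ms frame, as a frame-level classifier might emit (illustrative).
frames = ["-", "ǐ", "ǐ", "-", "ǎo", "ǎo", "ǎo", "-"]
print(ctc_collapse(frames))
```

Because a blank between two identical labels keeps them separate, the rule can still represent a rhyme part that genuinely occurs twice in a row.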
Further, the rhyme-part classifier may alternatively be trained with the rhyme-part classification training method shown in FIG. 8, which includes:
S810: Obtain the feature spectrum of the voice signals used for model training.
S820: Label the pinyin rhyme signals in the feature spectrum.
In general, because a rhyme part is affected by the adjacent initial signal during pronunciation, instances of the same rhyme part do not look exactly alike in the feature spectrum. Supervised learning makes it possible to pin down, quickly and accurately, the characteristic shapes of the rhyme signals corresponding to the different rhyme parts.
S830: Using the labeled pinyin rhyme signals as training samples, train the rhyme-part classifier with a joint model that combines a hidden Markov model with a deep neural network.
Training involves two main tasks: how to classify the feature spectra of different rhyme parts accurately, and how to place the classified rhyme parts at the correct positions in the voice signal.
To address these two tasks, a hidden Markov model and a deep neural network (HMM-DNN) can also be modeled jointly, and the rhyme-part classifier is trained on the training samples.
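In a hybrid of this kind, the neural network typically scores each frame and the hidden Markov model decides where each rhyme part begins and ends. The sketch below illustrates that second task with a textbook Viterbi search over per-frame scores. All probabilities are invented for the example, the scores are simply assumed to come from a network, and the initial state prior is ignored; it is an illustration under these assumptions, not the training algorithm of this embodiment.

```python
import math

NEG_INF = float("-inf")


def viterbi(frame_scores, transitions, states):
    """Most likely state path given per-frame log scores and log transition probabilities."""
    best = {s: frame_scores[0].get(s, NEG_INF) for s in states}
    back_pointers = []
    for scores in frame_scores[1:]:
        new_best, pointers = {}, {}
        for cur in states:
            # Pick the predecessor that maximizes path score plus transition score.
            prev = max(states, key=lambda p: best[p] + transitions.get((p, cur), NEG_INF))
            new_best[cur] = (best[prev] + transitions.get((prev, cur), NEG_INF)
                             + scores.get(cur, NEG_INF))
            pointers[cur] = prev
        back_pointers.append(pointers)
        best = new_best
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


STATES = ["ǐ", "ǎo"]
# Left-to-right model: "ǐ" may stay or move on to "ǎo"; "ǎo" can only stay.
TRANSITIONS = {("ǐ", "ǐ"): math.log(0.7), ("ǐ", "ǎo"): math.log(0.3),
               ("ǎo", "ǎo"): math.log(1.0)}
# Per-frame scores as a network might produce them (invented values).
FRAME_SCORES = [
    {"ǐ": math.log(0.9), "ǎo": math.log(0.1)},
    {"ǐ": math.log(0.9), "ǎo": math.log(0.1)},
    {"ǐ": math.log(0.2), "ǎo": math.log(0.8)},
    {"ǐ": math.log(0.2), "ǎo": math.log(0.8)},
]
print(viterbi(FRAME_SCORES, TRANSITIONS, STATES))
```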
Unlike the prior art, the classifier in this solution is a rhyme-part classifier, that is, one that classifies pinyin rhyme parts.
S650: Using a dynamic time warping algorithm, compare the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word in time order, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence.
When the first rhyme signal sequence is compared with the second rhyme signal sequence of the preset wake-up word, a dynamic time warping (Dynamic Time Warping, DTW) algorithm can be used to align the two sequences. They are then compared in time order, and a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence is extracted from the first rhyme signal sequence.
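The alignment performed by dynamic time warping can be sketched on label sequences. The example assumes a 0/1 cost (0 when two rhyme labels are equal, 1 otherwise) and one label per frame, so a rhyme part held for several frames still aligns with a single label of the wake-up word. The cost function and the name `dtw_distance` are assumptions; a real system would more likely compare feature vectors or posterior scores.

```python
def dtw_distance(seq_a, seq_b, cost=lambda a, b: 0 if a == b else 1):
    """Classic DTW: minimum accumulated cost of aligning seq_a with seq_b."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # table[i][j] is the best cost of aligning the first i items with the first j items.
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(table[i - 1][j],      # seq_a item repeats
                                     table[i][j - 1],      # seq_b item repeats
                                     table[i - 1][j - 1])  # both advance
    return table[n][m]


wake = ["ǐ", "ǎo"]
print(dtw_distance(["ǐ", "ǐ", "ǐ", "ǎo", "ǎo"], wake))  # stretched but same content
print(dtw_distance(["ǐ", "ēn"], wake))                  # one rhyme part differs
```

A distance of zero under this cost means the two sequences have the same content once differences in duration are ignored.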
S660: Perform automatic speech recognition on the full-pinyin voice signal that corresponds to the third rhyme signal sequence in the first voice signal, and determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
Step S660 is the same as step S540.
The voice wake-up method provided by the present invention extends the method of Embodiment 1 as follows:
First, after the first voice signal is acquired, it is preprocessed so that as much of the useful signal in the first voice signal as possible is retained.
Second, a pre-trained rhyme-part classifier recognizes the pinyin rhyme signals contained in the first voice signal to obtain the first rhyme signal sequence corresponding to the first voice signal, which enables fast recognition. The rhyme-part classifier is trained either with a joint model combining a neural network algorithm with connectionist temporal classification, or with a joint model combining a hidden Markov model with a deep neural network, to ensure the accuracy of the trained classifier.
Finally, a dynamic time warping algorithm is used to compare the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word, so that the third rhyme signal sequence is obtained quickly and accurately.
Embodiment 3
FIG. 9 is a first structural diagram of a voice wake-up apparatus according to an embodiment of the present invention. The apparatus may be deployed in the voice wake-up system shown in FIG. 4 and is configured to perform the method steps shown in FIG. 5. It includes:
a signal acquisition module 910, configured to acquire a first voice signal;
a signal recognition module 920, configured to recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
a signal comparison module 930, configured to compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence;
a speech recognition module 940, configured to perform automatic speech recognition on the full-pinyin voice signal that corresponds to the third rhyme signal sequence in the first voice signal, and to determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
Further, as shown in FIG. 10, in the above voice wake-up apparatus the signal recognition module 920 may include:
a feature acquisition unit 101, configured to obtain the feature spectrum of the first voice signal;
a signal recognition unit 102, configured to apply a rhyme-part classifier to the feature spectrum of the first voice signal to obtain the first rhyme signal sequence corresponding to the first voice signal.
Further, the voice wake-up apparatus shown in FIG. 10 may also include:
a preprocessing module 103, configured to preprocess the first voice signal to remove noise.
Further, the signal comparison module 930 may specifically be configured to:
compare, using a dynamic time warping algorithm, the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word in time order, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as that of the second rhyme signal sequence.
The voice wake-up apparatus shown in FIG. 10 can be used to perform the method steps shown in FIG. 6.
Further, as shown in FIG. 11, the above voice wake-up apparatus may also include:
a first spectrum acquisition module 111, configured to obtain the feature spectrum of the voice signals used for model training;
a first signal labeling module 112, configured to label the pinyin rhyme signals in the feature spectrum;
a first training module 113, configured to use the labeled pinyin rhyme signals as training samples and train the rhyme-part classifier with a joint model combining a neural network algorithm with connectionist temporal classification.
Further, as shown in FIG. 12, the above voice wake-up apparatus may also include:
a second spectrum acquisition module 121, configured to obtain the feature spectrum of the voice signals used for model training;
a second signal labeling module 122, configured to label the pinyin rhyme signals in the feature spectrum;
a second training module 123, configured to use the labeled pinyin rhyme signals as training samples and train the rhyme-part classifier with a joint model combining a hidden Markov model with a deep neural network.
The apparatuses shown in FIG. 11 and FIG. 12 can be used to perform the method steps shown in FIG. 7 and FIG. 8, respectively.
本发明提供的语音唤醒装置,在获取到待识别的第一语音信号后,先对第一语音信 号中包含的拼音韵部信号进行识别,得到第一语音信号对应的第一韵部信号序列;然后,将第一韵部信号序列与预设的唤醒词的第二韵部信号序列进行比较,以从第一韵部信号序列中提取与第二韵部信号序列内容相同的第三韵部信号序列;最后,对第三韵部信号序列对应在第一语音信号中的全拼语音信号进行自动语音识别处理,确定全拼语音信号是否为唤醒词对应的语音信号,进而识别出第一语音信号中是否包含唤醒词。本方案采用先对待识别的语音信号中的韵部信号与唤醒词的韵部进行比对,提取出待识别语音信号中韵部信号与唤醒词韵部相同的语音信号部分,然后针对该部分语音信号再整体通过自动语音识别处理以确定其中是否包含唤醒词,从而实现快速、准确的识别唤醒词,提高设备的被唤醒速度。The voice wake-up device provided by the present invention, after acquiring the first voice signal to be recognized, first recognizes the pinyin rhyme signal included in the first voice signal to obtain the first rhyme signal sequence corresponding to the first voice signal; Then, the first rhyme signal sequence is compared with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal having the same content as the second rhyme signal sequence from the first rhyme signal sequence Sequence; finally, automatic speech recognition processing is performed on the full spelling speech signal corresponding to the third rhythm signal sequence in the first voice signal to determine whether the full spelling speech signal is a speech signal corresponding to the wake-up word, and then the first speech signal is recognized Whether to include the wake word. In this solution, the rhythm signal in the speech signal to be recognized is first compared with the rhythm part of the wake word to extract the part of the speech signal in the speech signal to be recognized that is the same as the rhythm part of the wake word, and then for the part of the speech The signal is then processed through automatic speech recognition to determine whether it contains wake-up words, so as to achieve fast and accurate recognition of wake-up words and improve the speed of the device being awakened.
进一步地,在获取到第一语音信号之后,对第一语音信号进行预处理,以最大程度保留第一语音信号中的有效信号比例。Further, after the first speech signal is acquired, the first speech signal is preprocessed to retain the effective signal ratio in the first speech signal to the greatest extent.
Further, a pre-trained rhyme classifier is used to recognize the pinyin rhyme signals contained in the first voice signal and obtain the first rhyme signal sequence corresponding to the first voice signal, which enables fast recognition. The rhyme classifier is trained either with a joint model algorithm that combines a neural network with connectionist temporal classification (CTC), or with a joint model algorithm that combines a hidden Markov model with a deep neural network, so as to ensure the accuracy of the trained rhyme classifier.
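As an illustration of how frame-level classifier output could become a rhyme signal sequence, the following minimal sketch applies greedy CTC decoding: consecutive repeated labels are collapsed and blank labels are dropped. The blank symbol `"_"`, the function name `ctc_greedy_decode`, and the example labels are assumptions made for illustration; the source does not specify the decoding procedure.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank label emitted by a CTC-trained classifier


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse runs of identical frame labels, then remove blank labels."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Hypothetical per-frame rhyme labels (one label per analysis frame).
frames = ["_", "ian", "ian", "_", "ao", "ao", "ao", "_", "_", "ing", "_", "ing"]
print(ctc_greedy_decode(frames))  # ['ian', 'ao', 'ing', 'ing']
```

The blank between the two `"ing"` runs is what keeps them as two separate labels after collapsing.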
进一步地,采用动态时间规整算法对第一韵部信号序列与预设的唤醒词的第二韵部信号序列进行比较,以快速准确的得到第三韵部信号序列。Further, a dynamic time warping algorithm is used to compare the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to quickly and accurately obtain the third rhyme signal sequence.
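The comparison step can be sketched as a subsequence dynamic time warping search over label sequences. This is a minimal sketch under stated assumptions, not the patented implementation: the local cost is 0 for identical labels and 1 otherwise, the match may start and end anywhere in the first sequence, and ties are broken in favour of the earliest start. The function name `find_wake_span` and the example labels are hypothetical.

```python
import math


def find_wake_span(first_seq, second_seq):
    """Return (start, end, cost) so that first_seq[start:end] best aligns
    with second_seq under subsequence DTW, or None for empty input."""
    n, m = len(first_seq), len(second_seq)
    if n == 0 or m == 0:
        return None
    cost_tbl = [[math.inf] * (n + 1) for _ in range(m + 1)]
    start_tbl = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        cost_tbl[0][j] = 0.0  # a match may begin at any position
        start_tbl[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = 0.0 if second_seq[i - 1] == first_seq[j - 1] else 1.0
            options = [
                (cost_tbl[i - 1][j - 1], start_tbl[i - 1][j - 1]),  # advance both
                (cost_tbl[i][j - 1], start_tbl[i][j - 1]),  # wake-word label spans more input
            ]
            if i > 1:
                # one input label is shared by two consecutive wake-word labels
                options.append((cost_tbl[i - 1][j], start_tbl[i - 1][j]))
            best_cost, best_start = min(options)  # lowest cost, then earliest start
            cost_tbl[i][j] = local + best_cost
            start_tbl[i][j] = best_start
    end = min(range(1, n + 1), key=lambda j: cost_tbl[m][j])
    return start_tbl[m][end], end, cost_tbl[m][end]


# Hypothetical first sequence (with repeated frame labels) and wake-word sequence.
first = ["e", "i", "i", "ao", "ao", "ao", "ai", "a"]
second = ["i", "ao", "ai"]
print(find_wake_span(first, second))  # (1, 7, 0.0)
```

A returned cost of 0 corresponds to the "same content" condition in the text; accepting a small non-zero cost would be a design choice that the source does not discuss.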
实施例四Example 4
Based on the voice wake-up scheme described above, FIG. 13 shows a third flowchart of the voice wake-up method according to an embodiment of the present invention. The method may be executed by modules deployed in the terminal 410 and the server 420 of FIG. 4: steps S131 to S133 may be executed on the device side (the terminal), and step S134 may be executed in the cloud (the server). As shown in FIG. 13, the voice wake-up method includes the following steps:
S131,获取第一语音信号。S131. Acquire a first voice signal.
本步骤中,对第一语音信号的语言类型不做限定,例如可以为中文、英文、日文等。该第一语音信号可以为通过语音设备接收的语音信号,通过对该语音信号进行唤醒词的识别,以进一步唤醒目标设备。In this step, the language type of the first voice signal is not limited, for example, it may be Chinese, English, Japanese, and so on. The first voice signal may be a voice signal received through a voice device, and a wake-up word recognition is performed on the voice signal to further wake up the target device.
S132,对第一语音信号中包含的元音信号进行识别,得到第一语音信号对应的第一元音信号序列。S132: Identify the vowel signal included in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal.
Phonetically, the sounds of natural speech are divided into vowels and consonants. In Chinese, vowels correspond to the rhyme parts (finals) of pinyin and consonants correspond to the initials. English has five vowel letters (a, e, i, o, u) and 21 consonant letters. Japanese has five vowels, written with the five kana "あ・い・う・え・お", whose pronunciation is phonologically close to [a], [i], [ɯ], [e] and [o] (the third symbol appears in the source as the image PCTCN2019108828-appb-000001); its consonants comprise the unvoiced consonants of the か, さ, た, な, は, ま, や, ら and わ rows, the voiced consonants of the が, ざ, だ and ば rows, and the semi-voiced consonants of the ぱ row. For a first voice signal in any language, recognizing the vowel signals it contains yields the first vowel signal sequence corresponding to that voice signal. For example, when the first voice signal is a Chinese voice signal, the first vowel signal sequence may be the first rhyme signal sequence of the method shown in FIG. 5.
S133. Compare the first vowel signal sequence with the second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence whose content is the same as the second vowel signal sequence.
For example, when the signal being processed is a Chinese voice signal, the content of step S530 may be performed: the first rhyme signal sequence is compared with the second rhyme signal sequence of the wake-up word, and a third rhyme signal sequence whose content is the same as the second rhyme signal sequence is extracted from the first rhyme signal sequence.
S134. Perform automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
Here, the full voice signal corresponding to the third vowel signal sequence is all of the voice signal lying within the interval of the first voice signal that the third vowel signal sequence spans. When the voice signal is a Chinese voice signal, the full voice signal is the full-pinyin speech signal in the first voice signal that corresponds to the third rhyme signal sequence.
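One way to picture the "full voice signal" is as a slice of the original samples covering the time interval spanned by the matched vowel labels. The sketch below assumes each vowel label carries a start and end time in seconds, a 16 kHz sample rate, and an optional `lead_margin` that reaches back to include a consonant preceding the first matched vowel; all of these names and values are hypothetical and are not taken from the source.

```python
def full_signal_span(samples, label_times, start, end,
                     sample_rate=16000, lead_margin=0.0):
    """Return the samples covering vowel labels start..end-1.

    label_times[k] is a (t_start, t_end) pair in seconds for the k-th label.
    lead_margin (seconds) widens the slice toward earlier samples.
    """
    t0 = max(0.0, label_times[start][0] - lead_margin)
    t1 = label_times[end - 1][1]
    lo = int(t0 * sample_rate)
    hi = min(len(samples), int(t1 * sample_rate))
    return samples[lo:hi]


# One second of dummy samples and four hypothetical vowel intervals.
samples = list(range(16000))
label_times = [(0.0, 0.1), (0.25, 0.5), (0.5, 0.75), (0.9, 1.0)]
segment = full_signal_span(samples, label_times, start=1, end=3)
print(len(segment), segment[0])  # 8000 4000
```

Everything inside the slice, vowels and consonants alike, would then be handed to automatic speech recognition.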
进一步地,根据第一语音信号所属语言类型的不同,上述第一语音信号中包含的元音信号可以具体为第一语音信号所属语言类型所包含的单音节中元音对应的语音信号。Further, according to different language types to which the first voice signal belongs, the vowel signal included in the first voice signal may specifically be a voice signal corresponding to a vowel in a single syllable included in the language type to which the first voice signal belongs.
例如,当第一语音信号为中文语音信号时,第一语音信号中包含的元音信号即为中文所包含的单字中韵部对应的语音信号。For example, when the first speech signal is a Chinese speech signal, the vowel signal included in the first speech signal is the speech signal corresponding to the rhyme part of the single word included in Chinese.
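For Chinese, the rhyme part of each syllable can be illustrated at the text level by stripping the pinyin initial from a toneless syllable. This sketch only demonstrates the initial/final split on strings and is not the acoustic recognition described above. Treating `y` and `w` as initials is a simplifying assumption, and the helper names are hypothetical.

```python
# Pinyin initials; two-letter initials come first so "zh" is tried before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_pinyin(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def rhyme_sequence(pinyin_text):
    """Return the finals of space-separated pinyin syllables."""
    return [split_pinyin(s)[1] for s in pinyin_text.split()]


print(rhyme_sequence("ni hao zhong wen"))  # ['i', 'ao', 'ong', 'en']
```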
实施例五Example 5
本发明实施例提供一种语音唤醒装置,该语音唤醒装置可包含图9所示的所有模块,用于执行图13所示的方法步骤,其包括:An embodiment of the present invention provides a voice wake-up device. The voice wake-up device may include all the modules shown in FIG. 9 for performing the method steps shown in FIG. 13, which include:
信号获取模块910,用于获取第一语音信号;The signal obtaining module 910 is used to obtain a first voice signal;
信号识别模块920,用于对第一语音信号中包含的元音信号进行识别,得到第一语音信号对应的第一元音信号序列;The signal recognition module 920 is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;
The signal comparison module 930 is configured to compare the first vowel signal sequence with the second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence whose content is the same as the second vowel signal sequence;
The voice recognition module 940 is configured to perform automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
进一步地,上述第一语音信号中包含的元音信号可以为第一语音信号所属语言类型所包含的单音节中元音对应的语音信号。Further, the vowel signal included in the first voice signal may be a voice signal corresponding to a vowel in a monosyllable included in the language type to which the first voice signal belongs.
例如,第一语音信号所属语言类型可包括:中文、英文、日文等。当第一语音信号为中文语音信号时,本实施例中的语音唤醒装置可以执行如图5所示的方法步骤。For example, the language type to which the first voice signal belongs may include: Chinese, English, Japanese, and so on. When the first voice signal is a Chinese voice signal, the voice wake-up device in this embodiment may perform the method steps shown in FIG. 5.
实施例六Example Six
本实施例提供了一种语音唤醒系统,包括:This embodiment provides a voice wake-up system, including:
A terminal, configured to: acquire a first voice signal, which is, for example, a Chinese voice signal; recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with the second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as the second rhyme signal sequence; and send the full-pinyin speech signal corresponding to the third rhyme signal sequence to a server;
The server, configured to perform automatic speech recognition on the full-pinyin speech signal in the first voice signal that corresponds to the third rhyme signal sequence, and determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
相应的,基于上述语音唤醒系统,本实施例还提供了一种语音唤醒方法,即从终端和服务端两侧的执行流程对语音唤醒方法进行描述。该方法包括:Correspondingly, based on the above voice wake-up system, this embodiment also provides a voice wake-up method, that is, the voice wake-up method is described from the execution flow on both sides of the terminal and the server. The method includes:
The terminal acquires a first voice signal, which is, for example, a Chinese voice signal; recognizes the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compares the first rhyme signal sequence with the second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as the second rhyme signal sequence; and sends the full-pinyin speech signal corresponding to the third rhyme signal sequence to the server;
The server performs automatic speech recognition on the full-pinyin speech signal in the first voice signal that corresponds to the third rhyme signal sequence, and determines whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
The entire wake-up process is split into two parts. In the first part, the terminal performs a preliminary recognition of the wake-up word by recognizing the rhyme signals in the first voice signal. In the second part, the server performs automatic speech recognition on the full-pinyin speech signal corresponding to the rhyme signals extracted by the preliminary recognition, thereby completing the determination of whether the voice signal hits the wake-up word. This balances the computation of the wake-up process between the terminal and the server, reduces the computational load on the terminal, and improves the execution efficiency of the overall voice wake-up process.
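The two-part split can be sketched as two functions, one per side. This is a simplified sketch under stated assumptions: the rhyme recognizer and the speech recognizer are passed in as callables (`recognize_rhymes` and `transcribe`, both hypothetical), each recognized rhyme carries sample indices, and the terminal uses an exact label match for brevity rather than dynamic time warping.

```python
def terminal_stage(signal, recognize_rhymes, wake_rhymes):
    """Terminal side: return the candidate segment to upload, or None.

    recognize_rhymes(signal) yields (label, start_index, end_index) tuples.
    """
    wake_rhymes = list(wake_rhymes)
    if not wake_rhymes:
        return None
    labeled = list(recognize_rhymes(signal))
    labels = [label for label, _, _ in labeled]
    m = len(wake_rhymes)
    for k in range(len(labels) - m + 1):
        if labels[k:k + m] == wake_rhymes:
            lo = labeled[k][1]
            hi = labeled[k + m - 1][2]
            return signal[lo:hi]  # full signal spanned by the matched rhymes
    return None  # nothing is sent to the server


def server_stage(candidate, transcribe, wake_word):
    """Server side: run speech recognition on the uploaded segment only."""
    return transcribe(candidate) == wake_word


# Stand-in recognizers, used only to exercise the control flow.
fake_rhymes = lambda s: [("e", 0, 2), ("i", 2, 4), ("ao", 4, 6)]
fake_asr = lambda seg: "ni hao"
segment = terminal_stage(list(range(8)), fake_rhymes, ["i", "ao"])
print(segment)  # [2, 3, 4, 5]
print(server_stage(segment, fake_asr, "ni hao"))  # True
```

Because the terminal returns `None` when no rhyme match is found, the costlier recognition on the server runs only for candidate segments.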
实施例七Example 7
前面实施例三描述了一种语音唤醒装置的整体架构,该装置的功能可借助一种电子设备实现完成,如图14所示,其为本发明实施例的电子设备的结构示意图,具体包括:存储器141和处理器142。The foregoing Embodiment 3 describes the overall architecture of a voice wake-up device. The functions of the device can be implemented by means of an electronic device. As shown in FIG. 14, it is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which specifically includes: Memory 141 and processor 142.
存储器141,用于存储程序。The memory 141 is used to store programs.
除上述程序之外,存储器141还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。In addition to the above-mentioned programs, the memory 141 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.
存储器141可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 141 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable and removable Programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
处理器142,耦合至存储器141,用于执行存储器141中的程序,以用于:The processor 142, coupled to the memory 141, is used to execute the program in the memory 141 for:
获取第一语音信号;Get the first voice signal;
对第一语音信号中包含的拼音韵部信号进行识别,得到第一语音信号对应的第一韵部信号序列;Recognize the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;
将第一韵部信号序列与预设的唤醒词的第二韵部信号序列进行比较,以从第一韵部信号序列中提取与第二韵部信号序列内容相同的第三韵部信号序列;Comparing the first rhyme signal sequence with the preset second rhyme signal sequence of the wake-up word to extract a third rhyme signal sequence with the same content as the second rhyme signal sequence from the first rhyme signal sequence;
对第三韵部信号序列对应在第一语音信号中的全拼语音信号进行自动语音识别处理,确定全拼语音信号是否为所述唤醒词对应的语音信号。Automatic speech recognition processing is performed on the full spelling speech signal corresponding to the third rhythm signal sequence in the first voice signal to determine whether the full spelling speech signal is a speech signal corresponding to the wake-up word.
上述的具体处理操作已经在前面实施例中进行了详细说明,在此不再赘述。The above specific processing operations have been described in detail in the previous embodiments, and will not be repeated here.
进一步,如图14所示,电子设备还可以包括:通信组件143、电源组件144、音频组件145、显示器146等其它组件。图14中仅示意性给出部分组件,并不意味着电子设备只包括图14所示组件。Further, as shown in FIG. 14, the electronic device may further include: a communication component 143, a power component 144, an audio component 145, a display 146, and other components. Only some components are schematically shown in FIG. 14, and it does not mean that the electronic device includes only the components shown in FIG.
The communication component 143 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 143 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 143 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
电源组件144,为电子设备的各种组件提供电力。电源组件144可以包括电源管理 系统,一个或多个电源,及其他与为电子设备生成、管理和分配电力相关联的组件。The power supply component 144 provides power for various components of the electronic device. The power component 144 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices.
音频组件145被配置为输出和/或输入音频信号。例如,音频组件145包括一个麦克风(MIC),当电子设备处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器141或经由通信组件143发送。在一些实施例中,音频组件145还包括一个扬声器,用于输出音频信号。The audio component 145 is configured to output and / or input audio signals. For example, the audio component 145 includes a microphone (MIC). When the electronic device is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signal may be further stored in the memory 141 or transmitted via the communication component 143. In some embodiments, the audio component 145 further includes a speaker for outputting audio signals.
显示器146包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与触摸或滑动操作相关的持续时间和压力。The display 146 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
实施例八Example 8
前面实施例五描述了一种语音唤醒装置的整体架构,该装置的功能可借助一种电子设备实现完成,如图15所示,其为本发明实施例的电子设备的结构示意图,具体包括:存储器151和处理器152。The foregoing Embodiment 5 describes the overall architecture of a voice wake-up device. The functions of the device can be implemented by means of an electronic device. As shown in FIG. 15, it is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Memory 151 and processor 152.
存储器151,用于存储程序。The memory 151 is used to store programs.
除上述程序之外,存储器151还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。In addition to the above programs, the memory 151 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method for operating on the electronic device, contact data, phone book data, messages, pictures, videos, etc.
存储器151可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 151 may be implemented by any type of volatile or nonvolatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable and removable Programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
处理器152,耦合至存储器151,用于执行存储器151中的程序,以用于:The processor 152, coupled to the memory 151, is used to execute the program in the memory 151 for:
获取第一语音信号;Get the first voice signal;
对第一语音信号中包含的元音信号进行识别,得到第一语音信号对应的第一元音信号序列;Identify the vowel signal contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
将第一元音信号序列与预设的唤醒词的第二元音信号序列进行比较,以从第一元音信号序列中提取与第二元音信号序列内容相同的第三元音信号序列;Comparing the first vowel signal sequence with the preset second vowel signal sequence of the wake-up word to extract a third vowel signal sequence with the same content as the second vowel signal sequence from the first vowel signal sequence;
Perform automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
上述的具体处理操作已经在前面实施例中进行了详细说明,在此不再赘述。The above specific processing operations have been described in detail in the previous embodiments, and will not be repeated here.
进一步,如图15所示,电子设备还可以包括:通信组件153、电源组件154、音频组件155、显示器156等其它组件。图15中仅示意性给出部分组件,并不意味着电子设备只包括图15所示组件。Further, as shown in FIG. 15, the electronic device may further include: a communication component 153, a power component 154, an audio component 155, a display 156, and other components. FIG. 15 only schematically shows some components, which does not mean that the electronic device includes only the components shown in FIG. 15.
The communication component 153 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 153 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 153 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
电源组件154,为电子设备的各种组件提供电力。电源组件154可以包括电源管理系统,一个或多个电源,及其他与为电子设备生成、管理和分配电力相关联的组件。The power supply component 154 provides power for various components of the electronic device. The power supply component 154 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic devices.
音频组件155被配置为输出和/或输入音频信号。例如,音频组件155包括一个麦克风(MIC),当电子设备处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器151或经由通信组件153发送。在一些实施例中,音频组件155还包括一个扬声器,用于输出音频信号。The audio component 155 is configured to output and / or input audio signals. For example, the audio component 155 includes a microphone (MIC). When the electronic device is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signal may be further stored in the memory 151 or transmitted via the communication component 153. In some embodiments, the audio component 155 further includes a speaker for outputting audio signals.
显示器156包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与触摸或滑动操作相关的持续时间和压力。The display 156 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
本领域普通技术人员可以理解:实现上述各方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成。前述的程序可以存储于一计算机可读取存储介质中。该程序在执行时,执行包括上述各方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。Those of ordinary skill in the art may understand that all or part of the steps of the foregoing method embodiments may be completed by a program instructing relevant hardware. The aforementioned program may be stored in a computer-readable storage medium. When the program is executed, the steps including the foregoing method embodiments are executed; and the foregoing storage medium includes various media that can store program codes, such as ROM, RAM, magnetic disk, or optical disk.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, and some or all of their technical features may be replaced with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (15)

  1. 一种语音唤醒方法,包括:A voice wake-up method, including:
    获取第一语音信号;Get the first voice signal;
    对所述第一语音信号中包含的拼音韵部信号进行识别,得到所述第一语音信号对应的第一韵部信号序列;Recognizing the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;
    Comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as the second rhyme signal sequence;
    Performing automatic speech recognition on the full-pinyin speech signal in the first voice signal that corresponds to the third rhyme signal sequence, and determining whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
  2. 根据权利要求1所述的方法,其中,所述对所述第一语音信号中包含的拼音韵部信号进行识别,得到所述第一语音信号对应的第一韵部信号序列包括:The method according to claim 1, wherein the recognizing the pinyin rhyme signal included in the first speech signal to obtain the first rhyme signal sequence corresponding to the first speech signal includes:
    获取所述第一语音信号的特征频谱;Acquiring the characteristic spectrum of the first speech signal;
    Classifying the characteristic spectrum of the first voice signal with a rhyme classifier to obtain the first rhyme signal sequence corresponding to the first voice signal.
  3. 根据权利要求2所述的方法,其中,所述方法还包括:The method of claim 2, wherein the method further comprises:
    获取用于模型训练的语音信号的特征频谱;Obtain the characteristic spectrum of the speech signal used for model training;
    对所述特征频谱中的拼音韵部信号进行标注;Annotate the pinyin rhyme signal in the characteristic spectrum;
    Using the annotated pinyin rhyme signals as training samples, training the rhyme classifier with a joint model algorithm that combines a neural network with connectionist temporal classification.
  4. 根据权利要求2所述的方法,其中,所述方法还包括:The method of claim 2, wherein the method further comprises:
    获取用于模型训练的语音信号的特征频谱;Obtain the characteristic spectrum of the speech signal used for model training;
    对所述特征频谱中的拼音韵部信号进行标注;Annotate the pinyin rhyme signal in the characteristic spectrum;
    以已标注的拼音韵部信号作为训练样本,采用隐马尔科夫模型以及深度神经网络的联合模型算法训练生成韵部分类器。Taking the marked pinyin rhyme signals as training samples, a hidden Markov model and deep neural network combined model algorithm are used to train and generate rhyme classifiers.
  5. 根据权利要求1所述的方法,其中,所述对所述第一语音信号中包含的拼音韵部信号进行识别,得到所述第一语音信号对应的第一韵部信号序列之前还包括:The method according to claim 1, wherein before recognizing the pinyin rhyme signal included in the first speech signal, before obtaining the first rhyme signal sequence corresponding to the first speech signal, further comprising:
    对所述第一语音信号进行去噪的预处理。Pre-denoising the first speech signal.
  6. 根据权利要求1所述的方法,其中,所述将所述第一韵部信号序列与预设的唤醒词的第二韵部信号序列进行比较,从所述第一韵部信号序列中提取与所述第二韵部信号序列内容相同的第三韵部信号序列包括:The method according to claim 1, wherein the comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, extracting from the first rhyme signal sequence and The third rhyme signal sequence with the same content of the second rhyme signal sequence includes:
    Using a dynamic time warping algorithm to compare, in time order, the first rhyme signal sequence with the second rhyme signal sequence of the preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as the second rhyme signal sequence.
  7. 根据权利要求1所述的方法,其中,所述第一语音信号为中文语音信号。The method according to claim 1, wherein the first voice signal is a Chinese voice signal.
  8. 一种语音唤醒方法,包括:A voice wake-up method, including:
    获取第一语音信号;Get the first voice signal;
    对所述第一语音信号中包含的元音信号进行识别,得到所述第一语音信号对应的第一元音信号序列;Identify the vowel signal contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
    Comparing the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence whose content is the same as the second vowel signal sequence;
    Performing automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determining whether the full voice signal is the voice signal corresponding to the wake-up word.
  9. 根据权利要求8所述的方法,其中,所述第一语音信号中包含的元音信号为所述第一语音信号所属语言类型所包含的单音节中元音对应的语音信号。The method according to claim 8, wherein the vowel signal included in the first speech signal is a speech signal corresponding to a vowel in a single syllable included in the language type to which the first speech signal belongs.
  10. 一种语音唤醒装置,包括:A voice wake-up device, including:
    信号获取模块,用于获取第一语音信号;The signal acquisition module is used to acquire the first voice signal;
    信号识别模块,用于对所述第一语音信号中包含的拼音韵部信号进行识别,得到所述第一语音信号对应的第一韵部信号序列;The signal recognition module is used to recognize the pinyin rhyme signal included in the first speech signal to obtain a first rhyme signal sequence corresponding to the first speech signal;
    A signal comparison module, configured to compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is the same as the second rhyme signal sequence;
    A voice recognition module, configured to perform automatic speech recognition on the full-pinyin speech signal in the first voice signal that corresponds to the third rhyme signal sequence, and determine whether the full-pinyin speech signal is the voice signal corresponding to the wake-up word.
  11. 一种语音唤醒装置,包括:A voice wake-up device, including:
    信号获取模块,用于获取第一语音信号;The signal acquisition module is used to acquire the first voice signal;
    信号识别模块,用于对所述第一语音信号中包含的元音信号进行识别,得到所述第一语音信号对应的第一元音信号序列;The signal recognition module is used to recognize the vowel signal contained in the first voice signal to obtain the first vowel signal sequence corresponding to the first voice signal;
    A signal comparison module, configured to compare the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence whose content is the same as the second vowel signal sequence;
    A voice recognition module, configured to perform automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
  12. A voice wake-up system, comprising:
    a terminal, configured to: acquire a first voice signal; recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is identical to that of the second rhyme signal sequence; and send the full-pinyin voice signal corresponding to the third rhyme signal sequence to a server; and
    the server, configured to perform automatic speech recognition on the full-pinyin voice signal in the first voice signal that corresponds to the third rhyme signal sequence, and to determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
  13. A voice wake-up method, comprising:
    acquiring, by a terminal, a first voice signal; recognizing the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal; comparing the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is identical to that of the second rhyme signal sequence; and sending the full-pinyin voice signal corresponding to the third rhyme signal sequence to a server; and
    performing, by the server, automatic speech recognition on the full-pinyin voice signal in the first voice signal that corresponds to the third rhyme signal sequence, and determining whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
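The division of work between terminal and server might look like the following sketch. Networking is omitted and the recognizer is injected as a plain callable, so `terminal_prescreen`, `server_verify`, `fake_asr`, and the sample data are all hypothetical stand-ins rather than anything specified in the claims.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Rhyme = Tuple[str, int, int]  # (rhyme label, start sample, end sample)


def terminal_prescreen(
    samples: Sequence[int], rhymes: List[Rhyme], wake_rhymes: List[str]
) -> Optional[Sequence[int]]:
    """Terminal side: cheap rhyme-level match. Returns the audio segment to
    send to the server, or None when nothing should be sent."""
    m = len(wake_rhymes)
    if m == 0:
        return None
    labels = [r[0] for r in rhymes]
    for i in range(len(rhymes) - m + 1):
        if labels[i:i + m] == wake_rhymes:
            start, end = rhymes[i][1], rhymes[i + m - 1][2]
            return samples[start:end]
    return None


def server_verify(
    segment: Sequence[int], asr: Callable[[Sequence[int]], str], wake_word: str
) -> bool:
    """Server side: run full recognition on the segment and compare the
    transcript with the wake-up word."""
    return asr(segment) == wake_word


# Hypothetical data: ten audio samples and four recognized rhymes.
samples = list(range(10))
rhymes = [("i", 0, 2), ("ian", 2, 5), ("ao", 5, 8), ("a", 8, 10)]


def fake_asr(segment: Sequence[int]) -> str:
    """Stand-in recognizer that only 'understands' one exact segment."""
    return "tianmao" if list(segment) == [2, 3, 4, 5, 6, 7] else ""


segment = terminal_prescreen(samples, rhymes, ["ian", "ao"])
woke = segment is not None and server_verify(segment, fake_asr, "tianmao")
print(woke)  # True
```

With this split the terminal uploads audio only when the rhyme pattern matches, and the server makes the final decision on the candidate segment.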
  14. An electronic device, comprising:
    a memory, configured to store a program; and
    a processor, coupled to the memory and configured to execute the program so as to:
    acquire a first voice signal;
    recognize the pinyin rhyme signals contained in the first voice signal to obtain a first rhyme signal sequence corresponding to the first voice signal;
    compare the first rhyme signal sequence with a second rhyme signal sequence of a preset wake-up word, so as to extract from the first rhyme signal sequence a third rhyme signal sequence whose content is identical to that of the second rhyme signal sequence; and
    perform automatic speech recognition on the full-pinyin voice signal in the first voice signal that corresponds to the third rhyme signal sequence, and determine whether the full-pinyin voice signal is the voice signal corresponding to the wake-up word.
  15. An electronic device, comprising:
    a memory, configured to store a program; and
    a processor, coupled to the memory and configured to execute the program so as to:
    acquire a first voice signal;
    recognize the vowel signals contained in the first voice signal to obtain a first vowel signal sequence corresponding to the first voice signal;
    compare the first vowel signal sequence with a second vowel signal sequence of a preset wake-up word, so as to extract from the first vowel signal sequence a third vowel signal sequence whose content is identical to that of the second vowel signal sequence; and
    perform automatic speech recognition on the full voice signal in the first voice signal that corresponds to the third vowel signal sequence, and determine whether the full voice signal is the voice signal corresponding to the wake-up word.
PCT/CN2019/108828 2018-10-11 2019-09-29 Voice wake-up method, apparatus and system, and electronic device WO2020073839A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811186019.4A CN111048068B (en) 2018-10-11 2018-10-11 Voice wake-up method, device and system and electronic equipment
CN201811186019.4 2018-10-11

Publications (1)

Publication Number Publication Date
WO2020073839A1 (en) 2020-04-16

Family

ID=70164846

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108828 WO2020073839A1 (en) 2018-10-11 2019-09-29 Voice wake-up method, apparatus and system, and electronic device

Country Status (2)

Country Link
CN (1) CN111048068B (en)
WO (1) WO2020073839A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08314490A (en) * 1995-05-23 1996-11-29 Nippon Hoso Kyokai <Nhk> Word spotting type method and device for recognizing voice
CN107221325A (en) * 2016-03-22 2017-09-29 华硕电脑股份有限公司 Aeoplotropism keyword verification method and the electronic installation using this method
WO2018151772A1 (en) * 2017-02-14 2018-08-23 Google Llc Server side hotwording

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1741131B (en) * 2004-08-27 2010-04-14 中国科学院自动化研究所 Method and apparatus for identifying non-particular person isolating word voice
CN101819772B (en) * 2010-02-09 2012-03-28 中国船舶重工集团公司第七○九研究所 Phonetic segmentation-based isolate word recognition method
CN102208186B (en) * 2011-05-16 2012-12-19 南宁向明信息科技有限责任公司 Chinese phonetic recognition method
CN102970618A (en) * 2012-11-26 2013-03-13 河海大学 Video on demand method based on syllable identification
CN103745722B (en) * 2014-02-10 2017-02-08 上海金牌软件开发有限公司 Voice interaction smart home system and voice interaction method
KR101459050B1 (en) * 2014-05-08 2014-11-12 주식회사 소니스트 Apparatus and method for setting and releasing locking function of terminal
CN106847273B (en) * 2016-12-23 2020-05-05 北京云知声信息技术有限公司 Awakening word selection method and device for voice recognition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782005A (en) * 2021-01-18 2021-12-10 北京沃东天骏信息技术有限公司 Voice recognition method and device, storage medium and electronic equipment
CN113782005B (en) * 2021-01-18 2024-03-01 北京沃东天骏信息技术有限公司 Speech recognition method and device, storage medium and electronic equipment
US20230054011A1 (en) * 2021-08-20 2023-02-23 Beijing Xiaomi Mobile Software Co., Ltd. Voice collaborative awakening method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN111048068B (en) 2023-04-18
CN111048068A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
CN110718223B (en) Method, apparatus, device and medium for voice interaction control
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
WO2021051544A1 (en) Voice recognition method and device
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
US9940935B2 (en) Method and device for voiceprint recognition
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
US20150325240A1 (en) Method and system for speech input
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
US11676625B2 (en) Unified endpointer using multitask and multidomain learning
WO2014048113A1 (en) Voice recognition method and device
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN109036395A (en) Personalized speaker control method, system, intelligent sound box and storage medium
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN110853669B (en) Audio identification method, device and equipment
CN111128134B (en) Acoustic model training method, voice awakening method and device and electronic equipment
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112185357A (en) Device and method for simultaneously recognizing human voice and non-human voice
CN111833878A (en) Chinese voice interaction non-inductive control system and method based on raspberry Pi edge calculation
US11769491B1 (en) Performing utterance detection using convolution
CN110265003B (en) Method for recognizing voice keywords in broadcast signal
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
CN113168438A (en) User authentication method and device
US11991511B2 (en) Contextual awareness in dynamic device groups

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19870531; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19870531; Country of ref document: EP; Kind code of ref document: A1)