CN108320733B - Voice data processing method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN108320733B CN108320733B CN201711364085.1A CN201711364085A CN108320733B CN 108320733 B CN108320733 B CN 108320733B CN 201711364085 A CN201711364085 A CN 201711364085A CN 108320733 B CN108320733 B CN 108320733B
- Authority
- CN
- China
- Prior art keywords
- voice data
- awakening
- data
- voice
- level features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Abstract
The disclosure provides a voice data processing method and device, a storage medium, and an electronic device. The method comprises the following steps: acquiring voice data input by a user, wherein the voice data comprises awakening voice data that successfully awakens the intelligent terminal and control voice data representing an operation intention; extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features represent the pronunciation characteristics of the user and the semantic level features represent the text characteristics of the voice data; and taking the acoustic level features and/or the semantic level features as input, determining whether the awakening voice data is mistaken awakening data after processing by a pre-established voice discrimination model. According to the scheme, model optimization is carried out using awakening voice data from which the mistaken awakening data has been screened out, which improves the optimization performance of the awakening model.
Description
Technical Field
The present disclosure relates to the field of speech signal processing technologies, and in particular, to a method and an apparatus for processing speech data, a storage medium, and an electronic device.
Background
Voice wake-up technology is an important branch of the voice signal processing field and has important applications in smart homes, intelligent robots, in-vehicle systems, smartphones, and the like.
Generally, the voice wake-up process of the intelligent terminal may be embodied as follows: the intelligent terminal monitors whether a user inputs voice data; if voice data input by the user is received, acoustic features of the voice data can be extracted; then, taking the acoustic features as input, a pre-constructed wake-up model performs wake-up word recognition. If the recognition result is a wake-up word, the wake-up succeeds and the terminal continues monitoring for the user's operation intention; otherwise, the wake-up fails, and the terminal may continue monitoring for another wake-up attempt. The acoustic features may be embodied as spectral features of the voice data, such as Mel-Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Prediction (PLP) features.
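As a rough illustration of the feature-extraction step, the following numpy sketch computes frame-level log band energies as a simplified stand-in for the MFCC/PLP spectral features mentioned above. The frame sizes assume 16 kHz audio; the patent does not specify an implementation, so everything here is an assumed example.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def log_spectral_features(signal, frame_len=400, hop=160, n_bins=13):
    """Per-frame log power in n_bins coarse frequency bands (MFCC-like stand-in)."""
    frames = frame_signal(signal, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # per-frame power spectrum
    bands = np.array_split(np.arange(power.shape[1]), n_bins)  # coarse equal-width bands
    feats = np.stack([power[:, idx].sum(axis=1) for idx in bands], axis=1)
    return np.log(feats + 1e-10)                               # log-compress, avoid log(0)

# One second of 16 kHz noise stands in for captured wake-up audio.
rng = np.random.default_rng(0)
feats = log_spectral_features(rng.standard_normal(16000))
print(feats.shape)  # (98, 13): 98 frames, 13 band energies each
```

A real front end would use mel-spaced filterbanks and a DCT to obtain true MFCCs; the framing and log-energy structure shown here is the common core.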
Generally, the performance of the initial wake-up model is not optimal, and the model needs to be optimized continuously during use to improve its recognition accuracy. Specifically, voice data that successfully triggered a wake-up may be regarded as positive-example voice data, voice data that failed to trigger a wake-up may be regarded as negative-example voice data, and the current wake-up model may be trained and optimized based on a discriminative criterion.
In practical application, because the performance of the initial wake-up model is limited, false wake-up data may exist among the successfully woken voice data: background noise, interfering speech, or non-wake-up words pronounced similarly to the wake-up word may all falsely wake up the intelligent terminal. If such false wake-up data is used as positive-example voice data for model optimization, the performance of the wake-up model is likely to degrade.
Disclosure of Invention
The present disclosure provides a method and an apparatus for processing voice data, a storage medium, and an electronic device, which are helpful for improving the optimization performance of a wake-up model.
In order to achieve the above object, the present disclosure provides a voice data processing method, the method including:
acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
and taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data after processing by a pre-constructed voice discrimination model.
Optionally, the manner of acquiring the wake-up voice data is as follows:
judging whether at least two pieces of voice data for awakening the intelligent terminal are continuously acquired within a preset time period;
if at least two pieces of voice data used for awakening the intelligent terminal are continuously collected in the preset time period, and the score d of the at least two pieces of voice data after being processed by the current awakening model satisfies d_2 ≤ d < d_1, determining the at least two pieces of voice data as the awakening voice data, where d_1 is a first awakening score threshold and d_2 is a second awakening score threshold.
Optionally, the acoustic level features include an acoustic score of a current awakening model, and extracting the acoustic level features of the voice data includes:
acquiring the first N recognition results output by the current awakening model aiming at each voice unit of the awakening voice data;
if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition;
and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
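The recognition-accuracy statistic described above can be sketched as follows; the speech units and top-N lists are hypothetical stand-ins, since the patent does not fix a unit inventory:

```python
def acoustic_score(topn_results, correct_pronunciations):
    """Fraction of speech units whose correct pronunciation appears among the
    top-N recognition results output by the current wake-up model."""
    hits = sum(1 for topn, correct in zip(topn_results, correct_pronunciations)
               if correct in topn)
    return hits / len(correct_pronunciations)

# Hypothetical top-3 results for the four speech units of a wake-up phrase.
topn = [["d_i1", "t_i1", "d_i4"], ["d_ong1", "t_ong1", "d_ong4"],
        ["d_i4", "d_i1", "t_i1"], ["t_ong1", "d_ong4", "n_ong1"]]
correct = ["d_i1", "d_ong1", "d_i1", "d_ong1"]
print(acoustic_score(topn, correct))  # 0.75: the last unit's pronunciation is missing
```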
Optionally, the acoustic level features further include at least one of a fundamental frequency mean, a short-time mean energy, and a short-time zero-crossing rate;
and/or the presence of a gas in the gas,
if the acoustic level features further include unvoiced/voiced sequence features, extracting the acoustic level features of the voice data includes: taking at least one of the fundamental frequency mean value, the short-time average energy, and the short-time zero-crossing rate as input, and after processing by a pre-constructed unvoiced/voiced classifier, outputting the unvoiced/voiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data, where a_i represents the unvoiced/voiced category corresponding to the i-th phoneme of the awakening voice data; and calculating the similarity between the unvoiced/voiced sequence of the awakening voice data and the unvoiced/voiced sequence of the corresponding awakening word as the unvoiced/voiced sequence feature;
and/or the presence of a gas in the gas,
if the acoustic level features further include tone sequence features, extracting the acoustic level features of the voice data includes: taking at least one of the fundamental frequency mean value, the short-time average energy, and the short-time zero-crossing rate as input, and after processing by a pre-constructed tone classifier, outputting the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data, where b_j represents the tone category corresponding to the j-th syllable of the awakening voice data; and calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the corresponding awakening word as the tone sequence feature;
and/or the presence of a gas in the gas,
if the acoustic level features further include time features of voice units, extracting the acoustic level features of the voice data includes: counting the duration of each voice unit of the awakening voice data; and calculating a time mean value and a time variance from the durations of the voice units as the time features of each voice unit;
and/or the presence of a gas in the gas,
if the acoustic level features further include voiceprint features, extracting the acoustic level features of the voice data includes: extracting i-vector features of the awakening voice data as the voiceprint features by using a pre-constructed voiceprint extraction model;
and/or the presence of a gas in the gas,
if the acoustic level features further include energy distribution features, extracting the acoustic level features of the voice data includes: dividing the voice data into three parts c_{t-1}, c_t, c_{t+1}, and counting the average energy distribution of each part as the energy distribution features; where c_t represents the awakening voice data, c_{t+1} represents a voice data set, comprising the control voice data, collected after the awakening voice data, and c_{t-1} represents a voice data set collected before the awakening voice data.
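A minimal sketch of the energy distribution feature over the three parts c_{t-1}, c_t, c_{t+1}. The segment lengths and the normalisation into a distribution are assumptions for illustration, not prescribed by the claim:

```python
import numpy as np

def energy_distribution(pre_seg, wake_seg, post_seg):
    """Average energy of the segments before, during, and after the wake-up
    utterance, normalised so the three values form a distribution."""
    energies = np.array([np.mean(np.square(s)) for s in (pre_seg, wake_seg, post_seg)])
    return energies / energies.sum()

rng = np.random.default_rng(1)
pre = 0.01 * rng.standard_normal(8000)    # near-silence before the wake-up
wake = 0.5 * rng.standard_normal(8000)    # wake-up utterance
post = 0.3 * rng.standard_normal(16000)   # control utterance afterwards
dist = energy_distribution(pre, wake, post)
print(dist.round(3))
```

For genuine wake-ups one would expect most of the energy in c_t and c_{t+1}; a distribution dominated by c_{t-1} hints at background noise, which is what makes this a useful discriminating feature.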
Optionally, if the semantic level features include semantic smoothness, extracting the semantic level features of the voice data includes: performing word segmentation processing on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, where w_k represents the k-th word of the voice data; and calculating the probability that the f words appear in the order of the word sequence as the semantic smoothness;
and/or the presence of a gas in the gas,
if the semantic level features include the edit distance of a part-of-speech sequence, extracting the semantic level features of the voice data includes: performing word segmentation processing on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, where q_k represents the part of speech of the k-th word of the voice data; calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data; and selecting the minimum edit distance among them as the edit distance of the part-of-speech sequence, where the sample voice data is data that participated in training the voice discrimination model;
and/or the presence of a gas in the gas,
if the semantic level features include intention features, extracting the semantic level features of the voice data includes: extracting intention features of the control voice data by using a pre-constructed intention analysis model, wherein the intention features indicate the presence or absence of a clear intention, or comprise the intention category corresponding to the control voice data.
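The edit distance between part-of-speech sequences above is the standard Levenshtein distance; the tag names below are hypothetical:

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two part-of-speech (or other symbol) sequences."""
    m, n = len(seq_a), len(seq_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                               # delete all of seq_a's prefix
    for j in range(n + 1):
        dp[0][j] = j                               # insert all of seq_b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

# Hypothetical tag sequences: the utterance vs. two training samples;
# the minimum distance over all samples becomes the feature value.
utterance = ["pron", "verb", "noun"]
samples = [["pron", "verb", "noun"], ["verb", "noun", "adj", "noun"]]
print(min(edit_distance(utterance, s) for s in samples))  # 0: exact match with sample 1
```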
Optionally, the speech discrimination model is constructed in a manner that:
collecting sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is labeled as positive-example awakening voice data or negative-example awakening voice data, and the negative-example awakening voice data comprises mistaken awakening data and voice data that failed to awaken;
extracting acoustic level features and/or semantic level features of the sample voice data;
determining a topological structure of the voice discrimination model;
and training the voice discrimination model by using the topological structure and the acoustic level features and/or semantic level features of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
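The training step above leaves the model topology open, so the sketch below uses a simple logistic-regression discriminator on synthetic feature vectors as an assumed stand-in; the feature dimension, labels, and data are all fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: 200 labelled samples of 6 acoustic/semantic-level
# features; label 1 = positive-example awakening, 0 = negative-example.
X = rng.standard_normal((200, 6))
true_w = np.array([1.5, -2.0, 0.8, 0.0, 1.0, -0.5])
y = (X @ true_w + 0.1 * rng.standard_normal(200) > 0).astype(float)

# Logistic-regression discriminator trained by gradient descent on
# the cross-entropy loss, until it reproduces the labelled data types.
w = np.zeros(6)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probability of positive example
    w -= 0.1 * X.T @ (p - y) / len(y)    # average-gradient step

accuracy = np.mean(((X @ w) > 0) == y.astype(bool))
print(round(accuracy, 2))
```

The stopping criterion in the claim ("until the output data type matches the labelled data type") corresponds here to training until accuracy on the labelled samples is high; a production system would of course use held-out data and a richer model.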
Optionally, the method further comprises:
and optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
The present disclosure provides a voice data processing apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
the feature extraction module is used for extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
and the model processing module is used for taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data or not after the voice identification model which is constructed in advance is used for processing.
Optionally, the voice data acquiring module is configured to: judge whether at least two pieces of voice data for awakening the intelligent terminal are continuously collected within a preset time period; and if at least two pieces of voice data for awakening the intelligent terminal are continuously collected in the preset time period, and the score d of the at least two pieces of voice data after being processed by the current awakening model satisfies d_2 ≤ d < d_1, determine the at least two pieces of voice data as the awakening voice data, where d_1 is a first awakening score threshold and d_2 is a second awakening score threshold.
Optionally, the acoustic level features include an acoustic score of a current awakening model,
the feature extraction module is configured to obtain the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition; and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
Optionally, the acoustic level features further include at least one of a fundamental frequency mean, a short-time mean energy, and a short-time zero-crossing rate;
and/or the presence of a gas in the gas,
the acoustic level features further include unvoiced/voiced sequence features, and the feature extraction module is configured to: take at least one of the fundamental frequency mean value, the short-time average energy, and the short-time zero-crossing rate as input; after processing by a pre-constructed unvoiced/voiced classifier, output the unvoiced/voiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data, where a_i represents the unvoiced/voiced category corresponding to the i-th phoneme of the awakening voice data; and calculate the similarity between the unvoiced/voiced sequence of the awakening voice data and the unvoiced/voiced sequence of the corresponding awakening word as the unvoiced/voiced sequence feature;
and/or the presence of a gas in the gas,
the acoustic level features further include tone sequence features, and the feature extraction module is configured to: take at least one of the fundamental frequency mean value, the short-time average energy, and the short-time zero-crossing rate as input; after processing by a pre-constructed tone classifier, output the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data, where b_j represents the tone category corresponding to the j-th syllable of the awakening voice data; and calculate the similarity between the tone sequence of the awakening voice data and the tone sequence of the corresponding awakening word as the tone sequence feature;
and/or the presence of a gas in the gas,
the acoustic level features further include time features of voice units, and the feature extraction module is configured to count a duration of each voice unit of the wake-up voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or the presence of a gas in the gas,
the acoustic level features further comprise voiceprint features, and the feature extraction module is used for extracting i-vector features of the awakening voice data by using a pre-constructed voiceprint extraction model to serve as the voiceprint features;
and/or the presence of a gas in the gas,
the acoustic level features further include energy distribution features, and the feature extraction module is configured to divide the voice data into three parts c_{t-1}, c_t, c_{t+1} and count the average energy distribution of each part as the energy distribution features; where c_t represents the awakening voice data, c_{t+1} represents a voice data set, comprising the control voice data, collected after the awakening voice data, and c_{t-1} represents a voice data set collected before the awakening voice data.
Optionally, the semantic level features include semantic smoothness, and the feature extraction module is configured to perform word segmentation processing on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, where w_k represents the k-th word of the voice data, and to calculate the probability that the f words appear in the order of the word sequence as the semantic smoothness;
and/or the presence of a gas in the gas,
the semantic level features include the edit distance of a part-of-speech sequence, and the feature extraction module is configured to perform word segmentation processing on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, where q_k represents the part of speech of the k-th word of the voice data, to calculate the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data, and to select the minimum edit distance among them as the edit distance of the part-of-speech sequence, where the sample voice data is data that participated in training the voice discrimination model;
and/or the presence of a gas in the gas,
the semantic level features include intention features, and the feature extraction module is configured to extract intention features of the control voice data by using a pre-constructed intention analysis model, wherein the intention features indicate the presence or absence of a clear intention, or comprise the intention category corresponding to the control voice data.
Optionally, the apparatus further comprises:
the system comprises a sample voice data acquisition module, a sample voice data acquisition module and a sample control module, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises false awakening data and voice data failed to be awakened;
the sample feature extraction module is used for extracting the acoustic level features and/or the semantic level features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice distinguishing model;
and the model training module is used for training the voice discrimination model by utilizing the topological structure and the acoustic level characteristics and/or the semantic level characteristics of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
Optionally, the apparatus further comprises:
and the model optimization module is used for optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
The present disclosure provides a storage device having stored therein a plurality of instructions that are loaded by a processor to perform the steps of the above voice data processing method.
The present disclosure provides an electronic device, comprising:
the above-mentioned storage device; and
a processor to execute instructions in the storage device.
According to the above scheme, the awakening voice data that successfully awakens the intelligent terminal and the control voice data representing the operation intention can be collected; the acoustic level features representing the pronunciation characteristics of the user and/or the semantic level features representing the text characteristics of the voice data can be extracted; and, with these features as input to the voice discrimination model, whether the awakening voice data is mistaken awakening data can be determined after model processing. Because mistaken awakening data can thus be screened out of the awakening voice data, the scheme is beneficial to improving model optimization performance compared with the prior art, in which mistaken awakening data is treated as positive-example voice data during model optimization.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a schematic flow chart of a voice data processing method according to the disclosed embodiment;
FIG. 2 is a schematic flow chart of the construction of a speech discrimination model according to the present disclosure;
FIG. 3 is a schematic diagram of a voice data processing apparatus according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for voice data processing according to the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Referring to fig. 1, a flow diagram of the disclosed voice data processing method is shown. The method may include the following steps:
s101, voice data input by a user are obtained, wherein the voice data comprise awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention.
In general, the wake-up interaction process between the user and the intelligent terminal may be embodied as follows: the intelligent terminal monitors whether the user inputs voice data for awakening it; if such voice data is input and a wake-up word is recognized from it, the wake-up succeeds, and the terminal can continue to monitor whether the user inputs voice data for controlling it; if such voice data is input and an operation intention is recognized from it, the intelligent terminal can be controlled to execute the related operation.
In the scheme of the disclosure, the voice data which identifies the awakening words and successfully awakens the intelligent terminal can be called as awakening voice data; the voice data, which identifies the operation intention and controls the smart terminal to perform the related operation, may be referred to as control voice data.
It can be understood that, compared with other types of voice interaction, the wake-up interaction process has an obvious sense of discontinuity and can be abstracted as "silence + wake-up word + short pause + operation intention". For example, in "sil ding-dong sp I want to listen to Liu Dehua's music", "sil" represents the silence or background noise the intelligent terminal hears before the user wakes it up; "ding-dong" represents the wake-up word; "sp" represents the short pause between the wake-up voice data and the control voice data; and "I want to listen to Liu Dehua's music" represents the operation intention.
In order to improve the performance of model optimization, false wake-up data can be screened from the wake-up voice data; that is, whether the wake-up voice data is false wake-up data is judged, and if so, it can be determined to be negative-example voice data. Compared with the prior art, in which false wake-up data is regarded as positive-example voice data for model optimization, the scheme of the present disclosure is beneficial to improving model optimization performance.
As an example, the voice data processing procedure of the present disclosure may be triggered when the intelligent terminal is successfully woken up; alternatively, it may be triggered when other preset conditions are met, for example, when a preset amount of voice data has been collected or a preset time has been reached.
As an example, the intelligent terminal in the present disclosure may be an electronic device with a voice wake-up function, for example, a smart appliance, a mobile phone, a personal computer, or a tablet computer; in practical application, the voice data input by the user may be collected through a microphone of the intelligent terminal. The present disclosure places no particular limitation on the form of the intelligent terminal, the equipment used to acquire the voice data, and the like.
As an example, the wake-up voice data in the present disclosure may be voice data in which a wake-up word is recognized via the current wake-up model. For example, a first wake-up score threshold d1 may be set; if the score output after voice data used for waking up the intelligent terminal is processed by the current wake-up model is not lower than d1, the recognition result of the voice data is regarded as the wake-up word, and the voice data can be determined to be wake-up voice data.
As an example, the present disclosure may further obtain wake-up voice data in the following manner: judging whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; if at least two such pieces of voice data are continuously collected within the preset time period, and the scores d of the at least two pieces of voice data after being processed by the current wake-up model all satisfy d2 ≤ d < d1, determining the at least two pieces of voice data to be wake-up voice data.
According to practical application, if the user's first wake-up interaction fails, the user usually performs a second wake-up interaction quickly, and may even try multiple times until the wake-up succeeds or the user actively stops. For example, on the basis of the first wake-up score threshold d1 above, a second wake-up score threshold d2 may be set with d2 < d1; if the scores d output by the current wake-up model for at least two pieces of voice data continuously collected within a preset time period all fall within the interval [d2, d1), those pieces of voice data may be determined to be wake-up voice data. In this way, wake-up voice data whose scores are below d1 can be retained to some extent, enriching the data available for optimizing the current wake-up model.
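The two-threshold retention logic above can be sketched as follows (a minimal illustration: the threshold values, the function name, and the idea of passing raw scores are all hypothetical; in practice the scores would come from the current wake-up model):

```python
def collect_wake_voice_data(scores, d1=0.8, d2=0.5):
    """Decide which consecutive wake attempts (scored within a preset
    time window) to keep as wake-up voice data. d1 is the first
    wake-up score threshold, d2 < d1 the second; values illustrative."""
    # Normal case: any attempt scoring at or above d1 wakes the terminal.
    if any(s >= d1 for s in scores):
        return [s for s in scores if s >= d1]
    # Repeated near-misses: at least two attempts all inside [d2, d1)
    # are retained to enrich the model-optimization data.
    if len(scores) >= 2 and all(d2 <= s < d1 for s in scores):
        return list(scores)
    return []
```

For instance, two consecutive attempts scoring 0.6 and 0.7 would both be kept, while a lone 0.6 or a pair containing a score below d2 would not.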
S102, extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data.
After the voice data input by the user is obtained, the acoustic level features and/or semantic level features of the voice data can be extracted for processing and use by the voice discrimination model.
As one example, the acoustic level features may include the acoustic score of the current wake-up model. Optionally, in addition to the acoustic score, the acoustic level features may also include at least one of the following optional features: fundamental frequency mean, short-time average energy, short-time zero-crossing rate, unvoiced/voiced sequence feature, tone sequence feature, time feature of the voice unit, voiceprint feature, and energy distribution feature. It can be understood that the optional features fall into two types: one is original features extracted directly from the wake-up voice data, e.g., the fundamental frequency mean, short-time average energy, and short-time zero-crossing rate; the other is post-processed features of the wake-up voice data, e.g., the unvoiced/voiced sequence feature, tone sequence feature, time feature of the voice unit, voiceprint feature, and energy distribution feature.
As an example, the semantic level features may include at least one of the following features: semantic smoothness, edit distance of the part-of-speech sequence, and intention features.
For the meaning of each feature representation and the specific extraction process, reference is made to the description below, and the detailed description is omitted here.
S103, the acoustic level features and/or the semantic level features are used as input and are processed by a pre-established voice discrimination model, and whether the awakening voice data is mistaken awakening data or not is determined.
After the acoustic level features and/or semantic level features are extracted from the voice data, the pre-established voice discrimination model can be used to determine whether the wake-up voice data is false wake-up data. If so, the wake-up voice data can be classified as counter-example voice data; if not, it can be kept as positive-example voice data.
Taking the case where the current wake-up model includes a current foreground wake-up model and a current background wake-up model as an example, the process of optimizing the current wake-up model using wake-up voice data from which false wake-up data has been screened out is briefly described below.
As can be appreciated, the foreground wake-up model is used to describe the wake-up word, and the model training may be performed using the speech data containing the wake-up word; the background awakening model is used for describing non-awakening words, and the model training can be carried out by adopting voice data without the awakening words.
When the wake-up model is optimized, the wake-up voice data from which false wake-up data has been screened out can be used to update the current foreground wake-up model, and counter-example voice data, e.g., voice data that failed to wake up and false wake-up data, can be used to update the current background wake-up model. In this way, the discrimination between the foreground and background scoring paths can be increased, improving the voice recognition accuracy of the updated wake-up model. The specific optimization process can be realized with reference to the related art and is not detailed here.
As an example, only the current foreground wake-up model may be updated, i.e. the current foreground wake-up model may be optimized using only wake-up speech data that has been screened for false wake-up data. The model updating mode may be determined specifically in combination with actual application requirements, and the scheme of the present disclosure may not be limited thereto.
The following explains the acoustic level features and semantic level features in the present disclosure, respectively.
1. Acoustic level features
(1) Acoustic score of current wake-up model for reflecting recognition accuracy of wake-up word
As an example, the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data may be obtained; if the first N recognition results of a voice unit contain the correct pronunciation of that voice unit, the voice unit is judged to be correctly recognized; and the recognition accuracy of the wake-up voice data is counted according to the recognition result of each voice unit and used as the acoustic score of the current wake-up model.
For example, the speech unit may be embodied as a basic recognition unit of the current wake model, such as a phoneme, a syllable, etc., which may not be specifically limited by the present disclosure.
Taking syllable-level voice units as an example, the wake-up word "ding-dong" can be divided into 4 voice units "ding", "dong", "ding", and "dong". If N is 3, then for the first voice unit "ding", the recognition probabilities output by the current wake-up model for that unit can be obtained, and the 3 candidates with the highest probability are taken as the recognition results of the unit; if the correct pronunciation of "ding" exists among these 3 results, the unit is judged to be correctly recognized. By analogy, the recognition results of the other 3 voice units are obtained, and the recognition accuracy of the wake-up voice data, i.e., the ratio of the number of correctly recognized voice units to the total number of voice units, is then calculated as the acoustic score of the current wake-up model.
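The top-N accuracy computation above can be sketched as follows (the function name and data layout are hypothetical; the candidate lists would come from the current wake-up model):

```python
def acoustic_score(topn_candidates, correct_units, n=3):
    """topn_candidates[i] lists the model's candidate pronunciations
    for the i-th voice unit, most probable first; correct_units[i] is
    that unit's correct pronunciation. A unit counts as correctly
    recognized if its correct pronunciation appears among the first n
    candidates; the returned ratio is the acoustic score."""
    correct = sum(
        1 for cands, ref in zip(topn_candidates, correct_units)
        if ref in cands[:n]
    )
    return correct / len(correct_units)
```

With four voice units and N = 3, one unit whose top-3 list misses its correct pronunciation yields a score of 0.75.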
It can be understood that the value of N in the present disclosure may be N ≧ 1, which may be specifically set in combination with practical application requirements, and the present disclosure is not limited to this.
(2) Original features of the wake-up speech data, e.g., fundamental frequency mean, short-time average energy, short-time zero-crossing rate
Generally, when a person speaks, the speech signal can be classified into unvoiced and voiced sound according to whether the vocal cords vibrate. Voiced sound carries most of the energy in speech and exhibits significant periodicity in the time domain; unvoiced sound resembles white noise and has no obvious periodicity. When voiced sound is produced, the airflow passing through the glottis makes the vocal cords vibrate in a relaxation-oscillation manner, producing a quasi-periodic excitation pulse train; the frequency of this vocal-cord vibration is called the pitch frequency, or fundamental frequency for short. The fundamental frequency is generally related to the speaker's vocal cords, pronunciation habits, and the like, and can reflect personal characteristics to some extent.
As an example, the fundamental frequency mean can be extracted as follows: perform framing processing on the wake-up voice data to obtain a plurality of voice data frames, extract the fundamental frequency corresponding to each frame, and then calculate the fundamental frequency mean of the wake-up voice data from the per-frame fundamental frequencies.
In addition, it should be noted that the short-time average energy can be used as a characteristic parameter for distinguishing unvoiced and voiced sound; moreover, when the signal-to-noise ratio is high, it can also be used to distinguish speech from silence.
The short-time zero-crossing rate refers to the number of times the waveform of the speech signal crosses the horizontal axis (zero level) within one frame of voice data. Generally, the energy of voiced sound is concentrated in the low frequency band while the energy of unvoiced sound is concentrated in the high frequency band, so the zero-crossing rate can reflect frequency content to some extent: voiced segments tend to have a lower zero-crossing rate, and unvoiced segments a higher one.
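A minimal sketch of per-frame short-time average energy and short-time zero-crossing rate (the frame length and function names are illustrative assumptions, e.g. 10 ms frames at 16 kHz):

```python
def split_frames(signal, frame_len=160, hop=160):
    """Frame a sampled signal into non-overlapping windows."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Average squared amplitude over the frame."""
    return sum(s * s for s in frame) / len(frame)

def short_time_zcr(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)
```

An alternating ±1 frame yields a zero-crossing rate of 1.0, while a constant frame yields 0.0, matching the intuition that noisy/unvoiced content crosses zero more often.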
The method for obtaining the fundamental frequency mean value, the short-time average energy and the short-time zero-crossing rate in the scheme of the disclosure is not limited, and can be realized by referring to the related technology, and the details are not described here.
(3) Unvoiced/voiced sequence feature, for reflecting the unvoiced/voiced characteristics of phonemes in the wake-up speech data
As an example, at least one of the fundamental frequency mean, the short-time average energy, and the short-time zero-crossing rate can be taken as input and, after processing by a pre-constructed unvoiced/voiced classifier, the unvoiced/voiced sequence {a1, a2, …, ai, …, am} of the wake-up voice data is output, where ai represents the unvoiced/voiced category corresponding to the ith phoneme of the wake-up voice data; the similarity between the unvoiced/voiced sequence of the wake-up voice data and the unvoiced/voiced sequence of the wake-up word corresponding to the wake-up voice data is then calculated as the unvoiced/voiced sequence feature.
For example, the unvoiced/voiced categories of phonemes may be unvoiced and voiced, represented for example by "0" and "1", which is not specifically limited in the present disclosure.
It can be understood that the intelligent terminal may only store one wake-up word, that is, know in advance what the wake-up word corresponding to the wake-up voice data is; or, the intelligent terminal may store a plurality of wake-up words, and for this, what the wake-up word corresponding to the wake-up voice data is may be identified by using the current wake-up model, which is not specifically limited in the present disclosure.
As an example, the unvoiced and turbid sequence of the wakeup word may be stored in the intelligent terminal and directly read when the similarity needs to be calculated; or, when the similarity needs to be calculated, the unvoiced and voiced sequence of the wakeup word can be determined in real time by using the unvoiced and voiced classifier, which may not be specifically limited by the scheme of the present disclosure.
As an example, the similarity of the unvoiced/voiced sequences may be calculated by means of an exclusive-or operation: if the unvoiced/voiced categories of the phonemes at a given position are the same, e.g., both are voiced and thus both represented by "1", the exclusive-or result at that position is 0; otherwise the result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the smaller the number of non-zero results, the higher the similarity.
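The exclusive-or comparison just described can be sketched as follows (the function name is hypothetical; the sequences are strings of per-phoneme category labels):

```python
def label_sequence_distance(seq_a, seq_b):
    """XOR the category labels position by position and count the
    non-zero results; a smaller count means higher similarity.
    Both sequences are assumed to have equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)
```

The same position-wise comparison applies to the tone sequences of feature (4), with labels "1" to "4" instead of "0"/"1".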
It is to be understood that the surd-turbid classifier in the present disclosure may employ a common classification model, for example, a support vector machine model, a neural network model, etc., and this may not be particularly limited in the present disclosure.
(4) Tone sequence features reflecting tone characteristics of syllables in the wake-up speech data
As an example, at least one of the fundamental frequency mean, the short-time average energy, and the short-time zero-crossing rate can be taken as input and, after processing by a pre-constructed tone classifier, the tone sequence {b1, b2, …, bj, …, bn} of the wake-up voice data is output, where bj represents the tone category corresponding to the jth syllable of the wake-up voice data; the similarity between the tone sequence of the wake-up voice data and the tone sequence of the wake-up word corresponding to the wake-up voice data is then calculated as the tone sequence feature.
Taking Chinese as an example, the tone category of a syllable can be represented by 4 common tones, and the identifiers "1", "2", "3" and "4" can be used to represent different tones; alternatively, the tone category of a syllable may also be determined in combination with other languages, which may not be specifically limited by the present disclosure.
As can be seen from the above description, no matter the intelligent terminal stores one wake-up word or stores a plurality of wake-up words, the wake-up word corresponding to the wake-up voice data can be determined, and then the tone sequence of the wake-up word is obtained, which can be specifically described with reference to the characteristics of the unvoiced and turbid sequence in (3), and details are not described here.
As an example, the similarity of the tone sequences may be calculated by means of an exclusive-or operation: if the tone categories of the syllables at a given position are the same, e.g., both are the fourth tone of Chinese represented by "4", the exclusive-or result at that position is 0; otherwise the result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the smaller the number of non-zero results, the higher the similarity.
It is to be understood that the pitch classifier in the present disclosure may employ a common classification model, for example, a support vector machine model, a neural network model, etc., and the present disclosure is not limited thereto.
(5) Time characteristics of voice unit for reflecting abnormal condition of awakening voice data during voice unit segmentation
As an example, forced segmentation may be performed on the wake-up voice data based on the speech recognition result obtained by the current wake-up model, to obtain the start time and end time of each voice unit and thereby the duration of each voice unit; the time mean and time variance can then be calculated from the per-unit durations as the time features of the voice units.
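The time mean and variance can be computed as follows (a sketch; the (start, end) pairs are assumed to come from forced segmentation, and the function name is hypothetical):

```python
def unit_time_features(segments):
    """segments: list of (start_time, end_time) pairs in seconds, one
    per voice unit, from forced segmentation. Returns (mean, variance)
    of the unit durations, used as the time features of the units."""
    durations = [end - start for start, end in segments]
    mean = sum(durations) / len(durations)
    variance = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, variance
```

A unit that is abnormally long or short inflates the variance, which is exactly the segmentation anomaly this feature is meant to flag.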
In general, the time characteristic of the phonetic unit can reflect the abnormal condition of the phonetic unit in the segmentation process, for example, the duration of the individual phonetic unit is too long or too short, which does not conform to the normal speech form. As an example, the speech units may be embodied as phonemes, syllables, etc., which may not be specifically limited by the present disclosure.
(6) Voiceprint features for reflecting physiological and behavioral features of a speaker
As an example, i-vector features of the wake-up speech data can be extracted as voiceprint features using a pre-constructed voiceprint extraction model. For example, voiceprint features can be extracted by using a voiceprint extraction model such as DNN I-Vector, GMM-UBM I-Vector, etc., which may not be specifically limited in the present disclosure.
It can be understood that the voiceprint feature reflects personalized features of the speaker, and generally, the voiceprint feature of the speaker does not change in a short time, so that the i-vector feature can be extracted from the control voice data or the whole voice data including the wake-up voice data and the control voice data by using a voiceprint extraction model, which is not particularly limited in the present disclosure.
(7) Energy distribution characteristics for reflecting characteristics of wake-up interaction process
As an example, the voice data may be segmented into three portions ct-1, ct, ct+1, and the average energy distribution of each portion counted; e.g., the average energy distributions of the three portions can be expressed as gt-1, gt, gt+1, thereby obtaining the energy distribution feature.
As an example, the way of extracting the energy distribution feature may be embodied as: and respectively carrying out frame processing on the 3 parts of voice data to obtain a voice data frame included in each part, then extracting energy corresponding to each frame, and further calculating the average energy of each part by using the energy corresponding to each frame.
In connection with the wake-up interaction example mentioned above, "sil ding-dong sp I want to listen to Liu Dehua's music", the voice data can be divided into 3 parts by recognizing the wake-up word. Here, ct represents the wake-up voice data; ct-1 represents the voice data collected before the wake-up voice data, typically a silence segment or background noise; and ct+1 represents the voice data collected after the wake-up voice data, typically the short pause and the operation intention. Understandably, the durations of ct-1 and ct+1 may be determined flexibly, for example according to VAD (Voice Activity Detection) information, or may be set to a fixed duration, for example 1s to 5s, which is not limited in the present disclosure.
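The three-segment energy distribution can be sketched as follows (frame length and names are illustrative; in practice the segment boundaries would come from wake-word recognition and VAD):

```python
def energy_distribution(c_prev, c_wake, c_next, frame_len=160):
    """Average per-frame energy (g_{t-1}, g_t, g_{t+1}) of the three
    segments c_{t-1} (before wake-up), c_t (wake-up word), and
    c_{t+1} (after wake-up)."""
    def avg_energy(signal):
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len + 1, frame_len)]
        if not frames:
            return 0.0
        energies = [sum(s * s for s in f) / len(f) for f in frames]
        return sum(energies) / len(energies)
    return avg_energy(c_prev), avg_energy(c_wake), avg_energy(c_next)
```

In a genuine wake-up, gt-1 (silence or noise) is typically much lower than gt; when the wake-up word merely occurs inside a conversation, the three values are more even.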
Compared with a typical false wake-up case in which the wake-up word merely appears in daily conversation, for example, "I think ding-dong is a good name", the energy distribution of the wake-up interaction process in the scheme of the present disclosure is obviously different.
2. Semantic level features
(1) Semantic smoothness
As an example, the voice data may be word-segmented to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk represents the kth word of the voice data; then, the probability that the f words appear in sequence according to the order of the word sequence is calculated and used as the semantic smoothness.
For example, semantic smoothness in the present disclosure may be embodied as the forward semantic smoothness P(w1, w2, …, wf) in the direction from w1 to wf, and/or the reverse semantic smoothness P(wf, wf-1, …, w1) in the direction from wf to w1. Taking the forward semantic smoothness as an example, it can be calculated by the following formula:

P(w1, w2, …, wf) = P(w1) · P(w2|w1) · … · P(wf|wf-1)

where P(wk|wk-1) may be obtained based on statistics over the sample speech data that participate in speech recognition model training.
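A forward bigram computation of this kind can be sketched as follows (the probability tables here are hypothetical stand-ins for statistics estimated from the training data; the log domain avoids underflow for long sequences):

```python
import math

def forward_smoothness(words, unigram_p, bigram_p):
    """P(w1, ..., wf) = P(w1) * prod over k of P(w_k | w_{k-1}) under a
    bigram model. unigram_p maps w -> P(w); bigram_p maps the pair
    (w_{k-1}, w_k) -> P(w_k | w_{k-1})."""
    log_p = math.log(unigram_p[words[0]])
    for prev, cur in zip(words, words[1:]):
        log_p += math.log(bigram_p[(prev, cur)])
    return math.exp(log_p)
```

A fluent operation-intention utterance yields a noticeably higher product than a random word sequence scored with the same tables.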
(2) Edit distance of part-of-speech sequence
As an example, the voice data may be word-segmented to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk represents the part of speech of the kth word of the voice data; the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data is calculated, and the minimum edit distance among them is selected as the edit distance of the part-of-speech sequence. The sample voice data is the data participating in training the voice discrimination model.
The part-of-speech sequence feature can reflect semantic information to a certain extent, and is particularly salient in the wake-up interaction process. In the present disclosure, the part-of-speech sequence feature may be embodied as the Edit Distance of the part-of-speech sequence, i.e., the minimum number of editing operations required to transform one sequence into the other; generally, the smaller the edit distance, the greater the similarity between the two sequences.
If the part-of-speech sequence of a piece of sample speech data is denoted as {p1, p2, …, ph}, the edit distance d[f,h] between {q1, q2, …, qf} and {p1, p2, …, ph} can be calculated with the standard edit-distance recurrence:

d[i,0] = i, d[0,j] = j
d[i,j] = min( d[i-1,j] + 1, d[i,j-1] + 1, d[i-1,j-1] + (0 if qi = pj, else 1) )
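The edit-distance computation can be implemented with standard dynamic programming (a sketch; the sequence elements here stand for part-of-speech tags):

```python
def edit_distance(q, p):
    """Minimum number of insert/delete/substitute operations turning
    sequence q into sequence p (Levenshtein distance)."""
    f, h = len(q), len(p)
    d = [[0] * (h + 1) for _ in range(f + 1)]
    for i in range(f + 1):
        d[i][0] = i          # delete all of q[:i]
    for j in range(h + 1):
        d[0][j] = j          # insert all of p[:j]
    for i in range(1, f + 1):
        for j in range(1, h + 1):
            sub = 0 if q[i - 1] == p[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[f][h]
```

The feature is then the minimum of this distance over the part-of-speech sequences of all sample voice data.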
It is to be understood that the sample speech data may be all data that participate in the speech discrimination model training; or, the regular data screened from all the data may be used as sample voice data to calculate the edit distance, which may not be specifically limited in the present disclosure, as long as the minimum edit distance is determined.
(3) Intention characteristics
As an example, the intention features of the control speech data may be extracted using a pre-constructed intention analysis model, and the intention features include explicit intentions or no explicit intentions, or the intention features include intention categories corresponding to the control speech data.
The disclosed scheme may pre-construct an intention analysis model for determining operation-intention tendencies. For example, the intention analysis model may be embodied as a classifier whose output represents clear intention or no clear intention; or the intention analysis model may be embodied as a regression model whose outputs are scores for the various intention categories, and the intention category corresponding to the control voice data can be determined from the scores, for example, by taking the top M intention categories with the highest scores. The value of M may be M ≥ 1 and may be set in combination with actual application requirements, which is not limited by the present disclosure. For example, an intention category may be playing music, querying the weather, etc., depending on the actual application.
The following explains the process of constructing the speech discrimination model in the present disclosure. Referring specifically to the flowchart shown in fig. 2, the method may include the following steps:
S201, sample voice data is collected, wherein the sample voice data includes sample wake-up voice data and sample control voice data; the data type of the sample wake-up voice data is labeled as positive-example wake-up voice data or negative-example wake-up voice data, and the negative-example wake-up voice data includes false wake-up data and voice data that failed to wake up.
During model training, a large amount of sample voice data can be collected, embodied as sample wake-up voice data and sample control voice data. In addition, the sample wake-up voice data can be labeled with its data type, for example positive-example wake-up voice data or negative-example wake-up voice data; negative-example wake-up voice data can be further labeled as false wake-up data or voice data that failed to wake up.
S202, extracting the acoustic level features and/or semantic level features of the sample voice data.
The specific implementation process can be described by reference to the above, and is not detailed here.
S203, determining the topological structure of the voice discrimination model.
As an example, the topology in the present disclosure can be embodied as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), and the like, which is not particularly limited in this disclosure.
As an example, the output layer of the neural network may include 2 output nodes, which respectively represent the positive example wake-up voice data and the false wake-up data, for example, "0" may be used to represent the positive example wake-up voice data, and "1" may be used to represent the false wake-up data. Alternatively, the output layer of the neural network may contain 1 output node, representing the probability that the wake-up voice data is determined to be false wake-up data. The specific representation form of the neural network in the scheme of the disclosure can not be limited.
S204, training the voice discrimination model by using the topological structure and the acoustic level features and/or semantic level features of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
After the topological structure of the model is determined and the acoustic level features and/or semantic level features of the sample voice data are extracted, model training can be performed. As an example, the training process may adopt the cross-entropy criterion and update the model parameters with the common stochastic gradient descent method, ensuring that when training completes, the data type output by the model for the sample wake-up voice data is the same as the labeled data type.
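A toy illustration of the cross-entropy + stochastic gradient descent training idea (a single sigmoid unit stands in for the CNN/RNN/DNN topologies mentioned above; the feature vectors, label convention, and hyperparameters are all hypothetical):

```python
import math
import random

def train_discriminator(samples, labels, lr=0.5, epochs=200, seed=0):
    """Train a one-neuron discriminator with cross-entropy loss and
    per-sample SGD. samples: feature vectors (acoustic/semantic
    features); labels: 1 = false wake-up data, 0 = positive-example
    wake-up voice data."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for idx in order:
            x, y = samples[idx], labels[idx]
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid output
            g = p - y                        # dCrossEntropy/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 if the sample is judged to be false wake-up data, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0
```

On linearly separable toy features this converges quickly; real models would of course use a deep topology and far more data.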
As an example, the speech discrimination model may be a generic model, i.e. not built for a certain or some specific wake words; or, the voice discrimination model may be an individualized model, that is, different voice discrimination models are constructed for different wake-up words, which may not be specifically limited by the present disclosure.
Referring to fig. 3, a schematic diagram of the voice data processing apparatus of the present disclosure is shown. The apparatus may include:
the voice data acquisition module 301 is configured to acquire voice data input by a user, where the voice data includes wake-up voice data for successfully waking up the intelligent terminal and control voice data representing an operation intention;
a feature extraction module 302, configured to extract acoustic level features and/or semantic level features of the voice data, where the acoustic level features are used to represent pronunciation features of a user, and the semantic level features are used to represent text features of the voice data;
the model processing module 303 is configured to determine whether the awakening voice data is false awakening data after the acoustic level feature and/or the semantic level feature are/is used as input and processed by a pre-established voice discrimination model.
Optionally, the voice data acquisition module is configured to: judge whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; and if at least two such pieces of voice data are continuously collected within the preset time period, and the scores d of the at least two pieces of voice data after being processed by the current wake-up model all satisfy d2 ≤ d < d1, determine the at least two pieces of voice data to be the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic level features include an acoustic score of the current wake-up model,
the feature extraction module is configured to obtain the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition; and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
Optionally, the acoustic level features further include at least one of a fundamental frequency mean, a short-time mean energy, and a short-time zero-crossing rate;
and/or the presence of a gas in the gas,
the acoustic level features further comprise an unvoiced/voiced sequence feature, and the feature extraction module is configured to take at least one of the fundamental frequency mean, the short-time average energy, and the short-time zero-crossing rate as input and, after processing by a pre-constructed unvoiced/voiced classifier, output the unvoiced/voiced sequence {a1, a2, …, ai, …, am} of the wake-up voice data, where ai represents the unvoiced/voiced category corresponding to the ith phoneme of the wake-up voice data; and calculate the similarity between the unvoiced/voiced sequence of the wake-up voice data and the unvoiced/voiced sequence of the wake-up word corresponding to the wake-up voice data, as the unvoiced/voiced sequence feature;
and/or the presence of a gas in the gas,
the acoustic level features further comprise a tone sequence feature, and the feature extraction module is configured to take at least one of the fundamental frequency mean, the short-time average energy, and the short-time zero-crossing rate as input and, after processing by a pre-constructed tone classifier, output the tone sequence {b1, b2, …, bj, …, bn} of the wake-up voice data, where bj represents the tone category corresponding to the jth syllable of the wake-up voice data; and calculate the similarity between the tone sequence of the wake-up voice data and the tone sequence of the wake-up word corresponding to the wake-up voice data, as the tone sequence feature;
and/or the presence of a gas in the gas,
the acoustic level features further include time features of voice units, and the feature extraction module is configured to count a duration of each voice unit of the wake-up voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or the presence of a gas in the gas,
the acoustic level features further comprise voiceprint features, and the feature extraction module is used for extracting i-vector features of the awakening voice data by using a pre-constructed voiceprint extraction model to serve as the voiceprint features;
and/or the presence of a gas in the gas,
the acoustic level features further comprise an energy distribution feature, and the feature extraction module is configured to divide the voice data into three parts ct-1, ct, ct+1 and count the average energy distribution of each part as the energy distribution feature; where ct represents the wake-up voice data, ct+1 represents the voice data set that is collected after the wake-up voice data and includes the control voice data, and ct-1 represents the voice data set collected before the wake-up voice data.
Optionally, the semantic level features include semantic smoothness, and the feature extraction module is used for performing word segmentation on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, wherein w_k represents the k-th word of the voice data; and for calculating the probability that the f words appear in sequence, in the order of the word sequence, as the semantic smoothness;
and/or,
the semantic level features comprise an edit distance of a part-of-speech sequence, and the feature extraction module is used for performing word segmentation on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, wherein q_k represents the part of speech of the k-th word of the voice data; and for calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data and selecting the minimum edit distance among them as the edit distance of the part-of-speech sequence, wherein the sample voice data is data used for training the voice discrimination model;
and/or,
the semantic level features comprise an intention feature, and the feature extraction module is used for extracting the intention feature of the control voice data with a pre-constructed intention analysis model, wherein the intention feature indicates a clear intention or no clear intention, or the intention feature includes the intention category corresponding to the control voice data.
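The part-of-speech edit distance described above is a standard Levenshtein distance taken over POS tags, minimized over the sample set; a self-contained sketch (the tag values and sample sequences are illustrative):

```python
def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two part-of-speech sequences,
    computed row by row to keep memory at O(len(seq_b))."""
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, 1):
        curr = [i]
        for j, b in enumerate(seq_b, 1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def pos_edit_distance_feature(pos_seq, sample_pos_seqs):
    """Minimum edit distance to any sample part-of-speech sequence."""
    return min(edit_distance(pos_seq, s) for s in sample_pos_seqs)

# Illustrative POS tags ('r' = pronoun, 'v' = verb, 'n' = noun).
samples = [['r', 'v', 'n'], ['v', 'n']]
feature = pos_edit_distance_feature(['r', 'v', 'v', 'n'], samples)  # 1
```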
Optionally, the apparatus further comprises:
the sample voice data acquisition module is used for collecting sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is labeled as positive-example awakening voice data or negative-example awakening voice data, and the negative-example awakening voice data comprises false awakening data and voice data that failed to wake up;
the sample feature extraction module is used for extracting the acoustic level features and/or the semantic level features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice discrimination model;
and the model training module is used for training the voice discrimination model by utilizing the topological structure and the acoustic level characteristics and/or the semantic level characteristics of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
Optionally, the apparatus further comprises:
and the model optimization module is used for optimizing the current awakening model by using the awakening voice data from which the false awakening data has been screened out.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 4, a schematic structural diagram of an electronic device 400 for voice data processing according to the present disclosure is shown. Referring to fig. 4, electronic device 400 includes a processing component 401 that further includes one or more processors, and storage resources, represented by storage medium 402, for storing instructions, such as application programs, that are executable by processing component 401. The application stored in the storage medium 402 may include one or more modules that each correspond to a set of instructions. Further, the processing component 401 is configured to execute instructions to perform the above-described voice data processing method.
The preferred embodiments of the present disclosure have been described in detail with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner; to avoid unnecessary repetition, the possible combinations are not described again in the present disclosure.
In addition, any combination of the various embodiments of the present disclosure may be made, and such combinations should likewise be considered disclosure of the present disclosure, as long as they do not depart from the spirit of the present disclosure.
Claims (16)
1. A method of processing speech data, the method comprising:
acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is false awakening data after processing by a pre-constructed voice discrimination model; wherein the data type of the sample awakening voice data used for constructing the voice discrimination model comprises positive-example awakening voice data or negative-example awakening voice data.
2. The method of claim 1, wherein the wake-up voice data is obtained by:
judging whether at least two pieces of voice data for awakening the intelligent terminal are continuously acquired within a preset time period;
if at least two pieces of voice data for waking up the intelligent terminal are continuously collected within the preset time period, and the score d of the at least two pieces of voice data after processing by the current awakening model satisfies d_2 ≤ d < d_1, determining the at least two pieces of voice data for waking up the intelligent terminal as the awakening voice data, wherein d_1 is a first awakening score threshold and d_2 is a second awakening score threshold.
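The two-threshold condition of claim 2 (d_2 ≤ d < d_1) can be sketched as follows; the threshold values, and the assumption that every score in the window must fall in the uncertain band, are illustrative choices not fixed by the claim:

```python
def is_awakening_voice_data(scores, d1=0.9, d2=0.5, min_count=2):
    """Claim-2-style check: at least `min_count` utterances were
    collected in the preset window, and every wake score d falls in
    the band d2 <= d < d1, i.e. neither a confident wake (d >= d1)
    nor a clear reject (d < d2)."""
    if len(scores) < min_count:
        return False
    return all(d2 <= d < d1 for d in scores)

# Two borderline attempts in the window -> flagged as awakening voice
# data to be re-checked by the voice discrimination model.
print(is_awakening_voice_data([0.6, 0.7]))   # True
print(is_awakening_voice_data([0.95, 0.7]))  # False: first was a confident wake
```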
3. The method of claim 1 or 2, wherein the acoustic level features comprise an acoustic score of a current awakening model, and extracting the acoustic level features of the voice data comprises:
acquiring the top N recognition results output by the current awakening model for each voice unit of the awakening voice data;
if the top N recognition results of a voice unit contain the correct pronunciation of that voice unit, determining that the voice unit is correctly recognized;
and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, the recognition accuracy serving as the acoustic score of the current awakening model.
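The top-N acoustic-score computation of claim 3 can be sketched as follows; the data layout and the phone labels are illustrative assumptions:

```python
def acoustic_score(top_n_results, correct_phones):
    """Recognition accuracy of the awakening voice data: a voice unit
    counts as correctly recognized when its correct pronunciation
    appears among the top-N results the current awakening model
    output for that unit."""
    assert len(top_n_results) == len(correct_phones)
    hits = sum(1 for top_n, phone in zip(top_n_results, correct_phones)
               if phone in top_n)
    return hits / len(correct_phones)

# Top-3 candidates per voice unit for a two-unit awakening word.
top_n = [['d', 't', 'n'], ['a', 'e', 'o']]
score = acoustic_score(top_n, ['d', 'i'])  # first unit hit, second missed
```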
4. The method of claim 3,
the acoustic level features further comprise at least one of a fundamental frequency mean, a short-time average energy and a short-time zero-crossing rate;
and/or,
if the acoustic level features further include a voiced/unvoiced sequence feature, extracting the acoustic level features of the voice data includes: taking at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate as input and, after processing by a pre-constructed voiced/unvoiced classifier, outputting the voiced/unvoiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data, wherein a_i represents the voiced/unvoiced category corresponding to the i-th phoneme of the awakening voice data; and calculating the similarity between the voiced/unvoiced sequence of the awakening voice data and the voiced/unvoiced sequence of the awakening word corresponding to the awakening voice data as the voiced/unvoiced sequence feature;
and/or,
if the acoustic level features further include a tone sequence feature, extracting the acoustic level features of the voice data includes: taking at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate as input and, after processing by a pre-constructed tone classifier, outputting the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data, wherein b_j represents the tone category corresponding to the j-th syllable of the awakening voice data; and calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data as the tone sequence feature;
and/or,
if the acoustic level features further include time features of voice units, extracting the acoustic level features of the voice data includes: counting the duration of each voice unit of the awakening voice data; and calculating a time mean and a time variance from the durations of the voice units as the time features of the voice units;
and/or,
if the acoustic level features further include a voiceprint feature, extracting the acoustic level features of the voice data includes: extracting an i-vector feature of the awakening voice data with a pre-constructed voiceprint extraction model as the voiceprint feature;
and/or,
if the acoustic level features further include an energy distribution feature, extracting the acoustic level features of the voice data includes: dividing the voice data into three parts c_{t-1}, c_t and c_{t+1}, and counting the average energy distribution of each part as the energy distribution feature; wherein c_t represents the awakening voice data, c_{t+1} represents a voice data set, collected after the awakening voice data, that comprises the control voice data, and c_{t-1} represents a voice data set collected before the awakening voice data.
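The time features of claim 4 (mean and variance of the per-voice-unit durations) can be sketched as follows; population variance is assumed, since the claim does not specify sample versus population variance:

```python
def time_features(durations):
    """Time mean and population variance of the per-voice-unit
    durations (e.g. in seconds) of the awakening voice data."""
    n = len(durations)
    mean = sum(durations) / n
    variance = sum((d - mean) ** 2 for d in durations) / n
    return mean, variance

# Illustrative per-phone durations for an awakening word.
mean, var = time_features([2.0, 4.0])  # mean 3.0, variance 1.0
```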
5. The method according to claim 1 or 2,
if the semantic level features include semantic smoothness, extracting the semantic level features of the voice data includes: performing word segmentation on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, wherein w_k represents the k-th word of the voice data; and calculating the probability that the f words appear in sequence, in the order of the word sequence, as the semantic smoothness;
and/or,
if the semantic level features include an edit distance of a part-of-speech sequence, extracting the semantic level features of the voice data includes: performing word segmentation on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, wherein q_k represents the part of speech of the k-th word of the voice data; and calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data, and selecting the minimum edit distance among them as the edit distance of the part-of-speech sequence, wherein the sample voice data is data participating in training the voice discrimination model;
and/or,
if the semantic level features include an intention feature, extracting the semantic level features of the voice data includes: extracting the intention feature of the control voice data with a pre-constructed intention analysis model, wherein the intention feature indicates a clear intention or no clear intention, or the intention feature includes the intention category corresponding to the control voice data.
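The semantic smoothness of claim 5 (the probability that the f segmented words appear in order) can be sketched with a bigram chain; the model order and the probability tables here are illustrative assumptions, not fixed by the claim:

```python
def semantic_smoothness(words, bigram_prob, unigram_prob):
    """Probability that the f segmented words appear in sequence,
    sketched with a bigram language model:
    P(w1) * P(w2|w1) * ... * P(wf|w(f-1)).
    Unseen words/pairs fall back to a small floor probability."""
    if not words:
        return 0.0
    p = unigram_prob.get(words[0], 1e-6)
    for prev, curr in zip(words, words[1:]):
        p *= bigram_prob.get((prev, curr), 1e-6)  # unseen-pair floor
    return p

# Toy probabilities for a segmented command "turn on the light".
uni = {'turn': 0.1}
bi = {('turn', 'on'): 0.5, ('on', 'the'): 0.4, ('the', 'light'): 0.2}
smoothness = semantic_smoothness(['turn', 'on', 'the', 'light'], bi, uni)
```

A fluent command yields a much larger chain probability than a random word sequence, which is what makes the feature useful for spotting false awakenings.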
6. The method according to claim 1 or 2, wherein the speech discrimination model is constructed by:
collecting sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is labeled as positive-example awakening voice data or negative-example awakening voice data, and the negative-example awakening voice data comprises false awakening data and voice data that failed to wake up;
extracting acoustic level features and/or semantic level features of the sample voice data;
determining a topological structure of the voice discrimination model;
and training the voice discrimination model by using the topological structure and the acoustic level features and/or semantic level features of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
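The training loop of claim 6 (train until every sample's predicted data type matches its label) can be sketched as follows; the claim does not fix the model family (a DNN or CNN would be typical), so a tiny perceptron stands in for the voice discrimination model purely to keep the sketch self-contained:

```python
def train_discrimination_model(features, labels, lr=0.1, max_epochs=1000):
    """Iterate over (feature vector, labeled data type) pairs,
    updating the model until the predicted type of every sample
    equals its label, mirroring the stopping rule of claim 6."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(features, labels):
            err = y - predict(x)
            if err:
                errors += 1
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        if errors == 0:  # all sample outputs match their labels
            break
    return predict

# Toy acoustic/semantic feature vectors: label 1 = positive-example
# awakening data, 0 = negative example (values illustrative).
model = train_discrimination_model([[1.0, 0.9], [0.1, 0.2]], [1, 0])
```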
7. The method according to claim 1 or 2, characterized in that the method further comprises:
and optimizing the current awakening model by using the awakening voice data from which the false awakening data has been screened out.
8. A speech data processing apparatus, characterized in that the apparatus comprises:
the voice data acquisition module is used for acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
the feature extraction module is used for extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
the model processing module is used for taking the acoustic level features and/or the semantic level features as input and determining whether the awakening voice data is false awakening data after processing by a pre-constructed voice discrimination model; wherein the data type of the sample awakening voice data used for constructing the voice discrimination model comprises positive-example awakening voice data or negative-example awakening voice data.
9. The apparatus of claim 8,
the voice data acquisition module is used for judging whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; and, if at least two pieces of voice data for waking up the intelligent terminal are continuously collected within the preset time period, and the score d of the at least two pieces of voice data after processing by the current awakening model satisfies d_2 ≤ d < d_1, determining the at least two pieces of voice data as the awakening voice data, wherein d_1 is a first awakening score threshold and d_2 is a second awakening score threshold.
10. The apparatus of claim 8 or 9, wherein the acoustic level features comprise an acoustic score of a current awakening model,
the feature extraction module is configured to acquire the top N recognition results output by the current awakening model for each voice unit of the awakening voice data; if the top N recognition results of a voice unit contain the correct pronunciation of that voice unit, determine that the voice unit is correctly recognized; and count the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, the recognition accuracy serving as the acoustic score of the current awakening model.
11. The apparatus of claim 10,
the acoustic level features further comprise at least one of a fundamental frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or,
the acoustic level features further comprise a voiced/unvoiced sequence feature, and the feature extraction module is used for taking at least one of a fundamental frequency mean, a short-time average energy and a short-time zero-crossing rate as input and, after processing by a pre-constructed voiced/unvoiced classifier, outputting the voiced/unvoiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data, wherein a_i represents the voiced/unvoiced category corresponding to the i-th phoneme of the awakening voice data; and for calculating the similarity between the voiced/unvoiced sequence of the awakening voice data and the voiced/unvoiced sequence of the awakening word corresponding to the awakening voice data as the voiced/unvoiced sequence feature;
and/or,
the acoustic level features further comprise a tone sequence feature, and the feature extraction module is used for taking at least one of a fundamental frequency mean, a short-time average energy and a short-time zero-crossing rate as input and, after processing by a pre-constructed tone classifier, outputting the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data, wherein b_j represents the tone category corresponding to the j-th syllable of the awakening voice data; and for calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data as the tone sequence feature;
and/or,
the acoustic level features further comprise time features of voice units, and the feature extraction module is used for counting the duration of each voice unit of the awakening voice data and calculating a time mean and a time variance from the durations of the voice units as the time features of the voice units;
and/or,
the acoustic level features further comprise a voiceprint feature, and the feature extraction module is used for extracting an i-vector feature of the awakening voice data with a pre-constructed voiceprint extraction model as the voiceprint feature;
and/or,
the acoustic level features further comprise an energy distribution feature, and the feature extraction module is used for dividing the voice data into three parts c_{t-1}, c_t and c_{t+1} and counting the average energy distribution of each part as the energy distribution feature; wherein c_t represents the awakening voice data, c_{t+1} represents a voice data set, collected after the awakening voice data, that comprises the control voice data, and c_{t-1} represents a voice data set collected before the awakening voice data.
12. The apparatus according to claim 8 or 9,
the semantic level features comprise semantic smoothness, and the feature extraction module is used for performing word segmentation on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, wherein w_k represents the k-th word of the voice data; and for calculating the probability that the f words appear in sequence, in the order of the word sequence, as the semantic smoothness;
and/or,
the semantic level features comprise an edit distance of a part-of-speech sequence, and the feature extraction module is used for performing word segmentation on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, wherein q_k represents the part of speech of the k-th word of the voice data; and for calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data and selecting the minimum edit distance among them as the edit distance of the part-of-speech sequence, wherein the sample voice data is data participating in training the voice discrimination model;
and/or,
the semantic level features comprise an intention feature, and the feature extraction module is used for extracting the intention feature of the control voice data with a pre-constructed intention analysis model, wherein the intention feature indicates a clear intention or no clear intention, or the intention feature includes the intention category corresponding to the control voice data.
13. The apparatus of claim 8 or 9, further comprising:
the sample voice data acquisition module is used for collecting sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is labeled as positive-example awakening voice data or negative-example awakening voice data, and the negative-example awakening voice data comprises false awakening data and voice data that failed to wake up;
the sample feature extraction module is used for extracting the acoustic level features and/or the semantic level features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice discrimination model;
and the model training module is used for training the voice discrimination model by utilizing the topological structure and the acoustic level characteristics and/or the semantic level characteristics of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
14. The apparatus of claim 8 or 9, further comprising:
and the model optimization module is used for optimizing the current awakening model by using the awakening voice data from which the false awakening data has been screened out.
15. A storage device having stored therein a plurality of instructions, wherein said instructions are loaded by a processor for performing the steps of the method of any of claims 1 to 7.
16. An electronic device, characterized in that the electronic device comprises:
the storage device of claim 15; and
a processor to execute instructions in the storage device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711364085.1A CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711364085.1A CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108320733A CN108320733A (en) | 2018-07-24 |
CN108320733B true CN108320733B (en) | 2022-01-04 |
Family
ID=62893086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711364085.1A Active CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108320733B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109119079B (en) * | 2018-07-25 | 2022-04-01 | 天津字节跳动科技有限公司 | Voice input processing method and device |
CN108831471B (en) * | 2018-09-03 | 2020-10-23 | 重庆与展微电子有限公司 | Voice safety protection method and device and routing terminal |
CN109036412A (en) * | 2018-09-17 | 2018-12-18 | 苏州奇梦者网络科技有限公司 | voice awakening method and system |
EP3857541B1 (en) * | 2018-09-30 | 2023-07-19 | Microsoft Technology Licensing, LLC | Speech waveform generation |
CN110444210B (en) * | 2018-10-25 | 2022-02-08 | 腾讯科技(深圳)有限公司 | Voice recognition method, awakening word detection method and device |
CN111261143B (en) * | 2018-12-03 | 2024-03-22 | 嘉楠明芯(北京)科技有限公司 | Voice wakeup method and device and computer readable storage medium |
CN109671435B (en) * | 2019-02-21 | 2020-12-25 | 三星电子(中国)研发中心 | Method and apparatus for waking up smart device |
CN110070863A (en) * | 2019-03-11 | 2019-07-30 | 华为技术有限公司 | A kind of sound control method and device |
CN110060665A (en) * | 2019-03-15 | 2019-07-26 | 上海拍拍贷金融信息服务有限公司 | Word speed detection method and device, readable storage medium storing program for executing |
CN110049107B (en) * | 2019-03-22 | 2022-04-08 | 钛马信息网络技术有限公司 | Internet vehicle awakening method, device, equipment and medium |
CN110534098A (en) * | 2019-10-09 | 2019-12-03 | 国家电网有限公司客户服务中心 | A kind of the speech recognition Enhancement Method and device of age enhancing |
CN110992940B (en) | 2019-11-25 | 2021-06-15 | 百度在线网络技术(北京)有限公司 | Voice interaction method, device, equipment and computer-readable storage medium |
CN111128155B (en) * | 2019-12-05 | 2020-12-01 | 珠海格力电器股份有限公司 | Awakening method, device, equipment and medium for intelligent equipment |
CN111640426A (en) * | 2020-06-10 | 2020-09-08 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN112037772B (en) * | 2020-09-04 | 2024-04-02 | 平安科技(深圳)有限公司 | Response obligation detection method, system and device based on multiple modes |
CN112530442B (en) * | 2020-11-05 | 2023-11-17 | 广东美的厨房电器制造有限公司 | Voice interaction method and device |
CN115039169A (en) * | 2021-01-06 | 2022-09-09 | 京东方科技集团股份有限公司 | Voice instruction recognition method, electronic device and non-transitory computer readable storage medium |
CN112951235B (en) * | 2021-01-27 | 2022-08-16 | 北京云迹科技股份有限公司 | Voice recognition method and device |
CN113436615B (en) * | 2021-07-06 | 2023-01-03 | 南京硅语智能科技有限公司 | Semantic recognition model, training method thereof and semantic recognition method |
CN117784632B (en) * | 2024-02-28 | 2024-05-14 | 深圳市轻生活科技有限公司 | Intelligent household control system based on offline voice recognition |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106272481A (en) * | 2016-08-15 | 2017-01-04 | 北京光年无限科技有限公司 | The awakening method of a kind of robot service and device |
CN106297777A (en) * | 2016-08-11 | 2017-01-04 | 广州视源电子科技股份有限公司 | A kind of method and apparatus waking up voice service up |
FI20156000A (en) * | 2015-12-22 | 2017-06-23 | Code-Q Oy | Speech recognition method and apparatus based on a wake-up call |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727903B (en) * | 2008-10-29 | 2011-10-19 | 中国科学院自动化研究所 | Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems |
US8635237B2 (en) * | 2009-07-02 | 2014-01-21 | Nuance Communications, Inc. | Customer feedback measurement in public places utilizing speech recognition technology |
US9117449B2 (en) * | 2012-04-26 | 2015-08-25 | Nuance Communications, Inc. | Embedded system for construction of small footprint speech recognition with user-definable constraints |
CN102999161B (en) * | 2012-11-13 | 2016-03-02 | 科大讯飞股份有限公司 | A kind of implementation method of voice wake-up module and application |
CN103474069B (en) * | 2013-09-12 | 2016-03-30 | 中国科学院计算技术研究所 | For merging the method and system of the recognition result of multiple speech recognition system |
CN103943105A (en) * | 2014-04-18 | 2014-07-23 | 安徽科大讯飞信息科技股份有限公司 | Voice interaction method and system |
CN104281645B (en) * | 2014-08-27 | 2017-06-16 | 北京理工大学 | A kind of emotion critical sentence recognition methods interdependent based on lexical semantic and syntax |
CN104538030A (en) * | 2014-12-11 | 2015-04-22 | 科大讯飞股份有限公司 | Control system and method for controlling household appliances through voice |
EP3067884B1 (en) * | 2015-03-13 | 2019-05-08 | Samsung Electronics Co., Ltd. | Speech recognition system and speech recognition method thereof |
CN105096939B (en) * | 2015-07-08 | 2017-07-25 | 百度在线网络技术(北京)有限公司 | voice awakening method and device |
CN105702253A (en) * | 2016-01-07 | 2016-06-22 | 北京云知声信息技术有限公司 | Voice awakening method and device |
CN106653056B (en) * | 2016-11-16 | 2020-04-24 | 中国科学院自动化研究所 | Fundamental frequency extraction model and training method based on LSTM recurrent neural network |
CN106782554B (en) * | 2016-12-19 | 2020-09-25 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device based on artificial intelligence |
CN107223280B (en) * | 2017-03-03 | 2021-01-08 | 深圳前海达闼云端智能科技有限公司 | Robot awakening method and device and robot |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN107464564B (en) * | 2017-08-21 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Voice interaction method, device and equipment |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FI20156000A (en) * | 2015-12-22 | 2017-06-23 | Code-Q Oy | Speech recognition method and apparatus based on a wake-up call |
CN106297777A (en) * | 2016-08-11 | 2017-01-04 | 广州视源电子科技股份有限公司 | A kind of method and apparatus waking up voice service up |
CN106272481A (en) * | 2016-08-15 | 2017-01-04 | 北京光年无限科技有限公司 | The awakening method of a kind of robot service and device |
Non-Patent Citations (2)
Title |
---|
Wake-up-word spotting using end-to-end deep neural network system; Shilei Zhang, et al.; 2016 23rd International Conference on Pattern Recognition (ICPR); IEEE; 2017-04-24; full text *
Software and hardware implementation of an intelligent voice set-top box; Shi Weijia et al.; Telecommunication Science (《电信科学》); CNKI; 2017-10-20; Vol. 33, No. 10; full text *
Also Published As
Publication number | Publication date |
---|---|
CN108320733A (en) | 2018-07-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108320733B (en) | Voice data processing method and device, storage medium and electronic equipment | |
US11854545B2 (en) | Privacy mode based on speaker identifier | |
US11657832B2 (en) | User presence detection | |
US11636851B2 (en) | Multi-assistant natural language input processing | |
JP6705008B2 (en) | Speaker verification method and system | |
US20210090575A1 (en) | Multi-assistant natural language input processing | |
CN107767861B (en) | Voice awakening method and system and intelligent terminal | |
US11562739B2 (en) | Content output management based on speech quality | |
JP2018523156A (en) | Language model speech end pointing | |
CN113168832A (en) | Alternating response generation | |
US11302329B1 (en) | Acoustic event detection | |
US11393477B2 (en) | Multi-assistant natural language input processing to determine a voice model for synthesized speech | |
CN114051639A (en) | Emotion detection using speaker baseline | |
CN108536668B (en) | Wake-up word evaluation method and device, storage medium and electronic equipment | |
JP6915637B2 (en) | Information processing equipment, information processing methods, and programs | |
CN108269574B (en) | Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment | |
CN110808050A (en) | Voice recognition method and intelligent equipment | |
CN114155882B (en) | Method and device for judging emotion of road anger based on voice recognition | |
CN111833869B (en) | Voice interaction method and system applied to urban brain | |
US11430435B1 (en) | Prompts for user feedback | |
WO2021061512A1 (en) | Multi-assistant natural language input processing | |
CN117612519A (en) | Voice awakening method, device, equipment and medium | |
CN115691478A (en) | Voice wake-up method and device, man-machine interaction equipment and storage medium | |
CN115705840A (en) | Voice wake-up method and device, electronic equipment and readable storage medium | |
CN111696551A (en) | Device control method, device, storage medium, and electronic apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||