CN108320733B - Voice data processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN108320733B
CN108320733B (application CN201711364085.1A)
Authority
CN
China
Prior art keywords
voice data
awakening
data
voice
level features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711364085.1A
Other languages
Chinese (zh)
Other versions
CN108320733A (en)
Inventor
吴国兵
潘嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Iflytek Information Technology Co ltd
Original Assignee
Shanghai Iflytek Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Iflytek Information Technology Co ltd filed Critical Shanghai Iflytek Information Technology Co ltd
Priority to CN201711364085.1A priority Critical patent/CN108320733B/en
Publication of CN108320733A publication Critical patent/CN108320733A/en
Application granted granted Critical
Publication of CN108320733B publication Critical patent/CN108320733B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/90 Pitch determination of speech signals
    • G10L 25/93 Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephone Function (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a voice data processing method and device, a storage medium and an electronic device. The method comprises the following steps: acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention; extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of the user, and the semantic level features are used for representing text features of the voice data; and taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data or not after processing by a pre-established voice discrimination model. According to the scheme, model optimization is carried out using the awakening voice data from which the mistaken awakening data has been screened out, which improves the optimization performance of the awakening model.

Description

Voice data processing method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech signal processing technologies, and in particular, to a method and an apparatus for processing speech data, a storage medium, and an electronic device.
Background
The voice awakening technology is an important branch in the technical field of voice signal processing, and has important application in the aspects of intelligent home, intelligent robots, intelligent car machines, intelligent mobile phones and the like.
Generally, the voice wake-up process of the smart terminal may be embodied as: the intelligent terminal monitors whether a user inputs voice data or not, and if the voice data input by the user is received, acoustic features of the voice data can be extracted; then, taking the acoustic characteristics as input, carrying out awakening word recognition by a pre-constructed awakening model, if the recognition result is an awakening word, awakening successfully, and continuously monitoring whether the user inputs an operation intention; otherwise, the awakening fails, and the monitoring of whether the user awakens the intelligent terminal again can be continued. The acoustic features may be embodied as spectral features of the voice data, such as Mel Frequency Cepstrum Coefficient (MFCC) features, Perceptual Linear Prediction (PLP) features, and the like.
Generally, the performance of the initial wake-up model cannot reach an optimal level, and model optimization needs to be performed continuously during use to improve the recognition accuracy of the model. Specifically, voice data that successfully wakes up the terminal may be regarded as regular (positive-example) voice data, voice data that fails to wake it up may be regarded as counter-example voice data, and the current wake-up model may be trained and optimized based on a discriminative criterion.
In the practical application process, because the performance of the initial wake-up model is not high, false wake-up data may exist in the successfully awakened voice data; for example, background noise, interfering human voices, or non-wake-up words pronounced similarly to the wake-up word may all falsely wake up the intelligent terminal. If such false wake-up data is used as regular voice data for model optimization, the performance of the wake-up model is likely to be degraded.
Disclosure of Invention
The present disclosure provides a method and an apparatus for processing voice data, a storage medium, and an electronic device, which are helpful for improving the optimization performance of a wake-up model.
In order to achieve the above object, the present disclosure provides a voice data processing method, the method including:
acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
and taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data or not after processing by a pre-established voice discrimination model.
Optionally, the manner of acquiring the wake-up voice data is as follows:
judging whether at least two pieces of voice data for awakening the intelligent terminal are continuously acquired within a preset time period;
if at least two pieces of voice data used for awakening the intelligent terminal are continuously collected within the preset time period, and the score d of the at least two pieces of voice data after being processed by the current awakening model satisfies d2 ≤ d < d1, determining the at least two pieces of voice data for waking up the intelligent terminal as the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic level features include an acoustic score of the current wake-up model, and extracting the acoustic level features of the voice data includes:
acquiring the first N recognition results output by the current awakening model aiming at each voice unit of the awakening voice data;
if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition;
and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
Optionally, the acoustic level features further include at least one of a fundamental frequency mean, a short-time mean energy, and a short-time zero-crossing rate;
and/or,
if the acoustic level features further include the unvoiced and voiced sequence feature, extracting the acoustic level features of the voice data includes: taking at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate as input, and after processing by a pre-constructed unvoiced/voiced classifier, outputting the unvoiced and voiced sequence {a1, a2, …, ai, …, am} of the awakening voice data, where ai represents the voiced category corresponding to the i-th phoneme of the awakening voice data; and calculating the similarity between the voiced and unvoiced sequence of the awakening voice data and the voiced and unvoiced sequence of the awakening word corresponding to the awakening voice data, as the voiced and unvoiced sequence feature;
and/or,
if the acoustic level features further include tone sequence features, extracting the acoustic level features of the speech data includes: taking at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate as input, and after processing by a pre-constructed tone classifier, outputting the tone sequence {b1, b2, …, bj, …, bn} of the awakening voice data, where bj represents the tone category corresponding to the j-th syllable of the wake-up voice data; and calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data, as the tone sequence feature;
and/or,
the acoustic level features further include time features of a speech unit, and extracting the acoustic level features of the speech data includes: counting the duration of each voice unit of the awakening voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or,
the acoustic level features further include voiceprint features, and extracting the acoustic level features of the speech data includes: extracting i-vector features of the awakening voice data by using a pre-constructed voiceprint extraction model, to serve as the voiceprint features;
and/or,
if the acoustic level features further include energy distribution features, extracting the acoustic level features of the speech data includes: dividing the voice data into three parts ct-1, ct and ct+1, and counting the average energy distribution of each part as the energy distribution features; wherein ct represents the wake-up speech data, ct+1 represents a voice data set comprising the control voice data collected after the wake-up speech data, and ct-1 represents a voice data set collected before the wake-up speech data.
Optionally, if the semantic level features include semantic smoothness, extracting the semantic level features of the voice data includes: performing word segmentation processing on the voice data to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk represents the k-th word of the voice data; and calculating the probability of the f words appearing in the order of the word sequence, as the semantic smoothness;
and/or,
if the semantic level features include the edit distance of the part-of-speech sequence, extracting the semantic level features of the voice data includes: performing word segmentation processing on the voice data to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk represents the part of speech of the k-th word of the voice data; and calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data, and selecting the minimum edit distance among them as the edit distance of the part-of-speech sequence, wherein the sample voice data is data participating in training the voice discrimination model;
and/or,
if the semantic level features include intent features, extracting the semantic level features of the voice data includes: and extracting intention characteristics of the control voice data by using a pre-constructed intention analysis model, wherein the intention characteristics comprise clear intentions or no clear intentions, or the intention characteristics comprise intention categories corresponding to the control voice data.
Optionally, the speech discrimination model is constructed in a manner that:
collecting sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises false awakening data and voice data failed to be awakened;
extracting acoustic level features and/or semantic level features of the sample voice data;
determining a topological structure of the voice discrimination model;
and training the voice discrimination model by using the topological structure and the acoustic level features and/or semantic level features of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
Optionally, the method further comprises:
and optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
The present disclosure provides a voice data processing apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
the feature extraction module is used for extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
and the model processing module is used for taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data or not after processing by a pre-constructed voice discrimination model.
Optionally, the voice data acquiring module is configured to determine whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; and if at least two pieces of voice data used for waking up the intelligent terminal are continuously collected within the preset time period, and the score d of the at least two pieces of voice data after being processed by the current wake-up model satisfies d2 ≤ d < d1, determine the at least two pieces of voice data for waking up the intelligent terminal as the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic level features include an acoustic score of the current wake-up model,
the feature extraction module is configured to obtain the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition; and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
Optionally, the acoustic level features further include at least one of a fundamental frequency mean, a short-time mean energy, and a short-time zero-crossing rate;
and/or,
the acoustic level features further comprise a voiced-unvoiced sequence feature, and the feature extraction module is used for taking at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate as input and, after processing by a pre-constructed voiced-unvoiced classifier, outputting the voiced-unvoiced sequence {a1, a2, …, ai, …, am} of the awakening voice data, where ai represents the voiced category corresponding to the i-th phoneme of the awakening voice data; and calculating the similarity between the voiced-unvoiced sequence of the awakening voice data and the voiced-unvoiced sequence of the awakening word corresponding to the awakening voice data, as the voiced-unvoiced sequence feature;
and/or,
the acoustic level features further comprise tone sequence features, and the feature extraction module is used for taking at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate as input and, after processing by a pre-constructed tone classifier, outputting the tone sequence {b1, b2, …, bj, …, bn} of the awakening voice data, where bj represents the tone category corresponding to the j-th syllable of the wake-up voice data; and calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data, as the tone sequence feature;
and/or,
the acoustic level features further include time features of voice units, and the feature extraction module is configured to count a duration of each voice unit of the wake-up voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or,
the acoustic level features further comprise voiceprint features, and the feature extraction module is used for extracting i-vector features of the awakening voice data by using a pre-constructed voiceprint extraction model to serve as the voiceprint features;
and/or,
the acoustic level features further comprise energy distribution features, and the feature extraction module is used for dividing the voice data into three parts ct-1, ct and ct+1 and counting the average energy distribution of each part as the energy distribution features; wherein ct represents the wake-up speech data, ct+1 represents a voice data set comprising the control voice data collected after the wake-up speech data, and ct-1 represents a voice data set collected before the wake-up speech data.
Optionally, the semantic level features include semantic smoothness, and the feature extraction module is configured to perform word segmentation processing on the voice data to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk represents the k-th word of the voice data; and to calculate the probability of the f words appearing in the order of the word sequence, as the semantic smoothness;
and/or,
the semantic level features comprise the edit distance of a part-of-speech sequence, and the feature extraction module is used for performing word segmentation processing on the voice data to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk represents the part of speech of the k-th word of the voice data; and for calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data, and selecting the minimum edit distance among them as the edit distance of the part-of-speech sequence, wherein the sample voice data is data participating in training the voice discrimination model;
and/or,
the semantic level features comprise intention features, and the feature extraction module is used for extracting the intention features of the control voice data by utilizing a pre-constructed intention analysis model, wherein the intention features comprise clear intentions or no clear intentions, or the intention features comprise intention categories corresponding to the control voice data.
Optionally, the apparatus further comprises:
the system comprises a sample voice data acquisition module, a sample voice data acquisition module and a sample control module, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises false awakening data and voice data failed to be awakened;
the sample feature extraction module is used for extracting the acoustic level features and/or the semantic level features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice distinguishing model;
and the model training module is used for training the voice discrimination model by utilizing the topological structure and the acoustic level characteristics and/or the semantic level characteristics of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
Optionally, the apparatus further comprises:
and the model optimization module is used for optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
The present disclosure provides a storage device having stored therein a plurality of instructions, the instructions being loaded by a processor, for performing the steps of the above-described voice data processing method.
The present disclosure provides an electronic device, comprising:
the above-mentioned storage device; and
a processor to execute instructions in the storage device.
According to the scheme of the disclosure, the awakening voice data that successfully wakes up the intelligent terminal and the control voice data representing the operation intention can be collected, the acoustic level features representing the pronunciation features of the user and/or the semantic level features representing the text features of the voice data can be extracted, the acoustic level features and/or the semantic level features can be taken as the input of the voice discrimination model, and whether the awakening voice data is mistaken awakening data can be determined after model processing. According to the scheme, the mistaken awakening data can be screened out from the awakening voice data; compared with the prior art, in which the mistaken awakening data is used as correct voice data for model optimization, this helps to improve the model optimization performance.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a schematic flow chart of a voice data processing method according to the disclosed embodiment;
FIG. 2 is a schematic flow chart of the construction of a speech discrimination model according to the present disclosure;
FIG. 3 is a schematic diagram of a voice data processing apparatus according to the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for voice data processing according to the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Referring to fig. 1, a flow diagram of the disclosed voice data processing method is shown. May include the steps of:
s101, voice data input by a user are obtained, wherein the voice data comprise awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention.
In general, the wake-up interaction process between the user and the smart terminal may be embodied as: the intelligent terminal monitors whether the voice data for awakening the intelligent terminal is input by the user, if the voice data is input and the awakening word is recognized according to the voice data, awakening is successful, whether the voice data for controlling the intelligent terminal is input by the user can be continuously monitored, and if the voice data is input and the operation intention is recognized according to the voice data, the intelligent terminal can be controlled to execute related operations.
In the scheme of the disclosure, the voice data which identifies the awakening words and successfully awakens the intelligent terminal can be called as awakening voice data; the voice data, which identifies the operation intention and controls the smart terminal to perform the related operation, may be referred to as control voice data.
It can be understood that, compared with other types of voice interaction, the wake-up interaction process has an obvious sense of discontinuity and can be abstracted as "silence + wake-up word + short pause + operation intention". For example, "sil ding-dong sp I want to listen to Liu Dehua's music", where "sil" represents the silence or background noise that the smart terminal hears before the user wakes it up; "ding-dong" represents the wake-up word; "sp" represents a short pause between the wake-up voice data and the control voice data; and "I want to listen to Liu Dehua's music" represents the operation intention.
In order to improve the performance of model optimization, false wake-up data can be screened out from the wake-up voice data; that is, it is judged whether the wake-up voice data is false wake-up data, and if so, it can be determined to be counter-example voice data. Compared with the prior art, in which false wake-up data is treated as correct voice data for model optimization, the scheme of the disclosure helps to improve the model optimization performance.
As an example, the scheme of the present disclosure may be triggered to execute the voice data processing procedure when the intelligent terminal is successfully awakened; or, the voice data processing process may be triggered to be executed when other preset conditions are met, for example, the preset conditions may be that a preset number of voice data are collected, a preset time is reached, and the like.
As an example, the intelligent terminal in the present disclosure may be an electronic device with a voice wake-up function, for example, may be an intelligent appliance, a mobile phone, a personal computer, a tablet computer, or the like; in the practical application process, the voice data input by the user can be collected through the microphone of the intelligent terminal, and the expression form of the intelligent terminal, the equipment for acquiring the voice data and the like can be not particularly limited by the scheme disclosed by the invention.
As an example, the wake-up speech data in the present disclosure may be speech data in which a wake-up word is recognized via the current wake-up model. For example, a first wake-up score threshold d1 may be set; if the score output after voice data used for waking up the intelligent terminal is processed by the current wake-up model is not lower than d1, the recognition result of the voice data is regarded as the wake-up word, and it can be determined to be wake-up voice data.
As an example, the present disclosure may further obtain wake-up voice data in the following manner: judging whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; if at least two pieces of voice data used for waking up the intelligent terminal are continuously collected within the preset time period, and the score d of the at least two pieces of voice data after being processed by the current wake-up model satisfies d2 ≤ d < d1, determining the at least two pieces of voice data for waking up the intelligent terminal as wake-up voice data.
In practical application, if the first wake-up interaction fails, the user will typically perform a second wake-up interaction quickly, and may even perform the wake-up interaction multiple times until it succeeds or the user actively gives up. For example, on the basis of the first wake-up score threshold d1 above, a second wake-up score threshold d2 may be set, with d2 < d1. If, for at least two pieces of voice data used for waking up the intelligent terminal that are continuously collected within a preset time period, the score d output after processing by the current wake-up model lies in the interval [d2, d1), the two pieces of voice data may be determined as wake-up voice data. In this way, wake-up voice data whose score is below d1 can be retained to some extent, enriching the data available for optimizing the current wake-up model.
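For illustration only, the following minimal Python sketch shows this two-threshold selection logic; the threshold values, the time window, and the attempt list are assumed example values, not values prescribed by this disclosure:

    # Hypothetical sketch of the two-threshold selection of wake-up voice data.
    D1 = 0.80       # assumed first wake-up score threshold
    D2 = 0.60       # assumed second wake-up score threshold
    WINDOW_S = 10.0  # assumed "preset time period" in seconds

    def select_wake_data(attempts):
        """attempts: list of (timestamp_s, score) for consecutive wake-up attempts."""
        selected = []
        for (t, d) in attempts:
            if d >= D1:
                # score reaches the first threshold: recognized as the wake-up word
                selected.append((t, d))
            elif D2 <= d < D1:
                # keep only if at least two such attempts fall within the same window
                near = [(t2, s2) for (t2, s2) in attempts
                        if abs(t2 - t) <= WINDOW_S and D2 <= s2 < D1]
                if len(near) >= 2:
                    selected.append((t, d))
        return selected

    print(select_wake_data([(0.0, 0.65), (3.0, 0.70), (20.0, 0.95)]))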
S102, extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data.
After the voice data input by the user is obtained, the acoustic level features and/or semantic level features of the voice data can be extracted for processing and use by the voice discrimination model.
As one example, the acoustic level features may include an acoustic score of the current wake-up model. Optionally, in addition to the acoustic score of the current wake-up model, the acoustic level features may also include at least one of the following optional features: fundamental frequency mean, short-time average energy, short-time zero-crossing rate, unvoiced/voiced sequence feature, tone sequence feature, time feature of a voice unit, voiceprint feature, and energy distribution feature. It will be appreciated that the optional features can be divided into two types: one type is original features extracted directly from the wake-up speech data, e.g., the fundamental frequency mean, short-time average energy and short-time zero-crossing rate; the other type is post-processing features of the wake-up speech data, such as the unvoiced/voiced sequence feature, tone sequence feature, time feature of a voice unit, voiceprint feature and energy distribution feature.
As an example, the semantic level features may include at least one of the following features: semantic smoothness, editing distance of part of speech sequences and intention characteristics.
For the meaning of each feature representation and the specific extraction process, reference is made to the description below, and the detailed description is omitted here.
S103, the acoustic level features and/or the semantic level features are used as input and are processed by a pre-established voice discrimination model, and whether the awakening voice data is mistaken awakening data or not is determined.
After the acoustic level features and/or the semantic level features are extracted from the voice data, model processing can be carried out by utilizing a pre-established voice discrimination model to determine whether the awakening voice data is mistaken awakening data or not, and if the awakening voice data is mistaken awakening data, the awakening voice data can be classified as counterexample voice data; if not, it can continue as regular voice data.
Taking the case where the current wake-up model comprises a current foreground wake-up model and a current background wake-up model as an example, the following briefly describes the process of optimizing the current wake-up model using the wake-up voice data from which the false wake-up data has been screened out.
As can be appreciated, the foreground wake-up model is used to describe the wake-up word, and the model training may be performed using the speech data containing the wake-up word; the background awakening model is used for describing non-awakening words, and the model training can be carried out by adopting voice data without the awakening words.
When the wake-up model is optimized, the current foreground wake-up model can be updated using the wake-up voice data from which the false wake-up data has been screened out; the current background wake-up model can be updated with counter-example voice data, e.g., voice data that failed to wake up the terminal and false wake-up data. In this way, the separation between the two models can be increased, improving the speech recognition accuracy of the updated wake-up model. The specific optimization process can be implemented with reference to the related art and is not detailed here.
As an example, only the current foreground wake-up model may be updated, i.e. the current foreground wake-up model may be optimized using only wake-up speech data that has been screened for false wake-up data. The model updating mode may be determined specifically in combination with actual application requirements, and the scheme of the present disclosure may not be limited thereto.
The following explains the acoustic level features and semantic level features in the present disclosure, respectively.
1. Acoustic level features
(1) Acoustic score of current wake-up model for reflecting recognition accuracy of wake-up word
As an example, the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data may be obtained; if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition; and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
For example, the speech unit may be embodied as a basic recognition unit of the current wake model, such as a phoneme, a syllable, etc., which may not be specifically limited by the present disclosure.
Taking syllables as the voice units as an example, the wake-up word "ding-dong" can be divided into 4 voice units: "ding", "dong", "ding" and "dong". If N is 3, then for the first voice unit "ding", the recognition probabilities output by the current wake-up model for that unit can be obtained, and the 3 results with the highest probability are taken as the recognition result of the unit "ding"; if the correct pronunciation "ding" exists among these 3 recognition results, the recognition result of the voice unit is judged to be correct. By analogy, the recognition results of the other 3 voice units are obtained, and the recognition accuracy of the wake-up voice data is then calculated, namely the ratio between the number of correctly recognized voice units and the total number of voice units, which serves as the acoustic score of the current wake-up model.
It can be understood that the value of N in the present disclosure may be N ≧ 1, which may be specifically set in combination with practical application requirements, and the present disclosure is not limited to this.
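As a minimal Python sketch of this top-N recognition-accuracy statistic (the top-N lists and the value N = 3 are assumed example data, not the output of any particular wake-up model):

    # Hypothetical example of the top-N recognition-accuracy feature.
    N = 3

    def acoustic_score(per_unit_topn, correct_units):
        """per_unit_topn: top-N recognition results per voice unit;
        correct_units: the correct pronunciation of each unit of the wake-up word."""
        hits = sum(1 for topn, ref in zip(per_unit_topn, correct_units) if ref in topn[:N])
        return hits / len(correct_units)

    # 4 voice units of "ding dong ding dong"; the third unit is misrecognized.
    topn = [["ding", "ting", "din"], ["dong", "tong", "don"],
            ["ping", "ting", "bing"], ["dong", "don", "tong"]]
    print(acoustic_score(topn, ["ding", "dong", "ding", "dong"]))  # -> 0.75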
(2) Original features of the wake-up speech data, e.g. fundamental frequency mean, short-time average energy, short-time zero-crossing rate
Generally, when a person speaks, the voice signal can be classified into unvoiced sound and voiced sound according to whether the vocal cords vibrate. Voiced sounds carry most of the energy in speech and exhibit obvious periodicity in the time domain; unvoiced sounds resemble white noise and have no obvious periodicity. When a voiced sound is produced, the airflow passes through the glottis and causes the vocal cords to vibrate in a relaxation-oscillation manner, generating a quasi-periodic excitation pulse train; the frequency of this vocal cord vibration is called the fundamental tone frequency, or fundamental frequency for short. The fundamental frequency generally depends on the individual's vocal cords, pronunciation habits and the like, and can reflect personal characteristics to some extent.
As an example, the way of extracting the fundamental frequency mean value can be embodied as: and performing frame processing on the awakening voice data to obtain a plurality of voice data frames, then extracting the base frequency corresponding to each frame, and further calculating the base frequency average value corresponding to the awakening voice data by using the base frequency corresponding to each frame.
In addition, it should be noted that the short-time average energy can be used as a characteristic parameter for distinguishing unvoiced sound from voiced sound; when the signal-to-noise ratio is high, it can also be used as a characteristic parameter for distinguishing speech segments from silence.
The short-time zero-crossing rate refers to the number of times the waveform of the speech signal crosses the horizontal axis (zero level) within one frame of speech data. Generally, the energy of voiced sound is concentrated in the low frequency band while the energy of unvoiced sound is concentrated in the high frequency band, so the zero-crossing rate reflects frequency to some extent: voiced segments have a lower zero-crossing rate and unvoiced segments have a higher zero-crossing rate.
The method for obtaining the fundamental frequency mean value, the short-time average energy and the short-time zero-crossing rate in the scheme of the disclosure is not limited, and can be realized by referring to the related technology, and the details are not described here.
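A minimal Python/NumPy sketch of these frame-level computations is given below; the frame length, hop size and the simple autocorrelation pitch estimate are assumptions for the example, and any standard pitch tracker could be substituted:

    import numpy as np

    FRAME = 400  # assumed 25 ms frames at 16 kHz
    HOP = 160    # assumed 10 ms hop

    def frames(x):
        return [x[i:i + FRAME] for i in range(0, len(x) - FRAME + 1, HOP)]

    def short_time_energy(frame):
        return float(np.mean(frame ** 2))

    def zero_crossing_rate(frame):
        return float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)

    def f0_autocorr(frame, fs=16000, fmin=60, fmax=400):
        # crude autocorrelation pitch estimate; a placeholder, not the patent's method
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(fs / fmax), int(fs / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))
        return fs / lag

    x = np.sin(2 * np.pi * 120 * np.arange(16000) / 16000)  # toy 120 Hz "voiced" signal
    f0_mean = np.mean([f0_autocorr(f) for f in frames(x)])
    print(f0_mean, short_time_energy(frames(x)[0]), zero_crossing_rate(frames(x)[0]))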
(3) Unvoiced/voiced sequence feature, reflecting the unvoiced/voiced characteristics of the phonemes in the wake-up speech data
As an example, at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate can be taken as input and, after processing by a pre-constructed unvoiced/voiced classifier, the unvoiced/voiced sequence {a1, a2, …, ai, …, am} of the wake-up speech data is output, where ai represents the unvoiced/voiced category corresponding to the i-th phoneme of the wake-up speech data; the similarity between the unvoiced/voiced sequence of the wake-up speech data and the unvoiced/voiced sequence of the wake-up word corresponding to the wake-up speech data is then calculated as the unvoiced/voiced sequence feature.
For example, the unvoiced/voiced categories of phonemes may be unvoiced and voiced, which may be represented by "0" and "1" respectively; the present disclosure does not specifically limit this.
It can be understood that the intelligent terminal may only store one wake-up word, that is, know in advance what the wake-up word corresponding to the wake-up voice data is; or, the intelligent terminal may store a plurality of wake-up words, and for this, what the wake-up word corresponding to the wake-up voice data is may be identified by using the current wake-up model, which is not specifically limited in the present disclosure.
As an example, the unvoiced and turbid sequence of the wakeup word may be stored in the intelligent terminal and directly read when the similarity needs to be calculated; or, when the similarity needs to be calculated, the unvoiced and voiced sequence of the wakeup word can be determined in real time by using the unvoiced and voiced classifier, which may not be specifically limited by the scheme of the present disclosure.
As an example, calculating the similarity of the unvoiced/voiced sequences may be embodied as: calculating the similarity by means of an exclusive-or (XOR) operation, wherein if the unvoiced/voiced categories of the phonemes at corresponding positions are the same, for example both represented by "1", the XOR result for that position is 0; otherwise the XOR result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the smaller the number of non-zero results, the higher the similarity.
It is to be understood that the surd-turbid classifier in the present disclosure may employ a common classification model, for example, a support vector machine model, a neural network model, etc., and this may not be particularly limited in the present disclosure.
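A minimal Python sketch of this XOR-style comparison follows; the classifier itself is omitted and the label sequences are assumed example data. The tone sequence feature in item (4) below can be computed in exactly the same way over tone labels:

    def sequence_mismatch(seq, ref):
        """Count position-wise mismatches (the non-zero XOR results);
        fewer mismatches means higher similarity. Assumes equal lengths
        after alignment of the utterance against the wake-up word."""
        assert len(seq) == len(ref)
        return sum(1 for a, b in zip(seq, ref) if a != b)

    # assumed unvoiced/voiced labels (0 = unvoiced, 1 = voiced) per phoneme
    wake_utterance = [1, 1, 0, 1, 1, 0, 1, 1]
    wake_word_ref  = [1, 1, 0, 1, 1, 1, 1, 1]
    print(sequence_mismatch(wake_utterance, wake_word_ref))  # -> 1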
(4) Tone sequence features reflecting tone characteristics of syllables in the wake-up speech data
As an example, at least one of the fundamental frequency mean, the short-time average energy and the short-time zero-crossing rate can be taken as input and, after processing by a pre-constructed tone classifier, the tone sequence {b1, b2, …, bj, …, bn} of the wake-up speech data is output, where bj represents the tone category corresponding to the j-th syllable of the wake-up speech data; the similarity between the tone sequence of the wake-up speech data and the tone sequence of the wake-up word corresponding to the wake-up speech data is then calculated as the tone sequence feature.
Taking Chinese as an example, the tone category of a syllable can be represented by 4 common tones, and the identifiers "1", "2", "3" and "4" can be used to represent different tones; alternatively, the tone category of a syllable may also be determined in combination with other languages, which may not be specifically limited by the present disclosure.
As can be seen from the above description, no matter the intelligent terminal stores one wake-up word or stores a plurality of wake-up words, the wake-up word corresponding to the wake-up voice data can be determined, and then the tone sequence of the wake-up word is obtained, which can be specifically described with reference to the characteristics of the unvoiced and turbid sequence in (3), and details are not described here.
As an example, calculating the similarity of the tone sequences may be embodied as: calculating the similarity by means of an exclusive-or operation, wherein if the tone categories of the syllables at corresponding positions are the same, for example both being the fourth Chinese tone represented by "4", the XOR result for that position is 0; otherwise the XOR result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the smaller the number of non-zero results, the higher the similarity.
It is to be understood that the pitch classifier in the present disclosure may employ a common classification model, for example, a support vector machine model, a neural network model, etc., and the present disclosure is not limited thereto.
(5) Time characteristics of voice unit for reflecting abnormal condition of awakening voice data during voice unit segmentation
As an example, the voice recognition result obtained by the current wake-up model may be based on a forced segmentation of the wake-up voice data to obtain the start time and the end time of each voice unit, and further obtain the duration of each voice unit; the time mean and time variance can be calculated as the time characteristics of the speech units using the duration of each speech unit.
In general, the time characteristic of the phonetic unit can reflect the abnormal condition of the phonetic unit in the segmentation process, for example, the duration of the individual phonetic unit is too long or too short, which does not conform to the normal speech form. As an example, the speech units may be embodied as phonemes, syllables, etc., which may not be specifically limited by the present disclosure.
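A minimal Python sketch, assuming the per-unit boundaries have already been obtained from the forced segmentation; the boundary values below are made-up example data:

    import statistics

    def time_features(unit_boundaries):
        """unit_boundaries: (start_s, end_s) per voice unit from forced segmentation."""
        durations = [end - start for start, end in unit_boundaries]
        return statistics.mean(durations), statistics.pvariance(durations)

    # assumed segmentation into 4 syllables; the third unit is abnormally long
    print(time_features([(0.30, 0.52), (0.52, 0.78), (0.78, 1.60), (1.60, 1.85)]))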
(6) Voiceprint features for reflecting physiological and behavioral features of a speaker
As an example, i-vector features of the wake-up speech data can be extracted as voiceprint features using a pre-constructed voiceprint extraction model. For example, voiceprint features can be extracted by using a voiceprint extraction model such as DNN I-Vector, GMM-UBM I-Vector, etc., which may not be specifically limited in the present disclosure.
It can be understood that the voiceprint feature reflects personalized features of the speaker, and generally, the voiceprint feature of the speaker does not change in a short time, so that the i-vector feature can be extracted from the control voice data or the whole voice data including the wake-up voice data and the control voice data by using a voiceprint extraction model, which is not particularly limited in the present disclosure.
(7) Energy distribution characteristics for reflecting characteristics of wake-up interaction process
As an example, the voice data may be segmented into three parts ct-1, ct and ct+1, and the average energy distribution of each part is counted; for example, the average energy distributions of the three parts can be expressed as gt-1, gt and gt+1, giving the energy distribution features.
As an example, the way of extracting the energy distribution feature may be embodied as: and respectively carrying out frame processing on the 3 parts of voice data to obtain a voice data frame included in each part, then extracting energy corresponding to each frame, and further calculating the average energy of each part by using the energy corresponding to each frame.
In connection with the wake-up interaction example mentioned above, "sil ding-dong sp I want to listen to Liu Dehua's music", the voice data can be divided into 3 parts by recognizing the wake-up word. Here, ct represents the wake-up voice data; ct-1 represents the voice data set collected before the wake-up voice data, typically a silence segment or background noise; and ct+1 represents the voice data set collected after the wake-up voice data, typically the short pause and the operation intention. Understandably, the durations of ct-1 and ct+1 may be determined flexibly, for example according to VAD (Voice Activity Detection) information, or may be set to a fixed duration, for example 1 s to 5 s; this disclosure does not limit this.
Compared with the case where the intelligent terminal is falsely woken up because the wake-up word happens to appear in everyday conversation, for example "I think ding-dong is a good name", the energy distribution of a genuine wake-up interaction according to the scheme of the disclosure is obviously different.
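A minimal Python/NumPy sketch of this three-part energy statistic; the wake-up segment boundaries and the synthetic signal are assumptions for the example:

    import numpy as np

    def energy_distribution(x, wake_start, wake_end, fs=16000):
        """Split the recording into c_{t-1}, c_t, c_{t+1} around the wake-up segment
        and return the average energy g_{t-1}, g_t, g_{t+1} of each part."""
        s, e = int(wake_start * fs), int(wake_end * fs)
        parts = [x[:s], x[s:e], x[e:]]
        return [float(np.mean(p ** 2)) if len(p) else 0.0 for p in parts]

    fs = 16000
    x = np.concatenate([0.01 * np.random.randn(fs),      # c_{t-1}: background noise
                        0.5 * np.random.randn(fs),       # c_t: wake-up word
                        0.3 * np.random.randn(2 * fs)])  # c_{t+1}: pause + intent
    print(energy_distribution(x, 1.0, 2.0))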
2. Semantic level features
(1) Semantic smoothness
As an example, the speech data may be segmented into words to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk represents the k-th word of the speech data; then, the probability that the f words appear in the order of the word sequence is calculated as the semantic smoothness.
For example, the semantic smoothness in the present disclosure may be embodied as the forward semantic smoothness P(w1, w2, …, wf) in the direction from w1 to wf, and/or the reverse semantic smoothness P(wf, wf-1, …, w1) in the direction from wf to w1. Taking the forward semantic smoothness as an example, it can be calculated by the following formula:
P(w1, w2, …, wf) = P(w1) · P(w2|w1) · P(w3|w2) · … · P(wf|wf-1)
where P(wk|wk-1) may be obtained based on statistics of the sample speech data used in model training.
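A minimal Python sketch of the forward semantic smoothness under a bigram model; the probabilities are assumed toy values rather than statistics from real training data:

    import math

    def forward_smoothness(words, bigram_logp, unigram_logp):
        """log P(w1..wf) under a bigram model; unseen pairs get a small floor."""
        logp = unigram_logp.get(words[0], math.log(1e-4))
        for prev, cur in zip(words, words[1:]):
            logp += bigram_logp.get((prev, cur), math.log(1e-4))
        return logp

    unigram_logp = {"i": math.log(0.05)}
    bigram_logp = {("i", "want"): math.log(0.2), ("want", "to"): math.log(0.5),
                   ("to", "listen"): math.log(0.1)}
    print(forward_smoothness(["i", "want", "to", "listen"], bigram_logp, unigram_logp))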
(2) Edit distance of part-of-speech sequence
As an example, the speech data may be segmented into words to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk represents the part of speech of the k-th word of the speech data; the edit distance between the part-of-speech sequence of the speech data and the part-of-speech sequence of each sample speech data is then calculated, and the minimum edit distance among them is selected as the edit distance of the part-of-speech sequence. Here, the sample speech data is the data participating in training the speech discrimination model.
The part-of-speech sequence can reflect semantic information to a certain extent, and this is particularly notable in the wake-up interaction process. In the present disclosure, the part-of-speech sequence feature may be embodied as the edit distance of the part-of-speech sequence, i.e., the minimum number of editing operations required to transform one sequence into the other; generally, the smaller the edit distance, the greater the similarity between the two sequences.
If the part-of-speech sequence of a sample speech data is denoted as {p1, p2, …, ph}, the edit distance d[f, h] between {q1, q2, …, qf} and {p1, p2, …, ph} can be calculated with the standard recursion:
d[i, 0] = i, d[0, j] = j,
d[i, j] = min( d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + δ(qi, pj) ),
where δ(qi, pj) = 0 if qi = pj and 1 otherwise.
It is to be understood that the sample speech data may be all data that participate in the speech discrimination model training; or, the regular data screened from all the data may be used as sample voice data to calculate the edit distance, which may not be specifically limited in the present disclosure, as long as the minimum edit distance is determined.
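A minimal Python sketch of this feature, using the standard Levenshtein recursion above; the part-of-speech tags are assumed toy values:

    def edit_distance(q, p):
        """Levenshtein distance between two part-of-speech sequences."""
        f, h = len(q), len(p)
        d = [[0] * (h + 1) for _ in range(f + 1)]
        for i in range(f + 1):
            d[i][0] = i
        for j in range(h + 1):
            d[0][j] = j
        for i in range(1, f + 1):
            for j in range(1, h + 1):
                cost = 0 if q[i - 1] == p[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[f][h]

    def pos_sequence_feature(query_pos, sample_pos_sequences):
        # minimum edit distance against all sample part-of-speech sequences
        return min(edit_distance(query_pos, p) for p in sample_pos_sequences)

    # assumed toy part-of-speech tags
    print(pos_sequence_feature(["r", "v", "v", "n"],
                               [["r", "v", "n"], ["r", "v", "u", "n"]]))  # -> 1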
(3) Intention characteristics
As an example, the intention features of the control speech data may be extracted using a pre-constructed intention analysis model, and the intention features include explicit intentions or no explicit intentions, or the intention features include intention categories corresponding to the control speech data.
The disclosed solution may pre-construct an intent analysis model for determining operation intent tendencies. For example, the intent analysis model may be embodied as a classifier whose output indicates an explicit intent or no explicit intent; or, the intent analysis model may be embodied as a regression model that outputs scores for the various intent categories, and the intent category corresponding to the control voice data may be determined according to the scores, for example, the M intent categories with the highest scores are taken as the intent categories corresponding to the control voice data. The value of M in the present disclosure may be M ≥ 1, which may be set according to actual application requirements and is not limited by the present disclosure. For example, the intent category may be playing music, querying the weather, etc., depending on the actual application requirements.
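A minimal Python sketch of turning model scores into the intent feature; the score values, the threshold and the category names are assumptions for the example, and the intent analysis model itself is not shown:

    def intent_feature(scores, top_m=1, threshold=0.5):
        """scores: {intent_category: score} from an intent analysis model (assumed values).
        Returns the top-M categories plus a flag for 'has explicit intent'."""
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        top = [category for category, _ in ranked[:top_m]]
        has_explicit_intent = bool(ranked) and ranked[0][1] >= threshold
        return top, has_explicit_intent

    print(intent_feature({"play_music": 0.82, "query_weather": 0.10, "other": 0.08}))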
The following explains the process of constructing the speech discrimination model in the present disclosure. Referring specifically to the flowchart shown in fig. 2, the method may include the following steps:
s201, sample voice data is collected, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises error awakening data and voice data failed to be awakened.
During model training, a large amount of sample voice data can be collected, wherein the sample voice data can be embodied as sample awakening voice data and sample control voice data. In addition, data type labeling can be performed on the sample awakening voice data, for example, the data type can be positive example awakening voice data and negative example awakening voice data, and for the negative example awakening voice data, the data type can be further carefully labeled as false awakening data and voice data with awakening failure.
S202, extracting the acoustic level features and/or semantic level features of the sample voice data.
The specific implementation process can be described by reference to the above, and is not detailed here.
S203, determining the topological structure of the voice discrimination model.
As an example, the topology in the present disclosure can be embodied as a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), or the like, which is not particularly limited in this disclosure.
As an example, the output layer of the neural network may include 2 output nodes, which respectively represent the positive example wake-up voice data and the false wake-up data, for example, "0" may be used to represent the positive example wake-up voice data, and "1" may be used to represent the false wake-up data. Alternatively, the output layer of the neural network may contain 1 output node, representing the probability that the wake-up voice data is determined to be false wake-up data. The specific representation form of the neural network in the scheme of the disclosure can not be limited.
S204, training the voice discrimination model by using the topological structure and the acoustic level features and/or semantic level features of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
After the topological structure of the model has been determined and the acoustic level features and/or semantic level features of the sample voice data have been extracted, model training can be performed. As an example, the training process may adopt a cross-entropy criterion and update and optimize the model parameters by common stochastic gradient descent, so that when the model training is completed, the data type of the sample awakening voice data output by the model is the same as the labeled data type.
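A minimal training sketch under the cross-entropy criterion with stochastic gradient descent, assuming binary labels (1 for false awakening data, 0 for positive example awakening data); the feature dimension, learning rate, batch size and epoch count are illustrative assumptions.

```python
# Training sketch: binary cross entropy with stochastic gradient descent.
# Labels: 1 = false awakening data, 0 = positive example awakening data.
# Feature dimension, learning rate and epoch count are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
criterion = nn.BCELoss()                              # cross-entropy criterion
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

features = torch.randn(32, 64)                        # stand-in acoustic/semantic features
labels = torch.randint(0, 2, (32, 1)).float()         # stand-in data-type labels

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```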
As an example, the speech discrimination model may be a generic model, i.e. not built for a certain or some specific wake words; or, the voice discrimination model may be an individualized model, that is, different voice discrimination models are constructed for different wake-up words, which may not be specifically limited by the present disclosure.
Referring to fig. 3, a schematic diagram of the voice data processing apparatus of the present disclosure is shown. The apparatus may include:
the voice data acquisition module 301 is configured to acquire voice data input by a user, where the voice data includes wake-up voice data for successfully waking up the intelligent terminal and control voice data representing an operation intention;
a feature extraction module 302, configured to extract acoustic level features and/or semantic level features of the voice data, where the acoustic level features are used to represent pronunciation features of a user, and the semantic level features are used to represent text features of the voice data;
the model processing module 303 is configured to determine whether the awakening voice data is false awakening data after the acoustic level feature and/or the semantic level feature are/is used as input and processed by a pre-established voice discrimination model.
Optionally, the voice data acquiring module is configured to determine whether at least two pieces of voice data for waking up the intelligent terminal are continuously acquired within a preset time period; if at least two voice data used for awakening the intelligent terminal are continuously collected in the preset time period, and the score value d of the at least two voice data used for awakening the intelligent terminal after being processed by the current awakening model satisfies d_2 ≤ d < d_1, the at least two voice data for waking up the intelligent terminal are determined as the wake-up voice data, where d_1 is a first wake-up score threshold and d_2 is a second wake-up score threshold.
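A minimal sketch of the score-window check described above: voice data collected consecutively within the preset time period whose wake-up scores d satisfy d_2 ≤ d < d_1 would be treated as the awakening voice data to be passed to the discrimination model. The threshold values used here are illustrative assumptions.

```python
# Sketch of the score-window check: at least two consecutive wake-up attempts
# within the preset period, each with d_2 <= d < d_1, are kept as awakening
# voice data for further discrimination. Threshold values are illustrative.
def needs_discrimination(scores, d1=0.9, d2=0.5):
    """scores: wake-up model scores of voice data collected consecutively
    within the preset time period."""
    return len(scores) >= 2 and all(d2 <= d < d1 for d in scores)

print(needs_discrimination([0.62, 0.71]))  # True: two borderline wake-up attempts
print(needs_discrimination([0.95]))        # False: a single confident wake-up
```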
Optionally, the acoustic level features include an acoustic score of a current arousal model,
the feature extraction module is configured to obtain the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition; and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
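A minimal sketch of the acoustic-score computation described above: a voice unit counts as correctly recognized if its correct pronunciation appears among the top-N recognition results of the current awakening model, and the recognition accuracy over all units serves as the acoustic score. The example data are hypothetical.

```python
# Sketch of the acoustic score: a voice unit is correctly recognized if its
# correct pronunciation appears among the top-N results of the current
# awakening model; the accuracy over all units is the acoustic score.
def acoustic_score(top_n_results, correct_pronunciations):
    correct = sum(
        1 for results, truth in zip(top_n_results, correct_pronunciations)
        if truth in results
    )
    return correct / len(correct_pronunciations)

# Hypothetical top-3 results for three voice units of a wake-up word.
top_n = [["d", "t", "n"], ["ing", "in", "eng"], ["x", "sh", "s"]]
truth = ["d", "ing", "zh"]
print(acoustic_score(top_n, truth))  # 0.666... -> acoustic score of the current model
```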
Optionally, the acoustic level features further include at least one of a fundamental frequency mean, a short-time mean energy, and a short-time zero-crossing rate;
and/or,
the acoustic level features further comprise a voiced-unvoiced sequence feature, and the feature extraction module is used for taking at least one of a fundamental frequency mean value, a short-time average energy and a short-time zero crossing rate as input and, after processing by a pre-constructed voiced-unvoiced classifier, outputting the voiced-unvoiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data, where a_i represents the voiced/unvoiced category corresponding to the ith phoneme of the awakening voice data; calculating the similarity between the voiced-unvoiced sequence of the awakening voice data and the voiced-unvoiced sequence of the awakening word corresponding to the awakening voice data to serve as the voiced-unvoiced sequence feature (a simple similarity sketch is given after this list of features);
and/or,
the acoustic level features further comprise tone sequence features, and the feature extraction module is used for taking at least one of a fundamental frequency mean value, a short-time average energy and a short-time zero crossing rate as input and, after processing by a pre-constructed tone classifier, outputting the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data, where b_j represents the tone category corresponding to the jth syllable of the awakening voice data; calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data to serve as the tone sequence characteristic;
and/or,
the acoustic level features further include time features of voice units, and the feature extraction module is configured to count a duration of each voice unit of the wake-up voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or,
the acoustic level features further comprise voiceprint features, and the feature extraction module is used for extracting i-vector features of the awakening voice data by using a pre-constructed voiceprint extraction model to serve as the voiceprint features;
and/or,
the acoustic level features further comprise energy distribution features, and the feature extraction module is used for dividing the voice data into three parts c_{t-1}, c_t and c_{t+1} and counting the average energy distribution of each part as the energy distribution features; wherein c_t represents the awakening voice data, c_{t+1} represents a voice data set, collected after the awakening voice data, that comprises the control voice data, and c_{t-1} represents a voice data set collected before the awakening voice data.
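The disclosure does not prescribe a specific similarity measure for the voiced-unvoiced and tone sequence features above; as one simple assumption, a position-wise match ratio could be used, as sketched below.

```python
# One possible (assumed) similarity measure between the voiced-unvoiced or
# tone sequence of the awakening voice data and that of the wake-up word:
# a simple position-wise match ratio.
def sequence_similarity(observed, reference):
    if not observed or not reference:
        return 0.0
    matches = sum(1 for o, r in zip(observed, reference) if o == r)
    return matches / max(len(observed), len(reference))

observed_uv = ["voiced", "unvoiced", "voiced", "voiced"]  # from the classifier
reference_uv = ["voiced", "voiced", "voiced", "voiced"]   # wake-word template
print(sequence_similarity(observed_uv, reference_uv))     # 0.75
```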
Optionally, the semantic level features include semantic smoothness, and the feature extraction module is configured to perform word segmentation processing on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, where w_k represents the kth word of the speech data; and to calculate the probability of the f words appearing in sequence, according to the order of the word sequence, as the semantic smoothness;
and/or,
the semantic level features comprise the editing distance of a part-of-speech sequence, and the feature extraction module is used for performing word segmentation processing on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, where q_k represents the part of speech of the kth word of the speech data; calculating the editing distance between the part-of-speech sequence of the speech data and the part-of-speech sequence of each sample speech data, and selecting the minimum editing distance among them as the editing distance of the part-of-speech sequence, wherein the sample speech data is data participating in training the voice discrimination model (a minimal edit-distance sketch follows this list);
and/or,
the semantic level features comprise intention features, and the feature extraction module is used for extracting the intention features of the control voice data by utilizing a pre-constructed intention analysis model, wherein the intention features comprise clear intentions or no clear intentions, or the intention features comprise intention categories corresponding to the control voice data.
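As a minimal, assumption-laden sketch of the part-of-speech edit-distance feature mentioned above: a standard Levenshtein edit distance is computed between the part-of-speech sequence of the input voice data and that of each sample voice data, and the minimum is taken; the part-of-speech tags shown are hypothetical.

```python
# Levenshtein edit distance between part-of-speech sequences; the minimum
# distance over all sample voice data becomes the part-of-speech edit-distance
# feature. The POS tags below are hypothetical.
def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

pos_input = ["v", "n", "u", "n"]                              # POS sequence of the input voice data
pos_samples = [["v", "n"], ["v", "n", "n"], ["r", "v", "n"]]  # POS sequences of sample voice data
print(min(edit_distance(pos_input, s) for s in pos_samples))  # 1 -> edit-distance feature
```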
Optionally, the apparatus further comprises:
a sample voice data collection module, configured to collect sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises false awakening data and voice data failed to be awakened;
the sample feature extraction module is used for extracting the acoustic level features and/or the semantic level features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice discrimination model;
and the model training module is used for training the voice discrimination model by utilizing the topological structure and the acoustic level characteristics and/or the semantic level characteristics of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
Optionally, the apparatus further comprises:
and the model optimization module is used for optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 4, a schematic structural diagram of an electronic device 400 for voice data processing according to the present disclosure is shown. Referring to fig. 4, electronic device 400 includes a processing component 401 that further includes one or more processors, and storage resources, represented by storage medium 402, for storing instructions, such as application programs, that are executable by processing component 401. The application stored in the storage medium 402 may include one or more modules that each correspond to a set of instructions. Further, the processing component 401 is configured to execute instructions to perform the above-described voice data processing method.
Electronic device 400 may also include a power component 403 configured to perform power management of electronic device 400; a wired or wireless network interface 404 configured to connect the electronic device 400 to a network; and an input/output (I/O) interface 405. The electronic device 400 may operate based on an operating system stored on the storage medium 402, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The preferred embodiments of the present disclosure have been described in detail with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner; in order to avoid unnecessary repetition, the possible combinations are not described separately in the present disclosure.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (16)

1. A method of processing speech data, the method comprising:
acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data or not after processing by a pre-constructed voice discrimination model; wherein the data type of the sample awakening voice data used for constructing the voice discrimination model comprises positive example awakening voice data or negative example awakening voice data.
2. The method of claim 1, wherein the wake-up voice data is obtained by:
judging whether at least two pieces of voice data for awakening the intelligent terminal are continuously acquired within a preset time period;
if at least two voice data used for awakening the intelligent terminal are continuously collected in the preset time period, and the score value d of the at least two voice data used for awakening the intelligent terminal after being processed by the current awakening model satisfies d_2 ≤ d < d_1, determining the at least two voice data for waking up the intelligent terminal as the wake-up voice data, where d_1 is a first wake-up score threshold and d_2 is a second wake-up score threshold.
3. The method of claim 1 or 2, wherein the acoustic level features comprise an acoustic score of a current arousal model, and extracting the acoustic level features of the speech data comprises:
acquiring the first N recognition results output by the current awakening model aiming at each voice unit of the awakening voice data;
if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition;
and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
4. The method of claim 3,
the acoustic level features further comprise at least one of a fundamental frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or,
if the acoustic level features further include the unvoiced and voiced sequence features, extracting the acoustic level features of the voice data includes: at least one of the fundamental frequency mean value, the short-time average energy and the short-time zero crossing rate is used as input, and after the input is processed by a pre-constructed unvoiced and voiced classifier, the unvoiced and voiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data is output, where a_i represents the voiced category corresponding to the ith phoneme of the awakening voice data; calculating the similarity between the voiced and unvoiced sequence of the awakening voice data and the voiced and unvoiced sequence of the awakening word corresponding to the awakening voice data to serve as the voiced and unvoiced sequence feature;
and/or,
if the acoustic level features further include pitch sequence features, extracting the acoustic level features of the speech data includes: at least one of the fundamental frequency mean value, the short-time average energy and the short-time zero crossing rate is used as input, and after the input is processed by a pre-constructed tone classifier, the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data is output, where b_j represents the tone category corresponding to the jth syllable of the wake-up voice data; calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data to serve as the tone sequence characteristic;
and/or,
the acoustic level features further include time features of a speech unit, and extracting the acoustic level features of the speech data includes: counting the duration of each voice unit of the awakening voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or,
the acoustic level features further include voiceprint features, and extracting the acoustic level features of the speech data includes: extracting i-vector characteristics of the awakening voice data by using a pre-constructed voiceprint extraction model to serve as the voiceprint characteristics;
and/or,
if the acoustic level features further include energy distribution features, extracting the acoustic level features of the speech data includes: dividing the voice data into three parts c_{t-1}, c_t and c_{t+1}, and counting the average energy distribution of each part as the energy distribution features; wherein c_t represents the wake-up speech data, c_{t+1} represents a voice data set, collected after the wake-up voice data, that comprises the control voice data, and c_{t-1} represents a voice data set collected prior to the wake-up voice data.
5. The method according to claim 1 or 2,
if the semantic level features include semantic smoothness, extracting the semantic level features of the voice data includes: performing word segmentation processing on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, where w_k represents the kth word of the speech data; calculating the probability of the f words appearing in sequence, according to the order of the word sequence, as the semantic smoothness;
and/or,
if the semantic level features include the edit distance of the part-of-speech sequence, extracting the semantic level features of the voice data includes: performing word segmentation processing on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, where q_k represents the part of speech of the kth word of the speech data; calculating the editing distance between the part-of-speech sequence of the speech data and the part-of-speech sequence of each sample voice data, and selecting the minimum editing distance from the editing distances as the editing distance of the part-of-speech sequence, wherein the sample voice data is data participating in training the voice discrimination model;
and/or,
if the semantic level features include intent features, extracting the semantic level features of the voice data includes: and extracting intention characteristics of the control voice data by using a pre-constructed intention analysis model, wherein the intention characteristics comprise clear intentions or no clear intentions, or the intention characteristics comprise intention categories corresponding to the control voice data.
6. The method according to claim 1 or 2, wherein the speech discrimination model is constructed by:
collecting sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises false awakening data and voice data failed to be awakened;
extracting acoustic level features and/or semantic level features of the sample voice data;
determining a topological structure of the voice discrimination model;
and training the voice discrimination model by using the topological structure and the acoustic level features and/or semantic level features of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
7. The method according to claim 1 or 2, characterized in that the method further comprises:
and optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
8. A speech data processing apparatus, characterized in that the apparatus comprises:
the voice data acquisition module is used for acquiring voice data input by a user, wherein the voice data comprises awakening voice data for successfully awakening the intelligent terminal and control voice data representing operation intention;
the feature extraction module is used for extracting acoustic level features and/or semantic level features of the voice data, wherein the acoustic level features are used for representing pronunciation features of a user, and the semantic level features are used for representing text features of the voice data;
the model processing module is used for taking the acoustic level features and/or the semantic level features as input, and determining whether the awakening voice data is mistaken awakening data or not after processing by a pre-constructed voice discrimination model; wherein the data type of the sample awakening voice data used for constructing the voice discrimination model comprises positive example awakening voice data or negative example awakening voice data.
9. The apparatus of claim 8,
the voice data acquisition module is used for judging whether at least two pieces of voice data used for awakening the intelligent terminal are continuously acquired within a preset time period; if at least two voice data used for awakening the intelligent terminal are continuously collected in the preset time period, and the score value d of the at least two voice data used for awakening the intelligent terminal after being processed by the current awakening model satisfies d_2 ≤ d < d_1, determining the at least two voice data for waking up the intelligent terminal as the wake-up voice data, where d_1 is a first wake-up score threshold and d_2 is a second wake-up score threshold.
10. The apparatus of claim 8 or 9, wherein the acoustic level features comprise an acoustic score of a current arousal model,
the feature extraction module is configured to obtain the first N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the first N recognition results of each voice unit contain the correct pronunciation of the voice unit, judging that the recognition result of the voice unit is correct for recognition; and counting the recognition accuracy of the awakening voice data according to the recognition result of each voice unit, wherein the recognition accuracy is used as the acoustic score of the current awakening model.
11. The apparatus of claim 10,
the acoustic level features further comprise at least one of a fundamental frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or,
the acoustic level features further comprise a voiced-unvoiced sequence feature, and the feature extraction module is used for taking at least one of a fundamental frequency mean value, a short-time average energy and a short-time zero crossing rate as input and, after processing by a pre-constructed voiced-unvoiced classifier, outputting the voiced-unvoiced sequence {a_1, a_2, …, a_i, …, a_m} of the awakening voice data, where a_i represents the voiced/unvoiced category corresponding to the ith phoneme of the awakening voice data; calculating the similarity between the voiced-unvoiced sequence of the awakening voice data and the voiced-unvoiced sequence of the awakening word corresponding to the awakening voice data to serve as the voiced-unvoiced sequence feature;
and/or,
the acoustic level features further comprise tone sequence features, and the feature extraction module is used for taking at least one of a fundamental frequency mean value, a short-time average energy and a short-time zero crossing rate as input and, after processing by a pre-constructed tone classifier, outputting the tone sequence {b_1, b_2, …, b_j, …, b_n} of the awakening voice data, where b_j represents the tone category corresponding to the jth syllable of the wake-up voice data; calculating the similarity between the tone sequence of the awakening voice data and the tone sequence of the awakening word corresponding to the awakening voice data to serve as the tone sequence characteristic;
and/or,
the acoustic level features further include time features of voice units, and the feature extraction module is configured to count a duration of each voice unit of the wake-up voice data; calculating a time mean value and a time variance as the time characteristics of each voice unit by using the duration of each voice unit;
and/or,
the acoustic level features further comprise voiceprint features, and the feature extraction module is used for extracting i-vector features of the awakening voice data by using a pre-constructed voiceprint extraction model to serve as the voiceprint features;
and/or,
the acoustic level features further comprise energy distribution features, and the feature extraction module is used for dividing the voice data into three parts c_{t-1}, c_t and c_{t+1} and counting the average energy distribution of each part as the energy distribution features; wherein c_t represents the wake-up voice data, c_{t+1} represents a voice data set, collected after the wake-up voice data, that comprises the control voice data, and c_{t-1} represents a voice data set collected prior to the wake-up voice data.
12. The apparatus according to claim 8 or 9,
the semantic level features comprise semantic smoothness, and the feature extraction module is used for performing word segmentation processing on the voice data to obtain a word sequence {w_1, w_2, …, w_k, …, w_f}, where w_k represents the kth word of the speech data; calculating the probability of the f words appearing in sequence, according to the order of the word sequence, as the semantic smoothness;
and/or,
the semantic level features comprise the editing distance of a part-of-speech sequence, and the feature extraction module is used for performing word segmentation processing on the voice data to obtain a part-of-speech sequence {q_1, q_2, …, q_k, …, q_f}, where q_k represents the part of speech of the kth word of the speech data; calculating an editing distance between the part-of-speech sequence of the speech data and the part-of-speech sequence of each sample speech data, and selecting a minimum editing distance from the editing distances as the editing distance of the part-of-speech sequence, wherein the sample speech data is data participating in training the speech discrimination model;
and/or,
the semantic level features comprise intention features, and the feature extraction module is used for extracting the intention features of the control voice data by utilizing a pre-constructed intention analysis model, wherein the intention features comprise clear intentions or no clear intentions, or the intention features comprise intention categories corresponding to the control voice data.
13. The apparatus of claim 8 or 9, further comprising:
a sample voice data collection module, configured to collect sample voice data, wherein the sample voice data comprises sample awakening voice data and sample control voice data, the data type of the sample awakening voice data is marked as positive example awakening voice data or negative example awakening voice data, and the negative example awakening voice data comprises false awakening data and voice data failed to be awakened;
the sample feature extraction module is used for extracting the acoustic level features and/or the semantic level features of the sample voice data;
the topological structure determining module is used for determining the topological structure of the voice discrimination model;
and the model training module is used for training the voice discrimination model by utilizing the topological structure and the acoustic level characteristics and/or the semantic level characteristics of the sample voice data until the data type of the sample awakening voice data output by the voice discrimination model is the same as the labeled data type.
14. The apparatus of claim 8 or 9, further comprising:
and the model optimization module is used for optimizing the current awakening model by utilizing the awakening voice data with the mistaken awakening data screened out.
15. A storage device having stored therein a plurality of instructions, wherein said instructions are loaded by a processor for performing the steps of the method of any of claims 1 to 7.
16. An electronic device, characterized in that the electronic device comprises:
the storage device of claim 15; and
a processor to execute instructions in the storage device.
CN201711364085.1A 2017-12-18 2017-12-18 Voice data processing method and device, storage medium and electronic equipment Active CN108320733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711364085.1A CN108320733B (en) 2017-12-18 2017-12-18 Voice data processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711364085.1A CN108320733B (en) 2017-12-18 2017-12-18 Voice data processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN108320733A CN108320733A (en) 2018-07-24
CN108320733B true CN108320733B (en) 2022-01-04

Family

ID=62893086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711364085.1A Active CN108320733B (en) 2017-12-18 2017-12-18 Voice data processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN108320733B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109119079B (en) * 2018-07-25 2022-04-01 天津字节跳动科技有限公司 Voice input processing method and device
CN108831471B (en) * 2018-09-03 2020-10-23 重庆与展微电子有限公司 Voice safety protection method and device and routing terminal
CN109036412A (en) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 voice awakening method and system
EP3857541B1 (en) * 2018-09-30 2023-07-19 Microsoft Technology Licensing, LLC Speech waveform generation
CN110444210B (en) * 2018-10-25 2022-02-08 腾讯科技(深圳)有限公司 Voice recognition method, awakening word detection method and device
CN111261143B (en) * 2018-12-03 2024-03-22 嘉楠明芯(北京)科技有限公司 Voice wakeup method and device and computer readable storage medium
CN109671435B (en) * 2019-02-21 2020-12-25 三星电子(中国)研发中心 Method and apparatus for waking up smart device
CN110070863A (en) * 2019-03-11 2019-07-30 华为技术有限公司 A kind of sound control method and device
CN110060665A (en) * 2019-03-15 2019-07-26 上海拍拍贷金融信息服务有限公司 Word speed detection method and device, readable storage medium storing program for executing
CN110049107B (en) * 2019-03-22 2022-04-08 钛马信息网络技术有限公司 Internet vehicle awakening method, device, equipment and medium
CN110534098A (en) * 2019-10-09 2019-12-03 国家电网有限公司客户服务中心 A kind of the speech recognition Enhancement Method and device of age enhancing
CN110992940B (en) 2019-11-25 2021-06-15 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and computer-readable storage medium
CN111128155B (en) * 2019-12-05 2020-12-01 珠海格力电器股份有限公司 Awakening method, device, equipment and medium for intelligent equipment
CN111640426A (en) * 2020-06-10 2020-09-08 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN112037772B (en) * 2020-09-04 2024-04-02 平安科技(深圳)有限公司 Response obligation detection method, system and device based on multiple modes
CN112530442B (en) * 2020-11-05 2023-11-17 广东美的厨房电器制造有限公司 Voice interaction method and device
CN115039169A (en) * 2021-01-06 2022-09-09 京东方科技集团股份有限公司 Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
CN112951235B (en) * 2021-01-27 2022-08-16 北京云迹科技股份有限公司 Voice recognition method and device
CN113436615B (en) * 2021-07-06 2023-01-03 南京硅语智能科技有限公司 Semantic recognition model, training method thereof and semantic recognition method
CN117784632B (en) * 2024-02-28 2024-05-14 深圳市轻生活科技有限公司 Intelligent household control system based on offline voice recognition

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106272481A (en) * 2016-08-15 2017-01-04 北京光年无限科技有限公司 The awakening method of a kind of robot service and device
CN106297777A (en) * 2016-08-11 2017-01-04 广州视源电子科技股份有限公司 A kind of method and apparatus waking up voice service up
FI20156000A (en) * 2015-12-22 2017-06-23 Code-Q Oy Speech recognition method and apparatus based on a wake-up call

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727903B (en) * 2008-10-29 2011-10-19 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
US8635237B2 (en) * 2009-07-02 2014-01-21 Nuance Communications, Inc. Customer feedback measurement in public places utilizing speech recognition technology
US9117449B2 (en) * 2012-04-26 2015-08-25 Nuance Communications, Inc. Embedded system for construction of small footprint speech recognition with user-definable constraints
CN102999161B (en) * 2012-11-13 2016-03-02 科大讯飞股份有限公司 A kind of implementation method of voice wake-up module and application
CN103474069B (en) * 2013-09-12 2016-03-30 中国科学院计算技术研究所 For merging the method and system of the recognition result of multiple speech recognition system
CN103943105A (en) * 2014-04-18 2014-07-23 安徽科大讯飞信息科技股份有限公司 Voice interaction method and system
CN104281645B (en) * 2014-08-27 2017-06-16 北京理工大学 A kind of emotion critical sentence recognition methods interdependent based on lexical semantic and syntax
CN104538030A (en) * 2014-12-11 2015-04-22 科大讯飞股份有限公司 Control system and method for controlling household appliances through voice
EP3067884B1 (en) * 2015-03-13 2019-05-08 Samsung Electronics Co., Ltd. Speech recognition system and speech recognition method thereof
CN105096939B (en) * 2015-07-08 2017-07-25 百度在线网络技术(北京)有限公司 voice awakening method and device
CN105702253A (en) * 2016-01-07 2016-06-22 北京云知声信息技术有限公司 Voice awakening method and device
CN106653056B (en) * 2016-11-16 2020-04-24 中国科学院自动化研究所 Fundamental frequency extraction model and training method based on LSTM recurrent neural network
CN106782554B (en) * 2016-12-19 2020-09-25 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN107223280B (en) * 2017-03-03 2021-01-08 深圳前海达闼云端智能科技有限公司 Robot awakening method and device and robot
CN107358951A (en) * 2017-06-29 2017-11-17 阿里巴巴集团控股有限公司 A kind of voice awakening method, device and electronic equipment
CN107464564B (en) * 2017-08-21 2023-05-26 腾讯科技(深圳)有限公司 Voice interaction method, device and equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI20156000A (en) * 2015-12-22 2017-06-23 Code-Q Oy Speech recognition method and apparatus based on a wake-up call
CN106297777A (en) * 2016-08-11 2017-01-04 广州视源电子科技股份有限公司 A kind of method and apparatus waking up voice service up
CN106272481A (en) * 2016-08-15 2017-01-04 北京光年无限科技有限公司 The awakening method of a kind of robot service and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wake-up-word spotting using end-to-end deep neural network system;Shilei Zhang,et al.;《2016 23rd International Conference on Pattern Recognition (ICPR)》;IEEE;20170424;全文 *
智能语音机顶盒的软硬件实现方案 (Software and hardware implementation scheme of an intelligent voice set-top box); 施唯佳 (Shi Weijia) et al.; 《电信科学》 (Telecommunication Science); 中国知网 (CNKI); 20171020; Vol. 33, No. 10; full text *

Also Published As

Publication number Publication date
CN108320733A (en) 2018-07-24

Similar Documents

Publication Publication Date Title
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
US11854545B2 (en) Privacy mode based on speaker identifier
US11657832B2 (en) User presence detection
US11636851B2 (en) Multi-assistant natural language input processing
JP6705008B2 (en) Speaker verification method and system
US20210090575A1 (en) Multi-assistant natural language input processing
CN107767861B (en) Voice awakening method and system and intelligent terminal
US11562739B2 (en) Content output management based on speech quality
JP2018523156A (en) Language model speech end pointing
CN113168832A (en) Alternating response generation
US11302329B1 (en) Acoustic event detection
US11393477B2 (en) Multi-assistant natural language input processing to determine a voice model for synthesized speech
CN114051639A (en) Emotion detection using speaker baseline
CN108536668B (en) Wake-up word evaluation method and device, storage medium and electronic equipment
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
CN108269574B (en) Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment
CN110808050A (en) Voice recognition method and intelligent equipment
CN114155882B (en) Method and device for judging emotion of road anger based on voice recognition
CN111833869B (en) Voice interaction method and system applied to urban brain
US11430435B1 (en) Prompts for user feedback
WO2021061512A1 (en) Multi-assistant natural language input processing
CN117612519A (en) Voice awakening method, device, equipment and medium
CN115691478A (en) Voice wake-up method and device, man-machine interaction equipment and storage medium
CN115705840A (en) Voice wake-up method and device, electronic equipment and readable storage medium
CN111696551A (en) Device control method, device, storage medium, and electronic apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant