CN117334198A - Speech signal processing method, device, electronic equipment and computer readable medium - Google Patents

Speech signal processing method, device, electronic equipment and computer readable medium

Info

Publication number
CN117334198A
Authority
CN
China
Prior art keywords
signal
voice
instruction information
speech
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311189722.1A
Other languages
Chinese (zh)
Other versions
CN117334198B (en)
Inventor
李晶
韩海潮
王佩琳
张晓凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongguancun Smart City Co Ltd
Original Assignee
Zhongguancun Smart City Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongguancun Smart City Co Ltd
Priority to CN202311189722.1A
Publication of CN117334198A
Application granted
Publication of CN117334198B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present disclosure disclose a speech signal processing method, apparatus, electronic device, and computer readable medium. One embodiment of the method comprises the following steps: performing signal preprocessing on a voice signal to generate a preprocessed voice signal; generating a voice signal category according to the preprocessed voice signal; generating text instruction information according to an audio signal conversion model and the preprocessed voice signal; removing a target wake-up word from the text instruction information to obtain candidate control instruction information; and performing the following first processing operation on the candidate control instruction information: generating control instruction information according to a key information extraction model and the candidate control instruction information, and executing, on the control object corresponding to the control object information, the control action corresponding to the control action information. This embodiment avoids the problem that the intelligent device cannot be effectively controlled according to the control instruction because the control instruction obtained by recognizing the voice signal is not accurate enough under the influence of environmental noise.

Description

Speech signal processing method, device, electronic equipment and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, an apparatus, an electronic device, and a computer readable medium for processing a voice signal.
Background
Intelligent devices based on voice interaction are rapidly becoming part of everyday life, and voice interaction with intelligent devices relies on accurate processing and recognition of voice signals. Currently, voice signals are generally processed in the following way: voice signal recognition is performed directly by analyzing the voice signal.
However, the inventors found that when the above manner is adopted, there are often the following technical problems:
firstly, the collected voice signal is often affected by environmental noise and contains many interfering components, so the control instruction obtained by directly analyzing and recognizing the voice signal is very likely to be insufficiently accurate, with the result that the intelligent device cannot be effectively controlled according to the control instruction;
secondly, the voice signal is affected by the speaking rate and often has sparse or dense characteristics. If feature extraction is performed directly on a dense voice signal, the signal features cannot be extracted effectively, because adjacent values in a dense signal are very close to one another; this affects the accuracy of the subsequently generated text instruction information;
thirdly, the conventional way of converting a voice feature vector into text instruction information depends strongly on the quality of the voice feature vector. When the vector quality is poor, that is, when the vector cannot represent the voice signal well, the extracted text instruction information may be inaccurate, which affects the subsequent accurate control of the intelligent device.
The information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, may contain information that does not constitute prior art already known to a person of ordinary skill in the art in this country.
Disclosure of Invention
This section is intended to introduce concepts in a simplified form that are further described below in the detailed description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a speech signal processing method, apparatus, electronic device, and computer readable medium to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a speech signal processing method, the method comprising: in response to recognizing a voice signal, performing signal preprocessing on the voice signal to generate a preprocessed voice signal; generating a voice signal category according to the preprocessed voice signal, wherein the voice signal category characterizes the category of the object that generated the voice signal; in response to determining that the voice signal category is a target category, generating text instruction information corresponding to the preprocessed voice signal according to a pre-trained audio signal conversion model and the preprocessed voice signal; in response to determining that the text instruction information contains a target wake-up word, removing the target wake-up word from the text instruction information to obtain candidate control instruction information; and in response to determining that the candidate control instruction information satisfies a first selection condition, performing the following first processing operation on the candidate control instruction information: generating control instruction information corresponding to the candidate control instruction information according to a pre-trained key information extraction model and the candidate control instruction information, wherein the control instruction information comprises control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device; and executing, on the control object corresponding to the control object information, the control action corresponding to the control action information.
In a second aspect, some embodiments of the present disclosure provide a speech signal processing apparatus, the apparatus comprising: a preprocessing unit configured to perform signal preprocessing on a voice signal, in response to recognizing the voice signal, to generate a preprocessed voice signal; a first generation unit configured to generate a voice signal category according to the preprocessed voice signal, wherein the voice signal category characterizes the category of the object that generated the voice signal; a second generation unit configured to generate, in response to determining that the voice signal category is a target category, text instruction information corresponding to the preprocessed voice signal according to a pre-trained audio signal conversion model and the preprocessed voice signal; a rejecting unit configured to remove, in response to determining that the text instruction information contains a target wake-up word, the target wake-up word from the text instruction information to obtain candidate control instruction information; and an execution unit configured to perform, in response to determining that the candidate control instruction information satisfies a first selection condition, the following processing operation on the candidate control instruction information: generating control instruction information corresponding to the candidate control instruction information according to a pre-trained key information extraction model and the candidate control instruction information, wherein the control instruction information comprises control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device; and executing, on the control object corresponding to the control object information, the control action corresponding to the control action information.
In a third aspect, some embodiments of the present disclosure provide an electronic device comprising: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the implementations of the first aspect above.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect above.
The above embodiments of the present disclosure have the following beneficial effects: with the speech signal processing method of some embodiments of the present disclosure, the problem that the intelligent device cannot be effectively controlled according to the control instruction, because the control instruction obtained by recognizing the voice signal is not accurate enough under the influence of environmental noise, is avoided. Specifically, the reason the control instruction obtained by recognizing the voice signal is not accurate enough is as follows: the collected voice signal is often affected by environmental noise, the noise degrades the quality of the voice signal, and the collected voice signal therefore contains many interfering components; when the control instruction is obtained by directly analyzing and recognizing such a voice signal, the instruction is very likely to be insufficiently accurate, so the intelligent device cannot be effectively controlled according to it. Based on this, the speech signal processing method of some embodiments of the present disclosure first performs signal preprocessing on the voice signal, in response to recognizing the voice signal, to generate a preprocessed voice signal. Preprocessing the voice signal reduces the interference of environmental noise with voice signal recognition and avoids the poor recognition accuracy that results when the collected voice signal is affected by environmental noise and contains many interfering components. Secondly, a voice signal category is generated according to the preprocessed voice signal, where the voice signal category characterizes the category of the object that generated the voice signal. In practice, there are often many kinds of objects that can generate voice signals, and determining the voice signal category makes it possible to determine the generating object accurately. Then, in response to determining that the voice signal category is the target category, text instruction information corresponding to the preprocessed voice signal is generated according to a pre-trained audio signal conversion model and the preprocessed voice signal, thereby obtaining the text instruction information contained in the voice signal. Further, in response to determining that the text instruction information contains the target wake-up word, the target wake-up word is removed from the text instruction information to obtain candidate control instruction information. Finally, in response to determining that the candidate control instruction information satisfies a first selection condition, the following first processing operation is performed on the candidate control instruction information. First, control instruction information corresponding to the candidate control instruction information is generated according to a pre-trained key information extraction model and the candidate control instruction information, where the control instruction information includes control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device. Second, the control action corresponding to the control action information is executed on the control object corresponding to the control object information.
In this way, even when the voice signal is affected by environmental noise, the text control instruction corresponding to the voice signal can still be recognized, so that the intelligent device is effectively controlled.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of some embodiments of a speech signal processing method according to the present disclosure;
fig. 2 is a schematic structural diagram of some embodiments of a speech signal processing apparatus according to the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Referring to fig. 1, a flow 100 of some embodiments of a speech signal processing method according to the present disclosure is shown. The voice signal processing method comprises the following steps:
In step 101, in response to recognizing the speech signal, signal preprocessing is performed on the speech signal to generate a preprocessed speech signal.
In some embodiments, in response to identifying a speech signal, an executing body of a speech signal processing method (e.g., a computing device) may perform signal preprocessing on the speech signal to generate a preprocessed speech signal. Wherein the speech signal may be a continuous analog signal. In practice, the speech signal may characterize speech information received by the mobile terminal. For example, the mobile terminal may be a cell phone.
As an example, the above-described execution body may perform signal preprocessing on the voice signal by means of filtering and noise reduction to generate a preprocessed voice signal. The execution body may perform wavelet noise reduction processing on the voice signal to generate a preprocessed voice signal.
The computing device may be hardware or software. When it is hardware, it may be the mobile terminal described above. When it is software, it may be installed in the hardware device listed above and may be implemented as a single piece of software or as a software module; this is not specifically limited herein. It should be appreciated that there may be any number of computing devices, as required by the implementation.
In some optional implementations of some embodiments, in response to identifying a speech signal, the executing body performs signal preprocessing on the speech signal to generate a preprocessed speech signal, and may include the steps of:
and step one, carrying out normalization processing on the voice signals to obtain normalized voice signals.
The executing body performs normalization processing on the voice signal through a normalization formula to obtain a normalized voice signal. The normalization formula is: normalized speech signal = (speech signal - minimum amplitude) / (maximum amplitude - minimum amplitude) x 2 - 1. The minimum amplitude characterizes the minimum amplitude value of the speech signal, and the maximum amplitude characterizes its maximum amplitude value. In practice, first, the executing body may uniformly sample the voice signal on a time scale to obtain a set of sample points, where the set contains at least one sample point and each sample point corresponds to an amplitude value. The executing body may then traverse the sample points to determine the minimum and maximum amplitudes.
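As an example, a minimal sketch of this normalization step in Python (the NumPy dependency and the function name are illustrative assumptions, and a constant signal would need an additional guard against division by zero):

import numpy as np

def normalize_speech_signal(signal: np.ndarray) -> np.ndarray:
    # normalized speech signal = (signal - minimum amplitude) /
    #                            (maximum amplitude - minimum amplitude) * 2 - 1
    min_amp = float(signal.min())
    max_amp = float(signal.max())
    return (signal - min_amp) / (max_amp - min_amp) * 2.0 - 1.0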
And secondly, carrying out framing processing on the normalized voice signals so as to generate a short-time audio frame signal set.
Wherein the frame lengths of the short-time audio frame signals in the short-time audio frame signal set are consistent. In practice, first, the above-described execution body may determine a frame length and a frame shift. Wherein the frame length characterizes the signal length of each short-time audio frame signal. For example, the frame length may be "20 milliseconds". The frame length may also characterize the window size of the sliding window. The frame shift characterizes the movement step of the sliding window. In practice, the executing body may use the frame length as the window size of the sliding window, and control the sliding window to move according to the frame movement, so as to perform frame division processing on the normalized voice signal, so as to generate a short-time audio frame signal set.
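As an example, the sliding-window framing may be sketched as follows (the 16 kHz sampling rate and the 10 ms frame shift are assumptions; only the 20 ms frame length is taken from the description above):

import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int = 16000,
                 frame_ms: float = 20.0, shift_ms: float = 10.0) -> np.ndarray:
    # The frame length is the window size of the sliding window and the frame
    # shift is its movement step; every window position yields one short-time frame.
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                     for i in range(n_frames)])  # shape: (n_frames, frame_len)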
Third, for each short-time audio frame signal in the above-described short-time audio frame signal set, the following processing steps are performed:
and 1, windowing the short-time audio frame signal to obtain a window audio frame signal.
In practice, the execution body may perform windowing processing by multiplying the short-time audio frame signal by a window function to obtain a window audio frame signal. Wherein the window function is a function with time as an argument. For example, the window function may be a hamming window function.
And 2, performing fast Fourier transform on the window audio frame signal to generate a frequency domain signal.
And 3, performing Gaussian filtering noise reduction on the frequency domain signal to obtain a filtered frequency domain signal.
In practice, the execution body may perform convolution operation on the frequency domain signal by using a gaussian filter to implement filtering and noise reduction, so as to obtain a filtered frequency domain signal.
And step 4, performing smoothing processing on the filtered frequency domain signal to obtain a smoothed frequency domain signal.
In practice, the executing body may perform smoothing processing on the filtered frequency domain signal through median filtering, to obtain a smoothed frequency domain signal.
And a sub-step 5 of performing inverse Fourier transform on the frequency domain signal after the smoothing processing to generate a time domain signal.
And step 6, carrying out windowing processing on the time domain signals to obtain windowed time domain signals.
In practice, the execution body may perform windowing processing by multiplying the time domain signal by an inverse window function, so as to obtain a windowed time domain signal. Wherein the inverse window function is the reciprocal of the window function in the previous substep 1. For example, if the window function in substep 1 is a Hamming window function, the inverse window function formula may be: inverse window function = 1 / Hamming window function.
And fourthly, splicing all the windowed time domain signals in the obtained windowed time domain signal set to generate the preprocessed voice signals.
In practice, the execution entity may use an OLA method (Overlap-add method) to splice each of the windowed time-domain signals in the obtained windowed time-domain signal set, so as to generate the preprocessed speech signal.
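As an example, sub-steps 1 to 6 together with the overlap-add splicing of the fourth step may be sketched as follows (a simplified illustration in which the Gaussian filtering and median smoothing are applied to the magnitude spectrum while the phase is preserved; the kernel width, the median window size and the small constant guarding the division by the Hamming window are assumptions):

import numpy as np
from scipy.ndimage import gaussian_filter1d, median_filter

def denoise_and_splice(frames: np.ndarray, frame_shift: int) -> np.ndarray:
    frame_len = frames.shape[1]
    window = np.hamming(frame_len)
    output = np.zeros((len(frames) - 1) * frame_shift + frame_len)
    for i, frame in enumerate(frames):
        windowed = frame * window                                    # sub-step 1: windowing
        spectrum = np.fft.rfft(windowed)                             # sub-step 2: fast Fourier transform
        magnitude = gaussian_filter1d(np.abs(spectrum), sigma=2.0)   # sub-step 3: Gaussian filtering
        magnitude = median_filter(magnitude, size=3)                 # sub-step 4: median smoothing
        spectrum = magnitude * np.exp(1j * np.angle(spectrum))       # keep the original phase
        time_sig = np.fft.irfft(spectrum, n=frame_len)               # sub-step 5: inverse Fourier transform
        dewindowed = time_sig / np.maximum(window, 1e-3)             # sub-step 6: inverse window (1 / Hamming)
        output[i * frame_shift: i * frame_shift + frame_len] += dewindowed  # fourth step: overlap-add splicing
    return output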
Step 102, generating a voice signal category according to the preprocessed voice signal.
In some embodiments, the executing entity may generate a voice signal class according to the preprocessed voice signal, wherein the voice signal class characterizes an object class generating the voice signal.
In practice, the executing body may input the preprocessed speech signal into a pre-trained signal class generation model to obtain the speech signal class. The signal class generation model includes a feature extraction model and a feature classification model. The feature extraction model is a model for extracting features of the preprocessed speech signal; in practice, it may be a convolutional neural network model. The feature classification model takes the output of the feature extraction model as input and is used to generate the speech signal class; in practice, it may be a fully connected layer. Specifically, the feature classification model may output a 1x2 feature vector, that is, the feature classification model may be a binary classification model.
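As an example, such a signal class generation model may be sketched in PyTorch as follows (the feature dimensions, the channel count and the use of average pooling are illustrative assumptions; only the convolutional feature extractor, the fully connected classifier and the 1x2 output follow the description above):

import torch
import torch.nn as nn

class SignalClassGenerationModel(nn.Module):
    def __init__(self, n_features: int = 40):
        super().__init__()
        # feature extraction model: a convolutional neural network
        self.feature_extraction = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # feature classification model: a fully connected layer with a 1x2 output
        self.feature_classification = nn.Linear(64, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, n_frames) features of the preprocessed speech signal
        h = self.feature_extraction(x).squeeze(-1)
        return self.feature_classification(h)  # class logits, e.g. "person" vs. other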
And step 103, in response to determining the voice signal category as the target category, generating text instruction information corresponding to the pre-processed voice signal according to the pre-trained audio signal conversion model and the pre-processed voice signal.
In some embodiments, in response to determining that the speech signal class is the target class, the execution body may generate text instruction information corresponding to the pre-processed speech signal according to the audio signal conversion model and the pre-processed speech signal. In practice, the target class characterizes the class of objects that may operate on the mobile terminal. For example, the target category may be "people". Wherein the audio signal conversion model is a model for converting an audio signal into text information. In practice, the audio signal conversion model may be a speech recognition model that takes a preprocessed speech signal as an input and text instruction information as an output. Specifically, the audio signal conversion model may be an OpenSeq2Seq model. The text instruction information is text information corresponding to the preprocessed voice signal.
The text instruction information may be, for example, "Hello, please turn on the light and turn off the air conditioner!".
In practice, the execution body may take the preprocessed voice signal as an input of the audio signal conversion model to generate text instruction information corresponding to the preprocessed voice signal.
Optionally, the audio signal conversion model includes: a speech signal processing model and a text conversion model. Wherein the speech signal processing model is a model for converting a speech signal into speech feature vectors. In particular, the speech signal processing model may be a linear predictive coding (Linear Predictive Coding, LPC) model. The text conversion model is used for converting the voice characteristic vector into text instruction information. The text conversion model is used for determining text instructions corresponding to the preprocessed voice signals. In practice, the text conversion model may be a semantic feature extraction model. For example, the text conversion model may be a neural network model based on a transducer structure.
In some optional implementations of some embodiments, in response to determining that the speech signal class is the target class, the executing body generates text instruction information corresponding to the pre-processed speech signal according to a pre-trained audio signal conversion model and the pre-processed speech signal, and may include the steps of:
The first step, the pre-processed voice signal is subjected to framing processing to generate a short-time voice frame signal set.
It should be noted that, the specific implementation manner of generating the short-time voice frame signal set may refer to the specific implementation manner of generating the short-time audio frame signal set, which is not described herein again.
And secondly, windowing the short-time voice frame signal set to obtain a windowed short-time voice frame signal set.
It should be noted that, the specific implementation manner of generating the windowed short-time speech frame signal set may refer to the specific implementation manner of generating the windowed audio frame signal, which is not described herein again.
And thirdly, determining the instantaneous energy value of the windowed short-time voice frame signal for each windowed short-time voice frame signal in the windowed short-time voice frame signal set.
Wherein the instantaneous energy value characterizes the total signal strength of the speech signal.
In practice, the executing body sums the squares of the amplitudes corresponding to each time point in the windowed short-time voice frame signal to obtain the instantaneous energy value of the windowed short-time voice frame signal.
Fourth, short-time voice frame signals after windowing, corresponding to the instantaneous energy values meeting the removing conditions, are removed from the short-time voice frame signals after windowing, and a short-time voice frame signal set after removing is obtained.
Wherein, the above-mentioned rejection condition is: the instantaneous energy value is less than a preset instantaneous energy threshold. In practice, the preset instantaneous energy threshold may be an average value of amplitudes corresponding to each time point in the short-time voice frame signal after the above-mentioned windowing.
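As an example, the third and fourth steps may be sketched as follows (the windowed short-time speech frames are assumed to be rows of a NumPy array; the energy definition and the threshold rule follow the description above):

import numpy as np

def remove_low_energy_frames(windowed_frames: np.ndarray) -> np.ndarray:
    kept = []
    for frame in windowed_frames:
        instantaneous_energy = np.sum(frame ** 2)     # sum of squared amplitudes
        energy_threshold = np.mean(np.abs(frame))     # mean amplitude of the frame
        if instantaneous_energy >= energy_threshold:  # frames below the threshold are removed
            kept.append(frame)
    if not kept:
        return np.empty((0, windowed_frames.shape[1]))
    return np.stack(kept)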
And fifthly, splicing all the short-time voice frame signals after the elimination in the short-time voice frame signal set after the elimination so as to generate voice signals after the elimination of silence.
It should be noted that, the specific implementation manner of generating the above voice signal after removing silence may refer to the specific implementation manner of generating the above voice signal after preprocessing, which is not described herein again.
And sixthly, performing speech speed scaling operation on the voice signals after the silence removal to obtain scaled voice signals.
In practice, the executing body performs time domain stretching on the high-frequency part of the voice signal after silence removal to obtain the voice signal after scaling.
In some optional implementations of some embodiments, the performing body performs a speech rate scaling operation on the unmuted speech signal to obtain a scaled speech signal, and may include the following steps:
and a first sub-step, carrying out framing processing on the voice signals subjected to the silence removal so as to generate a voice frame signal set subjected to the short-time silence removal.
It should be noted that, the specific implementation manner of generating the short-time de-muted speech frame signal set may refer to the specific implementation manner of generating the short-time audio frame signal set, which is not described herein again.
A second sub-step of performing the following time interval scaling step for each short-time de-muted voice frame signal of the above-described short-time de-muted voice frame signal set:
and a substep 1, determining a time point set of the voice frame signal after short-time de-muting.
In practice, firstly, the executing body performs downsampling on the short-time de-muted voice frame signal to obtain a de-muted voice frame signal sampling point set. Then, the executing body may determine, as the set of time points, a time point corresponding to the unmuted speech frame signal sampling point in the unmuted speech frame signal sampling point set.
And 2, determining the amplitude corresponding to each time point in the time point set to generate an amplitude value, and obtaining an amplitude value set.
And 3, in response to determining that the amplitude value set has an amplitude value larger than a preset amplitude threshold, performing interpolation processing on the time point set of the short-time de-muted voice frame signal to obtain a short-time voice frame signal after the scaling.
Wherein the preset amplitude threshold may be "10000Hz". In practice, the executing body may perform interpolation processing on the time point set of the short-time de-muted voice frame signal by using a linear interpolation method, so as to obtain a scaled short-time voice frame signal.
And a third sub-step, splicing the scaled short-time voice frame signals in the scaled short-time voice frame signal set to obtain the scaled voice signal.
It should be noted that, the specific implementation manner of generating the scaled voice signal may refer to the specific implementation manner of generating the preprocessed voice signal, which is not described herein again.
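As an example, the time interval scaling step for a single short-time de-muted speech frame may be sketched as follows (the downsampling factor and the stretch ratio are assumptions; the amplitude threshold of 10000 and the use of linear interpolation follow the description above):

import numpy as np

def scale_frame(frame: np.ndarray, downsample: int = 2,
                amplitude_threshold: float = 10000.0, stretch: float = 2.0) -> np.ndarray:
    # sub-step 1: downsample and record the time points of the sampling points
    time_points = np.arange(len(frame))[::downsample]
    sample_points = frame[::downsample]
    # sub-step 2: amplitude values at those time points
    amplitude_values = np.abs(sample_points)
    # sub-step 3: if any amplitude exceeds the threshold, interpolate the frame
    # onto a denser time axis, i.e. stretch its time scale
    if np.any(amplitude_values > amplitude_threshold):
        new_time = np.linspace(time_points[0], time_points[-1],
                               int(len(time_points) * stretch))
        return np.interp(new_time, time_points, sample_points)
    return frame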
And seventh, generating a voice characteristic vector according to the voice signal processing model and the scaled voice signal.
In practice, the execution body inputs the scaled speech signal into the speech signal processing model to generate a speech feature vector.
The content of the foregoing "in some optional implementations of some embodiments" is taken as an invention point of the present disclosure and solves the second technical problem mentioned in the background art, namely that the voice signal is affected by the speaking rate and often has sparse or dense characteristics, and that if feature extraction is performed directly on a dense voice signal, the signal features cannot be extracted effectively because adjacent values in a dense signal are very close to one another, which affects the accuracy of the subsequently generated text instruction information. In practice, when the speaking rate is fast, the voice signal often has dense characteristics; adjacent signal values in the dense voice signal are very close and compactly distributed, and if feature extraction is performed directly on such a signal, the signal features cannot be extracted effectively, which affects the accuracy of the subsequently generated text instruction information. Based on this, the present disclosure first performs framing processing on the de-muted voice signal to generate a set of short-time de-muted voice frame signals, which helps to reduce the computational complexity of the subsequent processing steps. Next, for each short-time de-muted voice frame signal in the set, the following time interval scaling step is performed. First, the set of time points of the short-time de-muted voice frame signal is determined. Second, the amplitude corresponding to each time point in the set of time points is determined to generate an amplitude value, giving a set of amplitude values. Third, in response to determining that the set of amplitude values contains a value larger than the preset amplitude threshold, interpolation processing is performed on the set of time points of the short-time de-muted voice frame signal to obtain a scaled short-time voice frame signal. This helps to enlarge the time scale of the short-time de-muted voice frame signal and stretch the high-frequency voice signal, thereby improving the accuracy of the subsequently generated text instruction information. Fourth, the scaled short-time voice frame signals in the obtained set are spliced to obtain the scaled voice signal. In this way, a reasonable scaling of dense voice signals is achieved, which, together with the voice signal processing model and the text conversion model, ensures the accuracy of the text control instruction obtained on the basis of the scaled voice signal.
In some optional implementations of some embodiments, the executing body may generate the speech feature vector according to the speech signal processing model and the scaled speech signal, and may include the steps of:
and a first substep, pre-emphasis processing is carried out on the amplified and contracted voice signals to obtain emphasized voice signals.
And a second sub-step, carrying out framing treatment on the emphasized voice signals to obtain a voice frame signal set.
It should be noted that, the specific implementation manner of generating the above-mentioned voice frame signal set may refer to the specific implementation manner of generating the above-mentioned short-time audio frame signal set, which is not described herein again.
And a third sub-step of performing windowing processing on each voice frame signal of the voice frame signal set to generate a windowed voice frame signal.
It should be noted that, the specific implementation manner of generating the windowed speech frame signal may refer to the specific implementation manner of generating the windowed audio frame signal, which is not described herein.
A fourth sub-step of, for each windowed speech frame signal in the windowed speech frame signal set, performing the steps of:
and a substep 1, performing fast fourier transform on the windowed voice frame signal to generate a voice frequency domain signal.
And 2, generating the power spectrum density corresponding to the windowed voice frame signal according to the voice frequency domain signal.
Wherein the power spectral density characterizes the power or energy density of the signal at different frequencies.
In practice, the executing body generates a power spectral density corresponding to the windowed speech frame signal according to a power spectral density calculation formula and the speech frequency domain signal. Wherein, the power spectral density calculation formula is:
P(f) = |F(f)|²
where f is the frequency. P (f) is the power spectral density at frequency f. F (F) is the fourier transform result of the speech frequency domain signal at frequency F.
And 3, creating a filter sequence.
Wherein the filter sequence includes a first predetermined number of filters. The filters in the above-described filter sequence may be triangular filters. The center frequency interval of each two adjacent filters in the filter sequence is the same. For example, the first preset number may be 13.
As an example, the above-mentioned filter may be a mel filter.
And step 4, inputting the power spectrum density into each triangular filter in the filter sequence to generate a filtered output result, and obtaining a filtered output result set.
And a sub-step 5 of carrying out logarithmic processing on each filtered output result in the filtered output result set so as to generate a logarithmic output result and obtain a logarithmic output result set.
In practice, the execution body takes the logarithm of each filtered output result in the filtered output result set to generate a logarithmic output result, and obtains a logarithmic output result set.
And a sub-step 6 of performing discrete cosine transform on each logarithmic output result in the logarithmic output result set to obtain a candidate characteristic coefficient set.
And 7, carrying out mean normalization processing on the candidate characteristic coefficient set to obtain the characteristic coefficient.
And a fifth substep of generating the speech feature vector according to the obtained feature coefficient set and the speech signal processing model.
In practice, the execution body inputs the set of feature coefficients into the speech signal processing model to generate the speech feature vector.
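As an example, sub-steps 1 to 7 for a single windowed speech frame may be sketched as follows (the sampling rate is an assumption, the triangular filters are built here as mel filters with equally spaced centres on the mel scale, and the mean normalization is applied per frame for brevity; the 13 filters, the logarithm, the discrete cosine transform and the power spectrum follow the description above):

import numpy as np

def frame_feature_coefficients(windowed_frame: np.ndarray,
                               sample_rate: int = 16000,
                               n_filters: int = 13) -> np.ndarray:
    n_fft = len(windowed_frame)
    spectrum = np.fft.rfft(windowed_frame)            # sub-step 1: fast Fourier transform
    power = np.abs(spectrum) ** 2                     # sub-step 2: P(f) = |F(f)|^2
    # sub-step 3: triangular (mel) filter bank
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    filters = np.zeros((n_filters, len(power)))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            filters[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            filters[m - 1, k] = (right - k) / max(right - centre, 1)
    filtered = filters @ power                        # sub-step 4: filtered output results
    log_out = np.log(filtered + 1e-10)                # sub-step 5: logarithmic processing
    # sub-step 6: discrete cosine transform (type II) of the logarithmic outputs
    n = np.arange(n_filters)
    dct = np.array([np.sum(log_out * np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)))
                    for k in range(n_filters)])
    return dct - dct.mean()                           # sub-step 7: mean normalization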
Eighth, generating text instruction information corresponding to the preprocessed voice signal according to the voice feature vector and the text conversion model.
Alternatively, the text conversion model may include: a speech signal conversion coding model, an attention mechanism model, and a speech signal conversion decoding model.
In some optional implementations of some embodiments, the executing body inputting the speech feature vector into the text conversion model to generate text instruction information corresponding to the preprocessed speech signal may include the steps of:
and firstly, carrying out standardization processing on the voice feature vector to obtain the standardized voice feature vector.
In practice, first, the execution subject may determine the mean and standard deviation of the individual dimensions in the speech feature vector. And then, the execution main body performs standardization processing on the voice characteristic vector through a standardization formula to obtain the standardized voice characteristic vector. Wherein, the standardized formula is: normalized speech feature vector= (speech feature vector-mean)/standard deviation.
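As an example, a minimal sketch of this standardization (applied here over the elements of a single feature vector; the small constant guarding against a zero standard deviation is an assumption):

import numpy as np

def standardize_feature_vector(feature_vector: np.ndarray) -> np.ndarray:
    # standardized speech feature vector = (vector - mean) / standard deviation
    mean = feature_vector.mean()
    std = feature_vector.std()
    return (feature_vector - mean) / (std + 1e-10)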
And secondly, carrying out data enhancement processing on the normalized voice feature vector so as to generate a voice feature vector after data enhancement.
In practice, the execution subject may perform data enhancement processing on the normalized speech feature vector by using an acoustic occlusion technique to generate a data enhanced speech feature vector. Among them, the acoustic occlusion technique is a technique of occluding or covering a part of a speech feature vector. Specifically, first, the execution body sets a second preset number of speech feature vector values in the speech feature vectors as preset shielding values, so as to obtain shielded speech feature vectors. Wherein, the preset shielding value may be zero. And then, the execution main body splices the shielded voice characteristic vector and the voice characteristic vector to obtain the voice characteristic vector after data enhancement. For example, the second preset number may be 3.
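As an example, the acoustic occlusion data enhancement may be sketched as follows (the random choice of which values to occlude and the fixed seed are assumptions; the count of 3, the zero shielding value and the final splicing follow the description above):

import numpy as np

def occlusion_augment(feature_vector: np.ndarray, n_shielded: int = 3,
                      shielding_value: float = 0.0, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    shielded = feature_vector.copy()
    idx = rng.choice(len(feature_vector),
                     size=min(n_shielded, len(feature_vector)), replace=False)
    shielded[idx] = shielding_value                     # occlude part of the feature vector
    return np.concatenate([shielded, feature_vector])   # splice shielded and original vectors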
And thirdly, inputting the data enhanced voice characteristic vector into the voice signal conversion coding model to generate a coded voice characteristic vector.
The speech signal conversion coding model takes the speech characteristic vector after the data enhancement as input and takes the embedded vector as output. The embedded vector may be a high-level representation of speech signal features corresponding to the data-enhanced speech feature vector. In practice, the network structure of the speech signal transcoding model may include: a first convolution layer, a first pooling layer, a first attention mechanism layer, a second convolution layer, a second pooling layer, a second attention mechanism layer, and a fully connected layer. The first convolution layer may perform a first convolution process on the speech feature vector after the data enhancement. The first convolution layer may be formed from a third predetermined number of 3×3 convolution kernels and a ReLU activation function. The first pooling layer may perform a maximum pooling process on the output of the first convolution layer. The first attention mechanism layer may be a convolutional layer based on a channel attention mechanism. The second convolution layer may perform a second convolution process on the speech feature vector after the data enhancement. The second convolution layer may be formed by a third predetermined number of 1×1 convolution kernels and a Sigmoid activation function. The second pooling layer may perform an average pooling process on the output of the second convolution layer. The second attention mechanism layer may be a convolution layer based on a spatial attention mechanism model. The fully-connected layer can convert the output of the second attention mechanism layer into a coded voice feature vector. For example, the third preset number may be 64.
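As an example, a simplified PyTorch sketch of this coding model (the layers are chained sequentially, the channel and spatial attention layers are reduced to lightweight stand-ins, and the input size and embedding dimension are assumptions; the 3x3 and 1x1 kernels, the ReLU and Sigmoid activations, the max and average pooling and the final fully connected layer follow the description above):

import torch
import torch.nn as nn

class SpeechTranscodingEncoder(nn.Module):
    def __init__(self, in_hw=(16, 16), n_kernels=64, embed_dim=128):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, n_kernels, 3, padding=1), nn.ReLU())
        self.pool1 = nn.MaxPool2d(2)
        self.channel_attn = nn.Sequential(                # channel-attention stand-in
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(n_kernels, n_kernels, 1), nn.Sigmoid())
        self.conv2 = nn.Sequential(nn.Conv2d(n_kernels, n_kernels, 1), nn.Sigmoid())
        self.pool2 = nn.AvgPool2d(2)
        self.spatial_attn = nn.Sequential(                # spatial-attention stand-in
            nn.Conv2d(n_kernels, 1, 7, padding=3), nn.Sigmoid())
        h, w = in_hw[0] // 4, in_hw[1] // 4
        self.fc = nn.Linear(n_kernels * h * w, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, H, W), the data-enhanced feature vector reshaped as a 2D map
        x = self.pool1(self.conv1(x))
        x = x * self.channel_attn(x)
        x = self.pool2(self.conv2(x))
        x = x * self.spatial_attn(x)
        return self.fc(x.flatten(1))                      # encoded speech feature vector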
And fourthly, generating candidate voice feature vectors through the coded voice feature vectors and the attention mechanism model.
In practice, the execution body inputs the encoded speech feature vector into the attention mechanism model to generate a candidate speech feature vector. For example, the attention mechanism model may be a self-attention mechanism model.
And fifthly, generating text instruction information corresponding to the preprocessed voice signals according to the voice signal conversion decoding model and the candidate voice feature vectors.
In practice, the execution body inputs the candidate speech feature vector into the speech signal conversion decoding model to generate text instruction information corresponding to the preprocessed speech signal. The speech signal conversion decoding model is a model for converting candidate speech feature vectors into text instruction information. In practice, the speech signal conversion decoding model may be a speech processing sequence conversion model. For example, the speech signal transform decoding model may be a neural network model based on a Transducer module.
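As an example, the fourth step may be sketched with a self-attention layer (the sequence length, embedding dimension and head count are assumptions; decoding into text is left to the sequence transduction model described above):

import torch
import torch.nn as nn

embed_dim, seq_len, batch = 128, 10, 1
self_attention = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
encoded = torch.randn(batch, seq_len, embed_dim)           # encoded speech feature vectors
candidate, _ = self_attention(encoded, encoded, encoded)   # candidate speech feature vectors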
The content of the foregoing "in some optional implementations of some embodiments" is taken as an invention point of the present disclosure and solves the third technical problem mentioned in the background art, namely that the conventional way of converting a voice feature vector into text instruction information depends strongly on the quality of the voice feature vector, and that when the vector quality is poor, that is, when the vector cannot represent the voice signal well, the extracted text instruction information may be inaccurate, which affects the subsequent accurate control of the intelligent device. In practice, for example, feature loss may occur during the generation of voice feature vectors because of differences between modalities in the feature vector conversion process. As another example, because of the environment in which the voice signal is generated, the quality of the obtained voice feature vector may be poor, which can affect the accuracy of the subsequently generated text instruction information. Based on this, the present disclosure first performs standardization processing on the voice feature vector to obtain a standardized voice feature vector, so as to scale the voice feature vector. Secondly, data enhancement processing is performed on the standardized voice feature vector to generate a data-enhanced voice feature vector, thereby enhancing the features corresponding to the low-frequency signal. Then, the data-enhanced voice feature vector is input into the speech signal conversion coding model to generate an encoded voice feature vector, so that more signal features of the voice signal are captured on the basis of the data-enhanced voice feature vector. Next, the candidate voice feature vector is generated through the encoded voice feature vector and the attention mechanism model, so that global dependencies can be captured better and feature loss is avoided. Finally, text instruction information corresponding to the preprocessed voice signal is generated according to the speech signal conversion decoding model and the candidate voice feature vector. In this way, accurate text instruction information can be obtained, which in turn enables accurate control of the intelligent device.
And 104, in response to determining that the text instruction information contains the target wake-up word, eliminating the target wake-up word from the text instruction information to obtain candidate control instruction information.
In some embodiments, in response to determining that the text instruction information includes the target wake word, the execution subject may reject the target wake word from the text instruction information to obtain candidate control instruction information. Wherein the target wake-up word is a wake-up word for triggering the mobile terminal (execution body) to perform the first processing operation. For example, the target wake word may be "hello".
Step 105, in response to determining that the candidate control instruction information satisfies the first selection condition, performing the following first processing operation on the candidate control instruction information:
wherein the first selection condition is that the candidate control instruction information is control instruction information.
Step 1051, extracting the model and the candidate control instruction information according to the pre-trained key information, and generating the control instruction information corresponding to the candidate control instruction information.
The pre-trained key information extraction model may include K serially connected convolutional layers. The control instruction information includes control object information and control action information. The control object information includes a control object and a control object address; the control object characterizes a temperature and humidity control device, and the control object address characterizes the communication address of the control object in the local area network where the mobile terminal is located. The control action information includes a control action and a control action logical relationship; the control action characterizes the control operation of the control instruction information with respect to the control object, and the control action logical relationship characterizes the execution logic of that control operation.
Step 1052, the control operation corresponding to the control operation information is executed on the control object corresponding to the control object information.
In practice, the execution subject may execute the control action on the control object found by the control object address according to the control action logical relationship.
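As an example, a purely hypothetical sketch of step 1052 after the control instruction information has been parsed (the dictionary layout, the HTTP endpoint and the payload fields are invented for illustration only; any local-area-network protocol could be used instead):

import requests

def execute_control_instruction(control_instruction: dict) -> None:
    # assumed layout, e.g.:
    # {"control_object": "humidity controller", "address": "192.168.1.20",
    #  "actions": ["turn_on"], "logic": "sequential"}
    for action in control_instruction["actions"]:
        # send each control action, following the control action logical relationship,
        # to the control object found by its communication address on the local network
        requests.post(f"http://{control_instruction['address']}/control",
                      json={"action": action}, timeout=5)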
In some optional implementations of some embodiments, the method further includes:
in response to determining that the candidate control instruction information satisfies the second selection condition, performing the following second processing operation on the candidate control instruction information:
wherein the second selection condition is that the candidate control instruction information is non-control instruction information.
And a substep 1, generating voice transcription information corresponding to the candidate control instruction information according to a pre-trained language transcription model and the candidate control instruction information.
The pre-trained language transcription model is a model for translating text information. For example, the language transcription model may be a Seq2Seq model. The voice transcription information is information after the text information is translated into the target language.
In practice, the execution subject inputs the candidate control instruction information into a pre-trained language transcription model to generate speech transcription information corresponding to the candidate control instruction information.
And 2, displaying the voice transcription information in an input box.
As an example, the above-described input box may characterize an input area of the mobile terminal (execution subject).
The above embodiments of the present disclosure have the following beneficial effects: with the speech signal processing method of some embodiments of the present disclosure, the problem that the intelligent device cannot be effectively controlled according to the control instruction, because the control instruction obtained by recognizing the voice signal is not accurate enough under the influence of environmental noise, is avoided. Specifically, the reason the control instruction obtained by recognizing the voice signal is not accurate enough is as follows: the collected voice signal is often affected by environmental noise, the noise degrades the quality of the voice signal, and the collected voice signal therefore contains many interfering components; when the control instruction is obtained by directly analyzing and recognizing such a voice signal, the instruction is very likely to be insufficiently accurate, so the intelligent device cannot be effectively controlled according to it. Based on this, the speech signal processing method of some embodiments of the present disclosure first performs signal preprocessing on the voice signal, in response to recognizing the voice signal, to generate a preprocessed voice signal. Preprocessing the voice signal reduces the interference of environmental noise with voice signal recognition and avoids the poor recognition accuracy that results when the collected voice signal is affected by environmental noise and contains many interfering components. Secondly, a voice signal category is generated according to the preprocessed voice signal, where the voice signal category characterizes the category of the object that generated the voice signal. In practice, there are often many kinds of objects that can generate voice signals, and determining the voice signal category makes it possible to determine the generating object accurately. Then, in response to determining that the voice signal category is the target category, text instruction information corresponding to the preprocessed voice signal is generated according to a pre-trained audio signal conversion model and the preprocessed voice signal, thereby obtaining the text instruction information contained in the voice signal. Further, in response to determining that the text instruction information contains the target wake-up word, the target wake-up word is removed from the text instruction information to obtain candidate control instruction information. Finally, in response to determining that the candidate control instruction information satisfies a first selection condition, the following first processing operation is performed on the candidate control instruction information. First, control instruction information corresponding to the candidate control instruction information is generated according to a pre-trained key information extraction model and the candidate control instruction information, where the control instruction information includes control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device. Second, the control action corresponding to the control action information is executed on the control object corresponding to the control object information.
In this way, even when the voice signal is affected by environmental noise, the text control instruction corresponding to the voice signal can still be recognized, so that the intelligent device is effectively controlled.
With further reference to fig. 2, as an implementation of the method shown in the above figures, the present disclosure provides some embodiments of a speech signal processing apparatus, which correspond to those method embodiments shown in fig. 1, and which are particularly applicable in various electronic devices.
As shown in fig. 2, the speech signal processing apparatus 200 of some embodiments includes: a preprocessing unit 201, a first generation unit 202, a second generation unit 203, a culling unit 204, and an execution unit 205. Wherein the preprocessing unit 201 is configured to perform signal preprocessing on the above-mentioned voice signal in response to the recognition of the voice signal, so as to generate a preprocessed voice signal; the first generating unit 202 is configured to generate a speech signal class according to the preprocessed speech signal, wherein the speech signal class characterizes a class of objects that generate the speech signal; the second generating unit 203 is configured to generate text instruction information corresponding to the pre-processed voice signal according to a pre-trained audio signal conversion model and the pre-processed voice signal in response to determining that the voice signal class is a target class; the eliminating unit 204 is configured to eliminate the target wake-up word from the text instruction information to obtain candidate control instruction information in response to determining that the text instruction information contains the target wake-up word; the execution unit 205 is configured to execute the following processing operations on the candidate control instruction information in response to determining that the candidate control instruction information satisfies the first selection condition: generating control instruction information corresponding to the candidate control instruction information according to a pre-trained key information extraction model and the candidate control instruction information, wherein the control instruction information comprises: control object information and control action information, wherein the control object corresponding to the control object information represents an intelligent electrical appliance; and executing the control action corresponding to the control action information on the control object corresponding to the control object information.
It will be appreciated that the elements recited in the speech signal processing apparatus 200 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations, features and advantages described above for the method are equally applicable to the speech signal processing device 200 and the units contained therein, and are not described here again.
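A rough structural sketch of how the five units could be composed is shown below; the callable signatures and the target-class check are assumptions chosen for readability rather than the actual interfaces of apparatus 200.

    from dataclasses import dataclass
    from typing import Callable, Optional, Tuple

    @dataclass
    class SpeechSignalProcessingApparatus:
        # Each field stands in for one unit of apparatus 200; the concrete
        # implementations are placeholders supplied by the integrator.
        preprocessing_unit: Callable       # raw signal -> preprocessed signal
        first_generation_unit: Callable    # preprocessed signal -> signal class
        second_generation_unit: Callable   # preprocessed signal -> text instruction
        culling_unit: Callable             # text instruction -> candidate instruction or None
        execution_unit: Callable           # candidate instruction -> executed control action

        def run(self, raw_signal, target_class="human_speech") -> Optional[Tuple]:
            signal = self.preprocessing_unit(raw_signal)
            if self.first_generation_unit(signal) != target_class:
                return None
            text = self.second_generation_unit(signal)
            candidate = self.culling_unit(text)
            return self.execution_unit(candidate) if candidate is not None else None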
Referring now to fig. 3, a schematic diagram of an electronic device (e.g., computing device) 300 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 3 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with programs stored in a read-only memory 302 or programs loaded from a storage 308 into a random access memory 303. In the random access memory 303, various programs and data necessary for the operation of the electronic device 300 are also stored. The processing means 301, the read only memory 302 and the random access memory 303 are connected to each other by a bus 304. An input/output interface 305 is also connected to the bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 3 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 309, or from storage device 308, or from read only memory 302. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that, the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to recognizing a voice signal, perform signal preprocessing on the voice signal to generate a preprocessed voice signal; generate a voice signal class according to the preprocessed voice signal, wherein the voice signal class characterizes the class of object that generated the voice signal; in response to determining that the voice signal class is a target class, generate text instruction information corresponding to the preprocessed voice signal according to a pre-trained audio signal conversion model and the preprocessed voice signal; in response to determining that the text instruction information contains a target wake-up word, eliminate the target wake-up word from the text instruction information to obtain candidate control instruction information; and in response to determining that the candidate control instruction information satisfies a first selection condition, perform the following first processing operation on the candidate control instruction information: generating control instruction information corresponding to the candidate control instruction information according to a pre-trained key information extraction model and the candidate control instruction information, wherein the control instruction information comprises control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device; and executing the control action corresponding to the control action information on the control object corresponding to the control object information.
Computer program code for carrying out operations for some embodiments of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor includes a preprocessing unit, a first generating unit, a second generating unit, a culling unit, and an executing unit. The names of these units do not limit the unit itself in some cases, and for example, the preprocessing unit may also be described as "a unit that performs signal preprocessing on the above-described voice signal to generate a preprocessed voice signal in response to recognition of the voice signal".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the invention, for example, technical solutions in which the above features are interchanged with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (8)

1. A method of speech signal processing, comprising:
in response to identifying a speech signal, performing signal preprocessing on the speech signal to generate a preprocessed speech signal;
generating a voice signal category according to the preprocessed voice signal, wherein the voice signal category represents an object category for generating the voice signal;
responding to the determination that the voice signal class is the target class, and generating text instruction information corresponding to the pre-processed voice signal according to a pre-trained audio signal conversion model and the pre-processed voice signal;
in response to determining that the text instruction information contains a target wake-up word, eliminating the target wake-up word from the text instruction information to obtain candidate control instruction information;
in response to determining that the candidate control instruction information satisfies a first selection condition, performing the following first processing operation on the candidate control instruction information:
generating control instruction information corresponding to the candidate control instruction information according to a pre-trained key information extraction model and the candidate control instruction information, wherein the control instruction information comprises: control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device;
and executing the control action corresponding to the control action information on the control object corresponding to the control object information.
2. The method of claim 1, wherein the method further comprises:
in response to determining that the candidate control instruction information satisfies a second selection condition, performing the following second processing operation on the candidate control instruction information:
generating voice transcription information corresponding to the candidate control instruction information according to a pre-trained language transcription model and the candidate control instruction information;
and displaying the voice transcription information in an input box.
3. The method of claim 2, wherein the performing, in response to identifying the speech signal, signal preprocessing on the speech signal to generate a preprocessed speech signal comprises:
carrying out normalization processing on the voice signal to obtain a normalized voice signal;
framing the normalized voice signal to generate a short-time audio frame signal set;
for each short-time audio frame signal in the set of short-time audio frame signals, performing the following processing steps:
windowing the short-time audio frame signal to obtain a windowed audio frame signal;
performing a fast Fourier transform on the windowed audio frame signal to generate a frequency domain signal;
performing Gaussian filtering and noise reduction on the frequency domain signal to obtain a filtered frequency domain signal;
smoothing the filtered frequency domain signal to obtain a smoothed frequency domain signal;
performing an inverse Fourier transform on the smoothed frequency domain signal to generate a time domain signal;
performing windowing processing on the time domain signal to obtain a windowed time domain signal;
and splicing all the windowed time domain signals in the obtained windowed time domain signal set to generate the preprocessed voice signals.
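Purely as an illustration of the steps recited in claim 3 above (and not as part of the claims), the following numpy/scipy sketch implements one possible reading of the preprocessing chain. The frame length, window type, Gaussian sigma, and smoothing kernel are assumptions, and applying the Gaussian filter to the real and imaginary parts of the spectrum is only one interpretation of the claimed noise reduction.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def preprocess_speech(signal, frame_len=400, sigma=2.0, smooth_len=5):
        # 1. Normalize the raw signal to the range [-1, 1].
        signal = np.asarray(signal, dtype=float)
        signal = signal / (np.max(np.abs(signal)) + 1e-12)

        # 2. Frame the normalized signal into short-time audio frames (no overlap here).
        n_frames = len(signal) // frame_len
        frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

        window = np.hanning(frame_len)
        processed = []
        for frame in frames:
            # 3. Windowing before the transform.
            windowed = frame * window
            # 4. Fast Fourier transform to the frequency domain.
            spectrum = np.fft.rfft(windowed)
            # 5. Gaussian filtering for noise reduction (applied per real/imaginary part).
            spectrum = (gaussian_filter1d(spectrum.real, sigma)
                        + 1j * gaussian_filter1d(spectrum.imag, sigma))
            # 6. Additional smoothing with a short moving-average kernel.
            kernel = np.ones(smooth_len) / smooth_len
            spectrum = np.convolve(spectrum, kernel, mode="same")
            # 7. Inverse Fourier transform back to the time domain.
            time_frame = np.fft.irfft(spectrum, n=frame_len)
            # 8. Second windowing pass on the time-domain signal.
            processed.append(time_frame * window)

        # 9. Splice all windowed time-domain frames into the preprocessed speech signal.
        return np.concatenate(processed) if processed else signal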
4. A method according to claim 3, wherein the audio signal conversion model comprises: a speech signal processing model and a text conversion model; and
wherein the generating, in response to determining that the voice signal class is the target class, text instruction information corresponding to the preprocessed voice signal according to a pre-trained audio signal conversion model and the preprocessed voice signal comprises:
framing the preprocessed voice signals to generate a short-time voice frame signal set;
windowing is carried out on the short-time voice frame signal set to obtain a windowed short-time voice frame signal set;
determining, for each windowed short-time speech frame signal in the windowed short-time speech frame signal set, an instantaneous energy value of the windowed short-time speech frame signal;
removing, from the windowed short-time voice frame signal set, the windowed short-time voice frame signals whose instantaneous energy values meet a removal condition, to obtain a removed short-time voice frame signal set;
splicing all the short-time voice frame signals in the removed short-time voice frame signal set to generate a silence-removed voice signal;
performing a speech-speed scaling operation on the silence-removed voice signal to obtain a scaled voice signal;
generating a voice characteristic vector according to the voice signal processing model and the scaled voice signal;
and generating text instruction information corresponding to the preprocessed voice signal according to the voice characteristic vector and the text conversion model.
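The silence culling and speech-speed scaling recited in claim 4 above might, under assumed parameter choices, look like the following sketch (offered for illustration only, not as part of the claims). The energy threshold and scaling factor are illustrative, and the speed scaling is approximated by FFT resampling, which also shifts pitch; claim 4 does not fix any of these details.

    import numpy as np
    from scipy.signal import resample

    def remove_silence_and_scale(preprocessed, frame_len=400, energy_threshold=1e-4,
                                 speed_factor=1.0):
        preprocessed = np.asarray(preprocessed, dtype=float)
        window = np.hanning(frame_len)

        # Frame and window the preprocessed signal into short-time speech frames.
        n_frames = len(preprocessed) // frame_len
        frames = preprocessed[: n_frames * frame_len].reshape(n_frames, frame_len) * window

        # Keep only frames whose instantaneous (mean-square) energy exceeds the threshold,
        # i.e. cull the frames that satisfy the removal condition.
        energies = np.mean(frames ** 2, axis=1)
        voiced = frames[energies >= energy_threshold]

        # Splice the remaining frames into the silence-removed speech signal.
        voiced_signal = voiced.reshape(-1)

        # Speech-speed scaling approximated by resampling (factor > 1 shortens the signal).
        if len(voiced_signal) == 0 or speed_factor == 1.0:
            return voiced_signal
        target_len = max(1, int(len(voiced_signal) / speed_factor))
        return resample(voiced_signal, target_len)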
5. The method of claim 4, wherein the generating speech feature vectors from the speech signal processing model and the scaled speech signal comprises:
performing pre-emphasis processing on the scaled voice signal to obtain an emphasized voice signal;
performing framing processing on the emphasized voice signal to obtain a voice frame signal set;
windowing each voice frame signal in the voice frame signal set to generate a windowed voice frame signal;
for each windowed speech frame signal in the set of windowed speech frame signals, performing the steps of:
performing a fast Fourier transform on the windowed speech frame signal to generate a speech frequency domain signal;
generating a power spectral density corresponding to the windowed speech frame signal according to the speech frequency domain signal;
creating a filter sequence, wherein the filter sequence comprises a first preset number of filters, and the center frequency interval of every two adjacent filters in the filter sequence is the same;
inputting the power spectral density into each filter in the filter sequence to generate a filtered output result, and obtaining a filtered output result set;
carrying out logarithmic processing on each filtered output result in the filtered output result set to generate a logarithmic output result, and obtaining a logarithmic output result set;
performing a discrete cosine transform on each logarithmic output result in the logarithmic output result set to obtain a candidate characteristic coefficient set, and carrying out mean normalization processing on the candidate characteristic coefficient set to obtain characteristic coefficients;
and generating the voice characteristic vector according to the obtained characteristic coefficient set and the voice signal processing model.
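One possible, non-authoritative rendering of the feature path in claim 5 above is sketched below, for illustration only. Note that the claim requires equally spaced filter centre frequencies (a linear rather than mel filter bank); the triangular filter shape, frame sizes, and coefficient count are assumptions for the example.

    import numpy as np
    from scipy.fft import dct

    def extract_feature_coefficients(signal, frame_len=400, hop=160,
                                     n_filters=26, n_ceps=13):
        signal = np.asarray(signal, dtype=float)

        # Pre-emphasis to lift the high-frequency part of the spectrum.
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        if len(emphasized) < frame_len:
            emphasized = np.pad(emphasized, (0, frame_len - len(emphasized)))

        window = np.hamming(frame_len)
        n_frames = 1 + (len(emphasized) - frame_len) // hop
        n_bins = frame_len // 2 + 1

        # Filter sequence with equally spaced centre frequencies (triangular shape assumed).
        centers = np.linspace(0, n_bins - 1, n_filters + 2)
        bins = np.arange(n_bins)
        filter_bank = np.zeros((n_filters, n_bins))
        for i in range(n_filters):
            left, mid, right = centers[i], centers[i + 1], centers[i + 2]
            rising = (bins - left) / (mid - left + 1e-9)
            falling = (right - bins) / (right - mid + 1e-9)
            filter_bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)

        coeffs = []
        for k in range(n_frames):
            # Framing and windowing of the emphasized signal.
            frame = emphasized[k * hop: k * hop + frame_len] * window
            # Fast Fourier transform and power spectral density of the frame.
            power = np.abs(np.fft.rfft(frame)) ** 2 / frame_len
            # Filter outputs, logarithmic processing, and discrete cosine transform.
            log_energy = np.log(filter_bank @ power + 1e-12)
            coeffs.append(dct(log_energy, type=2, norm="ortho")[:n_ceps])

        coeffs = np.array(coeffs)
        # Mean normalization of the candidate characteristic coefficients.
        return coeffs - coeffs.mean(axis=0, keepdims=True)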
6. A speech signal processing apparatus comprising:
a preprocessing unit configured to perform signal preprocessing on a voice signal in response to recognition of the voice signal, to generate a preprocessed voice signal;
a first generation unit configured to generate a speech signal class from the preprocessed speech signal, wherein the speech signal class characterizes a class of objects that generated the speech signal;
a second generation unit configured to generate text instruction information corresponding to the pre-processed voice signal according to a pre-trained audio signal conversion model and the pre-processed voice signal in response to determining that the voice signal class is a target class;
a culling unit configured to, in response to determining that the text instruction information contains a target wake-up word, eliminate the target wake-up word from the text instruction information to obtain candidate control instruction information;
an execution unit configured to, in response to determining that the candidate control instruction information satisfies a first selection condition, perform the following processing operation on the candidate control instruction information: generating control instruction information corresponding to the candidate control instruction information according to a pre-trained key information extraction model and the candidate control instruction information, wherein the control instruction information comprises: control object information and control action information, and the control object included in the control object information characterizes a temperature and humidity control device; and executing the control action corresponding to the control action information on the control object corresponding to the control object information.
7. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 5.
8. A computer readable medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 5.
CN202311189722.1A 2023-09-14 2023-09-14 Speech signal processing method, device, electronic equipment and computer readable medium Active CN117334198B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311189722.1A CN117334198B (en) 2023-09-14 2023-09-14 Speech signal processing method, device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311189722.1A CN117334198B (en) 2023-09-14 2023-09-14 Speech signal processing method, device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN117334198A true CN117334198A (en) 2024-01-02
CN117334198B CN117334198B (en) 2024-04-30

Family

ID=89289444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311189722.1A Active CN117334198B (en) 2023-09-14 2023-09-14 Speech signal processing method, device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN117334198B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257600A (en) * 2016-12-29 2018-07-06 中国移动通信集团浙江有限公司 Method of speech processing and device
US20190251972A1 (en) * 2016-09-22 2019-08-15 Zhejiang Geely Holding Group Co., Ltd. Speech processing method and device
CN110634470A (en) * 2018-06-06 2019-12-31 北京深鉴智能科技有限公司 Intelligent voice processing method and device
US20200184966A1 (en) * 2018-12-10 2020-06-11 Amazon Technologies, Inc. Wakeword detection
CN111383653A (en) * 2020-03-18 2020-07-07 北京海益同展信息科技有限公司 Voice processing method and device, storage medium and robot
CN111768783A (en) * 2020-06-30 2020-10-13 北京百度网讯科技有限公司 Voice interaction control method, device, electronic equipment, storage medium and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251972A1 (en) * 2016-09-22 2019-08-15 Zhejiang Geely Holding Group Co., Ltd. Speech processing method and device
CN108257600A (en) * 2016-12-29 2018-07-06 中国移动通信集团浙江有限公司 Method of speech processing and device
CN110634470A (en) * 2018-06-06 2019-12-31 北京深鉴智能科技有限公司 Intelligent voice processing method and device
US20200184966A1 (en) * 2018-12-10 2020-06-11 Amazon Technologies, Inc. Wakeword detection
CN111383653A (en) * 2020-03-18 2020-07-07 北京海益同展信息科技有限公司 Voice processing method and device, storage medium and robot
CN111768783A (en) * 2020-06-30 2020-10-13 北京百度网讯科技有限公司 Voice interaction control method, device, electronic equipment, storage medium and system

Also Published As

Publication number Publication date
CN117334198B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110491407B (en) Voice noise reduction method and device, electronic equipment and storage medium
US11948552B2 (en) Speech processing method, apparatus, electronic device, and computer-readable storage medium
EP3346462A1 (en) Speech recognizing method and apparatus
CN110459241B (en) Method and system for extracting voice features
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN110444202B (en) Composite voice recognition method, device, equipment and computer readable storage medium
WO2019232833A1 (en) Speech differentiating method and device, computer device and storage medium
CN112259116A (en) Method and device for reducing noise of audio data, electronic equipment and storage medium
CN113257283A (en) Audio signal processing method and device, electronic equipment and storage medium
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
CN108962226B (en) Method and apparatus for detecting end point of voice
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
CN108053834A (en) audio data processing method, device, terminal and system
CN117334198B (en) Speech signal processing method, device, electronic equipment and computer readable medium
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
Isyanto et al. Voice biometrics for Indonesian language users using algorithm of deep learning CNN residual and hybrid of DWT-MFCC extraction features
CN113782005B (en) Speech recognition method and device, storage medium and electronic equipment
CN113763976A (en) Method and device for reducing noise of audio signal, readable medium and electronic equipment
Qaisar et al. Efficient isolated speech to sign conversion based on the adaptive rate processing
CN113129926A (en) Voice emotion recognition model training method, voice emotion recognition method and device
CN117316160B (en) Silent speech recognition method, silent speech recognition apparatus, electronic device, and computer-readable medium
CN116705013B (en) Voice wake-up word detection method and device, storage medium and electronic equipment
CN110634475B (en) Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant