CN110600012A - Fuzzy speech semantic recognition method and system for artificial intelligence learning - Google Patents
- Publication number: CN110600012A (application CN201910713034.8A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G10L15/02 — Speech recognition; feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Speech recognition; creation of reference templates; training of speech recognition systems
Abstract
The invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. Targeting the fuzzy speech present in user-dictated voice commands, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs semantic conversion and recognition on the reconstructed standard speech. During GAN training, speech feature matching is used to map the input fuzzy speech to a broader, matched sample collection, and the GAN is trained on that sample collection.
Description
Technical Field
The application relates to the field of artificial intelligence control, in particular to a fuzzy speech semantic recognition method and system for artificial intelligence learning.
Background
As speech recognition and semantic conversion technologies have matured, people increasingly control service facilities with voice commands, and such applications are used more and more widely in smart buildings, smart communities and smart homes.
For example, people can dictate voice commands to control various services in a smart building, smart community or smart home: a user can say "please rise to the 15th floor" to the elevator of a smart building; say "please call XXX room", "please open the door, the door-opening password is xxxxxxxx" or "please lock the door" beside the access control system of a smart community; or say "please turn on the air conditioner" or "please turn on the ceiling light" near the central control panel of a smart home. The service facility collects the voice command signal and, after the necessary enhancement processing, converts and recognizes the voice command into semantic information; a control instruction in machine-code form is then generated from the semantic information, which is in natural-language character form, and the service facility executes the required work according to the control instruction. Compared with manual control via buttons or push-button control panels, voice commands give the user a more convenient experience and greater freedom. In particular, for users who lack the use of both hands or who are blind, or in situations where a control panel cannot be touched because of environmental obstacles or distance, voice control enhances the convenience and accessibility of smart buildings, smart communities and smart homes.
However, the current conversion of a voice command into semantic information, i.e., the recognition of natural-language characters from an acoustic signal, has a high probability of false conversion. Clear speech can be recognized well, but correct semantic conversion is especially difficult for fuzzy speech. Fuzzy speech arises from attenuation of the sound itself during transmission to the service facility, from interference by ambient noise, and from factors such as the user's own unclear pronunciation or accent; such a voice command cannot be directly recognized as correct semantic information, so the service facility cannot be controlled.
In the prior art, semantic recognition of fuzzy speech mainly relies on preprocessing, such as enhancement of the sound signal, combined with confidence evaluation; this cannot effectively solve the problem of accurate semantic recognition from fuzzy speech.
With the development of artificial intelligence, recognition models such as SVMs (support vector machines) and neural networks have been applied to semantic recognition of speech: feature quantities extracted from speech samples are used to train a recognition model, and the feature quantities of the speech to be recognized are then input into the model to obtain semantic information. However, if such a recognition model is used directly for semantic recognition of fuzzy speech, a problem arises: fuzzy speech varies in very rich ways, so its feature quantities are highly diverse, and fuzzy speech samples often lack representativeness; as a result, the artificial intelligence recognition model is insufficiently trained, and the trained model generalizes poorly to other fuzzy speech.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. Targeting the fuzzy speech present in user-dictated voice commands, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs semantic conversion and recognition on the reconstructed standard speech. During GAN training, speech feature matching is used to map the input fuzzy speech to a broader, matched sample collection, and the GAN is trained on that sample collection.
The invention provides a fuzzy speech semantic recognition method for artificial intelligence learning, which comprises the following steps:
step 1, acquiring a fuzzy speech signal input by a user, and extracting the high-dimensional characteristic quantity of the fuzzy speech signal;
step 2, determining the sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
step 3, constructing a reconstruction model with a GAN architecture for reconstructing the fuzzy speech into standard speech, and training the reconstruction model with the matched sample collection;
step 4, constructing a converter for converting the fuzzy speech fundamental frequency into the standard speech fundamental frequency;
step 5, inputting the spectral envelope characteristic quantity of the user's fuzzy speech signal into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech from the generator of the reconstruction model, and inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech;
step 6, synthesizing the reconstructed standard speech from the spectral envelope characteristic quantity and fundamental frequency obtained above;
and step 7, recognizing semantic information from the reconstructed standard speech.
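The seven steps above form a pipeline from fuzzy speech to semantic information. The following Python sketch shows only how the components connect; every function name and body is an illustrative placeholder (an assumption, not the patent's implementation):

```python
# Hypothetical end-to-end pipeline for the seven steps above.
# All function bodies are illustrative stand-ins, not the patent's implementation.

def extract_spectral_envelope(signal):      # step 1: MFCC-style features
    return [sum(signal) / len(signal)]      # placeholder feature vector

def match_sample_collection(features, collections):  # step 2
    # pick the collection whose representative feature is closest
    return min(collections, key=lambda c: abs(c["rep"] - features[0]))

def reconstruct_envelope(features, model):  # steps 3 and 5: trained GAN generator
    return [model["gain"] * f for f in features]

def convert_f0(f0, model):                  # steps 4 and 5: fundamental frequency
    return model["gain"] * f0

def synthesize(envelope, f0):               # step 6: e.g. a WORLD-style vocoder
    return {"envelope": envelope, "f0": f0}

def recognize(speech):                      # step 7: semantic recognition
    return "semantic info for f0=%.1f" % speech["f0"]

signal, f0 = [0.2, 0.4, 0.6], 110.0
collections = [{"rep": 0.1}, {"rep": 0.5}]
model = {"gain": 2.0}

feats = extract_spectral_envelope(signal)
coll = match_sample_collection(feats, collections)   # training-data source for the GAN
env = reconstruct_envelope(feats, model)
f0_std = convert_f0(f0, model)
print(recognize(synthesize(env, f0_std)))  # prints "semantic info for f0=220.0"
```

The point of the sketch is the data flow: the sample collection chosen in step 2 is what the GAN of step 3 is trained on, while steps 5 and 6 only consume the trained generator and converter.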
Preferably, a plurality of sample collections are established in step 2; each speech sample comprises a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples is within a preset similarity range. The spectral envelope characteristic quantity of the fuzzy speech signal extracted in step 1 is matched against the collection representative characteristic quantity of each sample collection, so as to select the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
It is further preferable that, in step 2, a sample collection has n speech samples, and the spectral envelope characteristic quantities of the corresponding fuzzy speech samples are $X_{1s}, X_{2s}, \ldots, X_{ns}$; each spectral envelope characteristic quantity is a d-dimensional feature vector, and together they form the feature quantity matrix of the sample collection, $X_S = \{X_{1s}, X_{2s}, \ldots, X_{ns}\}$. For the r-th of the d dimensions, the mean over the whole feature quantity matrix $X_S$ is computed and expressed as $\bar{x}_{s,r}$. Submatrices of $n_k$ characteristic quantities are then selected from $X_S$, submatrix k being denoted $X_S^k$, so that every $n_k$ feature vectors of $X_S$ form one submatrix, with c submatrices in total, i.e. $k = 1, 2, \ldots, c$. The mean of submatrix k in the r-th dimension is expressed as $\bar{x}_{s,r}^{\,k}$. The inter-class distance of the c submatrices is then calculated:

$$D_b = \sum_{k=1}^{c} \frac{n_k}{n} \sum_{r=1}^{d} \left( \bar{x}_{s,r}^{\,k} - \bar{x}_{s,r} \right)^2$$

and the intra-class distance of each of the c submatrices is calculated:

$$D_w^k = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{d} \left( x_{i,s,r}^{\,k} - \bar{x}_{s,r}^{\,k} \right)^2$$

where $x_{i,s,r}^{\,k}$ is the value in dimension r of the i-th feature vector of $X_S^k$.

The inter-class to intra-class ratio is then calculated for each of the c submatrices:

$$\sigma_k = D_b / D_w^k$$

and the submatrix with the highest ratio $\sigma_k$ is determined as the collection representative characteristic quantity of the sample collection.
Preferably, the reconstruction model of the GAN architecture in step 3 comprises a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of the standard speech from the spectral envelope characteristic quantity of the fuzzy speech input into it, and the discriminator judges the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
Preferably, the loss function $I_G(G)$ of the generator G in step 3 is expressed as:

$$I_G(G) = L_{adv}(G) + \lambda_c L_c(G) + \lambda_{id} L_{id}(G)$$

where $L_{adv}(G)$ represents the adversarial loss of the generator G, $L_c(G)$ represents the cycle-consistency loss of the generator G, $\lambda_c$ is the regularization parameter of the cycle-consistency loss, $L_{id}(G)$ represents the feature-mapping loss of the generator G, and $\lambda_{id}$ is the regularization parameter of the feature-mapping loss.
Preferably, the loss function of the discriminator D in step 3 is expressed as:

$$I_D(D) = -\mathbb{E}_{x_s \sim P(x_s)}\left[\log D(x_s)\right] - \mathbb{E}_{x_t \sim P(x_t)}\left[\log\left(1 - D(G(x_t))\right)\right]$$

where $D(x_s)$ represents the discrimination value of the discriminator D for the spectral envelope characteristic quantity $x_s$ of a standard speech sample in the input sample collection, and $\mathbb{E}_{x_s \sim P(x_s)}[\cdot]$ represents the expectation over the probability distribution of the standard speech samples; $D(G(x_t))$ represents the discrimination value of the discriminator D for the spectral envelope characteristic quantity generated by the generator G from the fuzzy speech feature $x_t$, and $\mathbb{E}_{x_t \sim P(x_t)}[\cdot]$ represents the expectation over the probability distribution of the fuzzy speech features $x_t$.
Preferably, the fundamental frequency conversion function constructed in step 4 is:

$$\log f_G = \mu_G + \frac{\sigma_G}{\sigma_t}\left(\log f_t - \mu_t\right)$$

where $\mu_G$ and $\sigma_G$ are the mean and standard deviation, in the logarithmic domain, of the fundamental frequency of the standard speech generated by the generator; $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the fuzzy speech fundamental frequency in the logarithmic domain; $f_t$ is the fundamental frequency of the fuzzy speech; and $f_G$ is the converted standard speech fundamental frequency.
Furthermore, the invention provides a fuzzy speech semantic recognition system for artificial intelligence learning, which comprises:
the fuzzy voice signal characteristic quantity extraction module is used for collecting a fuzzy voice signal input by a user and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
the sample selection matching module is used for determining a sample selection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
the GAN reconstruction model building and training module is used for building a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
the converter construction module is used for constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
the reconstruction conversion module is used for inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and for inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech;
and the standard voice synthesis module synthesizes and reconstructs the standard voice according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard voice.
And the semantic information recognition module is used for recognizing the semantic information by utilizing the reconstructed standard voice.
Preferably, the sample collection matching module has a plurality of sample collections; each speech sample comprises a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples is within a preset similarity range. Based on the spectral envelope characteristic quantity of the fuzzy speech signal, matching is performed against the collection representative characteristic quantity of each sample collection, so as to select the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
It is further preferred that a sample collection in the sample collection matching module has n speech samples, and the spectral envelope characteristic quantities of the corresponding fuzzy speech samples are $X_{1s}, X_{2s}, \ldots, X_{ns}$; each spectral envelope characteristic quantity is a d-dimensional feature vector, and together they form the feature quantity matrix of the sample collection, $X_S = \{X_{1s}, X_{2s}, \ldots, X_{ns}\}$. For the r-th of the d dimensions, the mean over the whole feature quantity matrix $X_S$ is computed and expressed as $\bar{x}_{s,r}$. Submatrices of $n_k$ characteristic quantities are then selected from $X_S$, submatrix k being denoted $X_S^k$, so that every $n_k$ feature vectors of $X_S$ form one submatrix, with c submatrices in total, i.e. $k = 1, 2, \ldots, c$. The mean of submatrix k in the r-th dimension is expressed as $\bar{x}_{s,r}^{\,k}$. The inter-class distance of the c submatrices is then calculated:

$$D_b = \sum_{k=1}^{c} \frac{n_k}{n} \sum_{r=1}^{d} \left( \bar{x}_{s,r}^{\,k} - \bar{x}_{s,r} \right)^2$$

and the intra-class distance of each of the c submatrices is calculated:

$$D_w^k = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{d} \left( x_{i,s,r}^{\,k} - \bar{x}_{s,r}^{\,k} \right)^2$$

where $x_{i,s,r}^{\,k}$ is the value in dimension r of the i-th feature vector of $X_S^k$.

The inter-class to intra-class ratio is then calculated for each of the c submatrices:

$$\sigma_k = D_b / D_w^k$$

and the submatrix with the highest ratio $\sigma_k$ is determined as the collection representative characteristic quantity of the sample collection.
Preferably, the GAN architecture reconstruction model constructed by the GAN reconstruction model construction and training module includes: a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of the standard voice according to the spectral envelope characteristic quantity of the fuzzy voice input into the generator; the discriminator is used for judging the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
Preferably, the loss function $I_G(G)$ of the generator G is expressed as:

$$I_G(G) = L_{adv}(G) + \lambda_c L_c(G) + \lambda_{id} L_{id}(G)$$

where $L_{adv}(G)$ represents the adversarial loss of the generator G, $L_c(G)$ represents the cycle-consistency loss of the generator G, $\lambda_c$ is the regularization parameter of the cycle-consistency loss, $L_{id}(G)$ represents the feature-mapping loss of the generator G, and $\lambda_{id}$ is the regularization parameter of the feature-mapping loss.
Preferably, the loss function of the discriminator D is expressed as:

$$I_D(D) = -\mathbb{E}_{x_s \sim P(x_s)}\left[\log D(x_s)\right] - \mathbb{E}_{x_t \sim P(x_t)}\left[\log\left(1 - D(G(x_t))\right)\right]$$

where $D(x_s)$ represents the discrimination value of the discriminator D for the spectral envelope characteristic quantity $x_s$ of a standard speech sample in the input sample collection, and $\mathbb{E}_{x_s \sim P(x_s)}[\cdot]$ represents the expectation over the probability distribution of the standard speech samples; $D(G(x_t))$ represents the discrimination value of the discriminator D for the spectral envelope characteristic quantity generated by the generator G from the fuzzy speech feature $x_t$, and $\mathbb{E}_{x_t \sim P(x_t)}[\cdot]$ represents the expectation over the probability distribution of the fuzzy speech features $x_t$.
Preferably, the fundamental frequency conversion function constructed by the converter construction module is:

$$\log f_G = \mu_G + \frac{\sigma_G}{\sigma_t}\left(\log f_t - \mu_t\right)$$

where $\mu_G$ and $\sigma_G$ are the mean and standard deviation, in the logarithmic domain, of the fundamental frequency of the standard speech generated by the generator; $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the fuzzy speech fundamental frequency in the logarithmic domain; $f_t$ is the fundamental frequency of the fuzzy speech; and $f_G$ is the converted standard speech fundamental frequency.
Therefore, the invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. Targeting the fuzzy speech present in user-dictated voice commands, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs semantic conversion and recognition on the reconstructed standard speech. Because the input fuzzy speech is mapped, via speech feature matching, to a broader matched sample collection, and the GAN is trained on that collection, the GAN training is sufficient and well adapted to the feature distribution of the current fuzzy speech. This improves the accuracy and reliability of the reconstructed standard speech and significantly raises the accuracy of speech-to-semantic recognition; experimental verification shows that the correct conversion rate can reach more than 95.6%.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flowchart of a fuzzy speech semantic recognition method for artificial intelligence learning according to an embodiment of the present application;
fig. 2 is a structural diagram of a fuzzy speech semantic recognition system for artificial intelligence learning according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in FIG. 1, the invention provides a fuzzy speech semantic recognition method for artificial intelligence learning, which comprises the following steps:
step 1, acquiring a fuzzy voice signal input by a user, and extracting high-dimensional characteristic quantity of the fuzzy voice signal.
The fuzzy speech semantic recognition method for artificial intelligence learning of the invention can be applied to the voice control functions of service facilities in smart communities, smart buildings and smart homes. A user speaks a voice command to the service facility, which collects the speech signal with components such as a microphone and performs the necessary front-end enhancement processing, such as filtering, noise suppression and time-spectrum estimation, as well as windowing and framing of the speech signal; these belong to the prior art and are not described in detail. If the processed speech signal is clear speech, semantic information is recognized and converted directly; this is not an improvement point of the invention and is not specifically described here. The invention focuses on the recognition processing of the fuzzy speech signal after acquisition and enhancement.
In this step, the high-dimensional characteristic quantity of the fuzzy speech signal is extracted; specifically, the high-dimensional characteristic quantity is the spectral envelope feature of each fuzzy speech signal frame. The extraction process is: perform a short-time FFT on each fuzzy speech signal frame to obtain the spectrum of the frame; pass the spectrum through a mel filter bank to obtain the mel spectrum; then take the logarithm of the mel spectrum and apply the DCT (discrete cosine transform) to obtain the MFCC coefficients; finally, retain 12 to 16 MFCC coefficients as the spectral envelope characteristic quantity X_t of the fuzzy speech signal frame.
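The extraction chain just described (windowed frame, FFT, mel filter bank, logarithm, DCT, truncation) can be sketched in a few lines of numpy. This is a minimal illustration, not the patent's implementation; the sample rate, frame size, filter count and the choice of 13 coefficients are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def frame_mfcc(frame, sr=16000, n_mfcc=13, n_filters=26):
    # short-time FFT -> power spectrum of one windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    mel_spec = mel_filterbank(n_filters, len(frame), sr) @ spec
    log_mel = np.log(mel_spec + 1e-10)
    # DCT-II of the log-mel energies; keep the first n_mfcc coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n + 0.5)[None, :] * np.arange(n_mfcc)[:, None])
    return dct @ log_mel  # spectral envelope characteristic quantity X_t

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # one 512-sample frame
x_t = frame_mfcc(frame)
print(x_t.shape)  # (13,)
```

In practice this would run per frame over the windowed, framed signal produced by the front-end processing described above.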
And 2, determining a sample selection set matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal.
In the subsequent steps, a GAN-based clear speech reconstruction model needs to be trained with a sample set of a certain capacity. Fuzzy speech presents rich diversity; if generic fuzzy speech samples were used, they would often be insufficiently representative and GAN training would be inadequate. The invention therefore establishes a plurality of sample collections, where each sample collection can contain about 1000 segments of speech samples, each speech sample comprises a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples lies within a preset similarity range. In this step, the spectral envelope characteristic quantity of the fuzzy speech signal extracted in step 1 is matched against the collection representative characteristic quantity of each sample collection, and the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal is selected.
For a sample collection, assume that there are n segments of speech samples, and the spectral envelope characteristic quantities of the corresponding fuzzy speech samples are $X_{1s}, X_{2s}, \ldots, X_{ns}$; each spectral envelope characteristic quantity is a d-dimensional feature vector, and together they form the feature quantity matrix of the sample collection, $X_S = \{X_{1s}, X_{2s}, \ldots, X_{ns}\}$. For the r-th of the d dimensions, the mean over the whole feature quantity matrix $X_S$ is computed and expressed as $\bar{x}_{s,r}$. Submatrices of $n_k$ characteristic quantities are then selected from $X_S$, submatrix k being denoted $X_S^k$, so that every $n_k$ feature vectors of $X_S$ form one submatrix, with c submatrices in total, i.e. $k = 1, 2, \ldots, c$. The mean of submatrix k in the r-th dimension is expressed as $\bar{x}_{s,r}^{\,k}$. The inter-class distance of the c submatrices is then calculated:

$$D_b = \sum_{k=1}^{c} \frac{n_k}{n} \sum_{r=1}^{d} \left( \bar{x}_{s,r}^{\,k} - \bar{x}_{s,r} \right)^2$$

and the intra-class distance of each of the c submatrices is calculated:

$$D_w^k = \frac{1}{n_k} \sum_{i=1}^{n_k} \sum_{r=1}^{d} \left( x_{i,s,r}^{\,k} - \bar{x}_{s,r}^{\,k} \right)^2$$

where $x_{i,s,r}^{\,k}$ is the value in dimension r of the i-th feature vector of $X_S^k$.

The inter-class to intra-class ratio is then calculated for each of the c submatrices:

$$\sigma_k = D_b / D_w^k$$

and the submatrix with the highest ratio $\sigma_k$ is determined as the collection representative characteristic quantity of the sample collection.
Matching the spectral envelope characteristic quantity of the fuzzy speech signal against the collection representative characteristic quantity of each sample collection means computing the average vector distance between the spectral envelope characteristic quantity of the fuzzy speech signal and the characteristic quantities in the submatrix serving as the collection representative characteristic quantity, and selecting the sample collection with the minimum average vector distance; the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal is thereby selected.
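A numpy sketch of one plausible reading of the selection and matching rules above (the patent's original distance formulas are not reproduced in this text, so the exact expressions used here are assumptions):

```python
import numpy as np

def representative_submatrix(X_s, n_k):
    """Split the n x d feature matrix X_s into submatrices of n_k rows,
    then pick the one with the highest inter/intra-class ratio sigma."""
    n, d = X_s.shape
    global_mean = X_s.mean(axis=0)  # mean per dimension r over the whole matrix
    subs = [X_s[i:i + n_k] for i in range(0, n - n_k + 1, n_k)]
    # inter-class distance over all c submatrices (a single value, assumed form)
    D_b = sum(len(S) / n * np.sum((S.mean(axis=0) - global_mean) ** 2) for S in subs)
    best, best_sigma = None, -np.inf
    for S in subs:
        # intra-class distance of this submatrix (assumed form)
        D_w = np.mean(np.sum((S - S.mean(axis=0)) ** 2, axis=1))
        sigma = D_b / D_w if D_w > 0 else np.inf
        if sigma > best_sigma:
            best, best_sigma = S, sigma
    return best

def match_collection(x_t, representatives):
    """Pick the collection whose representative submatrix has the smallest
    average vector distance to the fuzzy-speech feature x_t."""
    dists = [np.mean(np.linalg.norm(R - x_t, axis=1)) for R in representatives]
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
X_s = rng.normal(size=(12, 4))          # 12 samples, 4-dimensional features
rep = representative_submatrix(X_s, 3)  # submatrices of n_k = 3 rows
print(rep.shape)  # (3, 4)
```

The highest-sigma submatrix is the most compact one relative to the between-group spread, which is one way to read "collection representative characteristic quantity".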
And 3, constructing a reconstruction model of the GAN architecture for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection.
The reconstruction model of the GAN architecture comprises a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of the standard speech from the spectral envelope characteristic quantity of the fuzzy speech input into it, and the discriminator judges the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
The generator adopts a two-dimensional convolutional neural network composed of an encoding network and a decoding network. The encoding network comprises 5 convolutional layers and the decoding network comprises 5 deconvolution layers; residual (ResNet) connections are established between the encoding and decoding networks, and normalization is applied after each convolutional layer. The discriminator uses a two-dimensional convolutional neural network comprising 5 convolutional layers, with normalization after each convolutional layer.
In the training process, for the fuzzy speech samples in the sample collection, the spectral envelope characteristic quantity X_t of the fuzzy speech samples is input into the generator, and the generator is trained to minimize its loss function; the generator outputs the spectral envelope characteristic quantity of the reconstructed standard speech.
The loss function $I_G(G)$ of the generator G is expressed as:

$$I_G(G) = L_{adv}(G) + \lambda_c L_c(G) + \lambda_{id} L_{id}(G)$$

where $L_{adv}(G)$ represents the adversarial loss of the generator G, $L_c(G)$ represents the cycle-consistency loss of the generator G, $\lambda_c$ is the regularization parameter of the cycle-consistency loss, $L_{id}(G)$ represents the feature-mapping loss of the generator G, and $\lambda_{id}$ is the regularization parameter of the feature-mapping loss.
In the training process, the spectral envelope characteristic quantity of the reconstructed standard speech and the spectral envelope characteristic quantities of the standard speech samples in the sample collection are input into the discriminator, and the discriminator is trained to minimize its loss function.
The loss function of the discriminator D is expressed as:

$$I_D(D) = -\mathbb{E}_{x_s \sim P(x_s)}\left[\log D(x_s)\right] - \mathbb{E}_{x_t \sim P(x_t)}\left[\log\left(1 - D(G(x_t))\right)\right]$$

where $D(x_s)$ represents the discrimination value of the discriminator D for the spectral envelope characteristic quantity $x_s$ of a standard speech sample in the input sample collection, and $\mathbb{E}_{x_s \sim P(x_s)}[\cdot]$ represents the expectation over the probability distribution of the standard speech samples; $D(G(x_t))$ represents the discrimination value of the discriminator D for the spectral envelope characteristic quantity generated by the generator G from the fuzzy speech feature $x_t$, and $\mathbb{E}_{x_t \sim P(x_t)}[\cdot]$ represents the expectation over the probability distribution of the fuzzy speech features $x_t$.
Through the above training, the loss functions of the generator and the discriminator are minimized, and after a preset number of iterations the trained GAN-architecture reconstruction model for reconstructing fuzzy speech into standard speech is obtained.
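The two loss functions can be illustrated numerically. The expressions below follow the reconstructed forms given above; the regularization weights and score values are illustrative assumptions, not values from the patent:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # -E[log D(x_s)] - E[log(1 - D(G(x_t)))], minimized by the discriminator
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake, cycle_err, id_err, lam_c=10.0, lam_id=5.0):
    # adversarial term plus regularized cycle-consistency and feature-mapping terms;
    # lam_c and lam_id are assumed regularization parameters
    l_adv = -np.mean(np.log(d_fake))  # the generator wants D(G(x_t)) -> 1
    return l_adv + lam_c * cycle_err + lam_id * id_err

d_real = np.array([0.9, 0.8])  # discriminator scores on standard-speech features
d_fake = np.array([0.2, 0.3])  # discriminator scores on generator reconstructions
print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake, cycle_err=0.05, id_err=0.02))
```

Note the opposing objectives: as the generator's reconstructions improve, `d_fake` rises, which lowers the generator loss and raises the discriminator loss, driving the adversarial training toward equilibrium.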
Step 4, constructing a converter for converting the fuzzy speech fundamental frequency into the standard speech fundamental frequency; the fundamental frequency conversion function is:

$$\log f_G = \mu_G + \frac{\sigma_G}{\sigma_t}\left(\log f_t - \mu_t\right)$$

where $\mu_G$ and $\sigma_G$ are the mean and standard deviation, in the logarithmic domain, of the fundamental frequency of the standard speech generated by the generator; $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the fuzzy speech fundamental frequency in the logarithmic domain; $f_t$ is the fundamental frequency of the fuzzy speech; and $f_G$ is the converted standard speech fundamental frequency.
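This log-domain Gaussian normalization can be sketched directly; the statistics below are illustrative values, not measurements from the patent:

```python
import numpy as np

def convert_f0(f_t, mu_t, sigma_t, mu_g, sigma_g):
    """Map a fuzzy-speech fundamental frequency onto the standard-speech
    log-F0 distribution: log f_G = mu_G + (sigma_G / sigma_t) * (log f_t - mu_t)."""
    return np.exp(mu_g + (sigma_g / sigma_t) * (np.log(f_t) - mu_t))

# Illustrative log-domain statistics of the two fundamental-frequency distributions
mu_t, sigma_t = np.log(110.0), 0.20  # fuzzy speech
mu_g, sigma_g = np.log(220.0), 0.25  # standard speech

print(round(float(convert_f0(110.0, mu_t, sigma_t, mu_g, sigma_g)), 1))  # prints 220.0
```

By construction, the mean of the fuzzy distribution maps exactly onto the mean of the standard distribution, and deviations around the mean are rescaled by the ratio of the standard deviations.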
Step 5, inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech;
step 6, synthesizing and reconstructing the standard voice according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard voice; specifically, the spectral envelope characteristic quantity and fundamental frequency of the reconstructed standard speech can be substituted into the existing speech synthesizer, such as the WORLD speech synthesizer, to obtain the synthesized reconstructed standard speech
And 7, recognizing semantic information by using the reconstructed standard voice.
Furthermore, as shown in fig. 2, the present invention provides a fuzzy speech semantic recognition system for artificial intelligence learning, comprising:
and the fuzzy voice signal characteristic quantity extraction module is used for collecting the fuzzy voice signal input by the user and extracting the high-dimensional characteristic quantity of the fuzzy voice signal.
The fuzzy speech semantic recognition system for artificial intelligence learning of the invention can be applied to the voice control functions of service facilities in smart communities, smart buildings and smart homes. A user speaks a voice command to the service facility, which collects the speech signal with components such as a microphone and performs the necessary front-end enhancement processing, such as filtering, noise suppression and time-spectrum estimation, as well as windowing and framing of the speech signal; these belong to the prior art and are not described in detail. If the processed speech signal is clear speech, semantic information is recognized and converted directly; this is not an improvement point of the invention and is not specifically described here.
The fuzzy speech signal characteristic quantity extraction module extracts the high-dimensional characteristic quantity of the fuzzy speech signal; specifically, the high-dimensional characteristic quantity is the spectral envelope feature of each fuzzy speech signal frame. The extraction process is: perform a short-time FFT on each fuzzy speech signal frame to obtain the spectrum of the frame; pass the spectrum through a mel filter bank to obtain the mel spectrum; then take the logarithm of the mel spectrum and apply the DCT (discrete cosine transform) to obtain the MFCC coefficients; finally, retain 12 to 16 MFCC coefficients as the spectral envelope characteristic quantity X_t of the fuzzy speech signal frame.
The sample collection matching module is used for determining the sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal.
The invention establishes a plurality of sample collections. Each sample collection may contain about 1,000 speech samples, each speech sample comprising a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples within a collection lies within a preset similarity range. The sample collections may be stored in a sample library of the sample collection matching module.
For the extracted spectral envelope characteristic quantity of the fuzzy speech signal, the sample collection matching module matches it against the collection representative characteristic quantity of each sample collection, thereby selecting the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
For a sample collection with n speech samples, the sample collection matching module takes the spectral envelope characteristic quantity of the fuzzy speech sample corresponding to each speech sample, X_1s, X_2s, ..., X_ns, each a d-dimensional feature vector, and forms the characteristic quantity matrix of the sample collection X_S = {X_1s, X_2s, ..., X_ns}. For the r-th of the d dimensions, the mean over the whole characteristic quantity matrix X_S is denoted x̄_{s,r}. Sub-matrices are then selected from X_S: every n_k feature vectors of X_S form one sub-matrix, denoted X_S^k, giving c sub-matrices in total, i.e. k = 1, 2, ..., c. The mean of sub-matrix k in dimension r is denoted x̄_{s,r}^k. The inter-class distance of the c sub-matrices is then calculated:

D_b = Σ_{r=1}^{d} Σ_{k=1}^{c} (n_k / n) · (x̄_{s,r}^k − x̄_{s,r})²

and the intra-class distance of each of the c sub-matrices is calculated:

D_w = Σ_{r=1}^{d} (1 / n_k) · Σ (x_{s,r}^k − x̄_{s,r}^k)²

where x_{s,r}^k is the value in dimension r of each feature vector in X_S^k.

The inter-class to intra-class ratio is then calculated for each of the c sub-matrices:

σ = D_b / D_w

and the sub-matrix with the highest ratio σ is determined to be the collection representative characteristic quantity of the sample collection.
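The selection of the collection representative characteristic quantity can be sketched as follows. The partition of the feature matrix into sub-matrices (the index lists in `splits`) is supplied by the caller, and the small epsilon guarding the division is an implementation assumption.

```python
import numpy as np

def representative_submatrix(X_s, splits):
    """Pick the sub-matrix with the highest inter-class / intra-class
    distance ratio sigma = D_b / D_w.

    X_s    : (n, d) feature matrix of the collection's fuzzy speech samples
    splits : list of index arrays, one per sub-matrix k (k = 1..c)
    Returns (index of best sub-matrix, its sigma value).
    """
    n, _ = X_s.shape
    grand_mean = X_s.mean(axis=0)              # per-dimension mean over all n samples
    # Inter-class distance D_b over all c sub-matrices (shared by every candidate)
    D_b = sum(len(idx) / n * np.sum((X_s[idx].mean(axis=0) - grand_mean) ** 2)
              for idx in splits)
    best_k, best_sigma = None, -np.inf
    for k, idx in enumerate(splits):
        sub = X_s[idx]
        # Intra-class distance D_w of sub-matrix k (summed over all d dimensions)
        D_w = np.sum((sub - sub.mean(axis=0)) ** 2) / len(idx)
        sigma = D_b / (D_w + 1e-12)            # epsilon avoids division by zero
        if sigma > best_sigma:
            best_k, best_sigma = k, sigma
    return best_k, best_sigma
```

The tighter a sub-matrix clusters around its own mean, the smaller its D_w and the larger its σ, so the most compact, representative group of feature vectors wins.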
The sample collection matching module matches the spectral envelope characteristic quantity of the fuzzy speech signal against the collection representative characteristic quantity of each sample collection: it calculates the average vector distance between the spectral envelope characteristic quantity of the fuzzy speech signal and the characteristic quantities in the sub-matrix serving as the collection representative characteristic quantity, and selects the sample collection with the minimum average vector distance as the one matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
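Matching an input utterance to a sample collection by minimum average vector distance can then be sketched as:

```python
import numpy as np

def match_collection(X_t, representatives):
    """Return the index of the sample collection whose representative
    sub-matrix has the smallest average vector distance to the input.

    X_t             : (m, d) spectral-envelope features of the input frames
    representatives : list of (n_k, d) representative sub-matrices,
                      one per sample collection
    """
    def avg_distance(rep):
        # mean Euclidean distance between every input frame and every
        # representative feature vector
        diffs = X_t[:, None, :] - rep[None, :, :]     # shape (m, n_k, d)
        return np.linalg.norm(diffs, axis=-1).mean()
    return int(np.argmin([avg_distance(rep) for rep in representatives]))
```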
The GAN reconstruction model building and training module is used for building a reconstruction model with a GAN architecture that reconstructs fuzzy speech into standard speech, and for training the reconstruction model with the matched sample collection.
The reconstruction model of the GAN architecture comprises a generator and a discriminator. The generator reconstructs the spectral envelope characteristic quantity of standard speech from the spectral envelope characteristic quantity of the fuzzy speech input to it; the discriminator judges the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
The generator adopts a two-dimensional convolutional neural network consisting of an encoding network and a decoding network. The encoding network comprises 5 convolutional layers and the decoding network comprises 5 deconvolution layers; ResNet-style residual connections are established between the encoding network and the decoding network, and normalization is applied after each convolutional layer. The discriminator is a two-dimensional convolutional neural network comprising 5 convolutional layers, with normalization after each convolutional layer.
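The text fixes only the layer counts and the residual/normalization structure; kernel sizes, strides, channel widths and the normalization type in the framework-neutral configuration sketch below are assumptions for illustration.

```python
# Illustrative layer configuration for the GAN reconstruction model.
# Only the 5+5 layer counts, the encoder-decoder residual links and the
# per-layer normalization come from the text; everything else is assumed.
GENERATOR = {
    "encoder": [   # 5 two-dimensional convolutional layers, each normalized
        {"conv2d": {"out_channels": c, "kernel": 3, "stride": 2},
         "norm": "instance"}
        for c in (32, 64, 128, 256, 512)
    ],
    "residual_connections": True,   # ResNet-style encoder-to-decoder links
    "decoder": [   # 5 deconvolution (transposed conv) layers, each normalized
        {"deconv2d": {"out_channels": c, "kernel": 3, "stride": 2},
         "norm": "instance"}
        for c in (256, 128, 64, 32, 1)
    ],
}
DISCRIMINATOR = {
    "layers": [    # 5 two-dimensional convolutional layers, each normalized
        {"conv2d": {"out_channels": c, "kernel": 3, "stride": 2},
         "norm": "instance"}
        for c in (32, 64, 128, 256, 1)
    ],
}
```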
In the training process, for the fuzzy speech samples in the sample collection, the spectral envelope characteristic quantity X_t of each fuzzy speech sample is input to the generator, and the generator is trained to minimize its loss function; the generator outputs the spectral envelope characteristic quantity of the reconstructed standard speech.
The loss function L_G(G) of generator G is expressed as:

L_G(G) = L_adv(G) + λ_c · L_c(G) + λ_id · L_id(G)

where L_adv(G) is the adversarial loss of generator G, L_c(G) is the cycle-consistency loss of generator G, λ_c is the regularization parameter of the cycle-consistency loss, L_id(G) is the feature-mapping loss of generator G, and λ_id is the regularization parameter of the feature-mapping loss.
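A minimal numerical sketch of this composite generator loss, assuming L1 norms for the cycle-consistency and feature-mapping terms and illustrative λ values (the text fixes neither choice):

```python
import numpy as np

def generator_loss(d_fake, x_t, x_cyc, x_s, x_id, lam_c=10.0, lam_id=5.0):
    """L_G = L_adv + lam_c * L_c + lam_id * L_id  (lambda values assumed).

    d_fake : discriminator scores in (0, 1) for generated envelopes
    x_t    : input fuzzy-speech envelope features
    x_cyc  : features after a full fuzzy -> standard -> fuzzy cycle
    x_s    : standard-speech envelope features
    x_id   : generator output when fed the standard-speech features x_s
    """
    l_adv = -np.mean(np.log(d_fake + 1e-12))   # push D's score toward "real"
    l_cyc = np.mean(np.abs(x_cyc - x_t))       # cycle-consistency loss (L1)
    l_id = np.mean(np.abs(x_id - x_s))         # feature-mapping/identity loss (L1)
    return l_adv + lam_c * l_cyc + lam_id * l_id
```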
In the training process, the spectral envelope characteristic quantity of the reconstructed standard speech and the spectral envelope characteristic quantity of the standard speech samples in the sample collection are input to the discriminator, and the discriminator is trained to minimize its loss function.
The loss function of discriminator D is expressed as:

L_D(D) = −E_{x_s∼P(x_s)}[log D(x_s)] − E_{x_t∼P(x_t)}[log(1 − D(G(x_t)))]

where D(x_s) is the discrimination value assigned by discriminator D to the spectral envelope characteristic quantity of a standard speech sample from the sample collection, and E_{x_s∼P(x_s)} denotes the expectation over the probability distribution of the standard speech samples; D(G(x_t)) is the discrimination value assigned by discriminator D to the spectral envelope characteristic quantity of the standard speech generated by generator G from the fuzzy speech characteristic quantity x_t, and E_{x_t∼P(x_t)} denotes the expectation over the probability distribution of the fuzzy speech characteristic quantity x_t.
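This discriminator loss can be evaluated numerically as the standard binary cross-entropy over real and generated scores; the epsilon added for numerical stability is an implementation detail:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """L_D = -E[log D(x_s)] - E[log(1 - D(G(x_t)))]  (standard GAN form).

    d_real : D's scores in (0, 1) for real standard-speech envelopes
    d_fake : D's scores in (0, 1) for generator-reconstructed envelopes
    """
    return (-np.mean(np.log(d_real + 1e-12))          # reward real samples
            - np.mean(np.log(1.0 - d_fake + 1e-12)))  # penalize generated ones
```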
Through this training, the loss functions of the generator and the discriminator are minimized, and after a preset number of iterations the trained reconstruction model of the GAN architecture for reconstructing fuzzy speech into standard speech is obtained.
The converter construction module is used for constructing a converter that converts the fuzzy speech fundamental frequency into the standard speech fundamental frequency. The fundamental frequency transfer function is:

log f_G = (σ_G / σ_t) · (log f_t − μ_t) + μ_G

where μ_G and σ_G are the mean and standard deviation, in the log domain, of the fundamental frequency of the standard speech generated by the generator; μ_t and σ_t are the mean and standard deviation, in the log domain, of the fundamental frequency of the fuzzy speech; f_t is the fuzzy speech fundamental frequency; and f_G is the converted standard speech fundamental frequency.
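A sketch of this log-domain linear transfer function, interpreting σ as the log-domain standard deviation:

```python
import numpy as np

def convert_f0(f_t, mu_t, sigma_t, mu_g, sigma_g):
    """Log-domain linear fundamental-frequency conversion:
        log f_G = (sigma_G / sigma_t) * (log f_t - mu_t) + mu_G
    mu/sigma are the log-F0 mean and standard deviation of the fuzzy (t)
    and generated standard (G) speech; f_t is the fuzzy-speech F0 in Hz.
    """
    return np.exp((sigma_g / sigma_t) * (np.log(f_t) - mu_t) + mu_g)
```

For example, an input frame at the fuzzy-speech mean pitch is mapped exactly to the standard-speech mean pitch, and deviations are rescaled by the ratio of the two standard deviations.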
The reconstruction conversion module is used for inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and for inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech.
The standard speech synthesis module synthesizes the reconstructed standard speech from the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard speech.
The semantic information recognition module is used for recognizing semantic information from the reconstructed standard speech.
In summary, the invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. For the fuzzy speech present in a user's spoken voice instruction, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs the conversion and recognition of semantic information on the basis of the standard speech. By matching speech features, the input fuzzy speech is associated with a sample collection covering a wider range, and the GAN network is trained with that sample collection, so that the training is sufficient and well adapted to the feature distribution of the current fuzzy speech. This improves the accuracy and reliability of reconstructing the standard speech and markedly raises the accuracy of speech-to-semantic recognition; experimental verification shows a correct conversion rate above 95.6%.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
Claims (10)
1. A fuzzy speech semantic recognition method for artificial intelligence learning comprises the following steps:
step 1, acquiring a fuzzy voice signal input by a user, and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
step 2, determining a sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
step 3, constructing a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
step 4, constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
step 5, inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech;
step 6, synthesizing the reconstructed standard speech according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard speech;
step 7, recognizing semantic information by using the reconstructed standard speech.
2. The fuzzy speech semantic recognition method according to claim 1, wherein a plurality of sample collections are established in step 2, each sample collection containing a plurality of speech samples; each speech sample comprises a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples is within a preset similarity range; and the spectral envelope characteristic quantity of the fuzzy speech signal extracted in step 1 is matched against the collection representative characteristic quantity of each sample collection, thereby selecting the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
3. The fuzzy speech semantic recognition method according to claim 2, wherein in step 2 the sample collection has n speech samples, and the spectral envelope characteristic quantity of the fuzzy speech sample corresponding to each speech sample is X_1s, X_2s, ..., X_ns, each a d-dimensional feature vector, forming the characteristic quantity matrix of the sample collection X_S = {X_1s, X_2s, ..., X_ns}; for the r-th of the d dimensions, the mean over the whole characteristic quantity matrix X_S is denoted x̄_{s,r}; every n_k feature vectors of X_S form one sub-matrix, denoted X_S^k, giving c sub-matrices in total, i.e. k = 1, 2, ..., c; the mean of sub-matrix k in dimension r is denoted x̄_{s,r}^k; the inter-class distance of the c sub-matrices is then calculated:

D_b = Σ_{r=1}^{d} Σ_{k=1}^{c} (n_k / n) · (x̄_{s,r}^k − x̄_{s,r})²

and the intra-class distance of each of the c sub-matrices is calculated:

D_w = Σ_{r=1}^{d} (1 / n_k) · Σ (x_{s,r}^k − x̄_{s,r}^k)²

where x_{s,r}^k is the value in dimension r of each feature vector in X_S^k;

the inter-class to intra-class ratio of each of the c sub-matrices is calculated:

σ = D_b / D_w

and the sub-matrix with the highest ratio σ is determined to be the collection representative characteristic quantity of the sample collection.
4. The fuzzy speech semantic recognition method according to claim 1, wherein the reconstruction model of the GAN architecture in step 3 comprises: a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of standard speech from the spectral envelope characteristic quantity of the fuzzy speech input to it; the discriminator judges the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
5. The fuzzy speech semantic recognition method according to claim 4, wherein the loss function L_G(G) of the generator G in step 3 is expressed as:

L_G(G) = L_adv(G) + λ_c · L_c(G) + λ_id · L_id(G)

where L_adv(G) is the adversarial loss of generator G, L_c(G) is the cycle-consistency loss of generator G, λ_c is the regularization parameter of the cycle-consistency loss, L_id(G) is the feature-mapping loss of generator G, and λ_id is the regularization parameter of the feature-mapping loss.
6. The fuzzy speech semantic recognition method according to claim 4, wherein the loss function of the discriminator D in step 3 is expressed as:

L_D(D) = −E_{x_s∼P(x_s)}[log D(x_s)] − E_{x_t∼P(x_t)}[log(1 − D(G(x_t)))]

where D(x_s) is the discrimination value assigned by discriminator D to the spectral envelope characteristic quantity of a standard speech sample from the sample collection, and E_{x_s∼P(x_s)} denotes the expectation over the probability distribution of the standard speech samples; D(G(x_t)) is the discrimination value assigned by discriminator D to the spectral envelope characteristic quantity of the standard speech generated by generator G from the fuzzy speech characteristic quantity x_t, and E_{x_t∼P(x_t)} denotes the expectation over the probability distribution of the fuzzy speech characteristic quantity x_t.
7. The fuzzy speech semantic recognition method according to claim 1, wherein the fundamental frequency transfer function constructed in step 4 is:

log f_G = (σ_G / σ_t) · (log f_t − μ_t) + μ_G

where μ_G and σ_G are the mean and standard deviation, in the log domain, of the fundamental frequency of the standard speech generated by the generator; μ_t and σ_t are the mean and standard deviation, in the log domain, of the fundamental frequency of the fuzzy speech; f_t is the fuzzy speech fundamental frequency; and f_G is the converted standard speech fundamental frequency.
8. A fuzzy speech semantic recognition system for artificial intelligence learning, comprising:
the fuzzy voice signal characteristic quantity extraction module is used for collecting a fuzzy voice signal input by a user and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
the sample collection matching module is used for determining the sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
the GAN reconstruction model building and training module is used for building a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
the converter construction module is used for constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
the reconstruction conversion module is used for inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and for inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech;
the standard speech synthesis module is used for synthesizing the reconstructed standard speech according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard speech; and
the semantic information recognition module is used for recognizing semantic information by using the reconstructed standard speech.
9. The system according to claim 8, wherein the sample collection matching module has a plurality of sample collections, each containing a plurality of speech samples; each speech sample comprises a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples is within a preset similarity range; and the spectral envelope characteristic quantity of the fuzzy speech signal is matched against the collection representative characteristic quantity of each sample collection, thereby selecting the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
10. The system according to claim 9, wherein the sample collection in the sample collection matching module has n speech samples, and the spectral envelope characteristic quantity of the fuzzy speech sample corresponding to each speech sample is X_1s, X_2s, ..., X_ns, each a d-dimensional feature vector, forming the characteristic quantity matrix of the sample collection X_S = {X_1s, X_2s, ..., X_ns}; for the r-th of the d dimensions, the mean over the whole characteristic quantity matrix X_S is denoted x̄_{s,r}; every n_k feature vectors of X_S form one sub-matrix, denoted X_S^k, giving c sub-matrices in total, i.e. k = 1, 2, ..., c; the mean of sub-matrix k in dimension r is denoted x̄_{s,r}^k; the inter-class distance of the c sub-matrices is then calculated:

D_b = Σ_{r=1}^{d} Σ_{k=1}^{c} (n_k / n) · (x̄_{s,r}^k − x̄_{s,r})²

and the intra-class distance of each of the c sub-matrices is calculated:

D_w = Σ_{r=1}^{d} (1 / n_k) · Σ (x_{s,r}^k − x̄_{s,r}^k)²

where x_{s,r}^k is the value in dimension r of each feature vector in X_S^k;

the inter-class to intra-class ratio of each of the c sub-matrices is calculated:

σ = D_b / D_w

and the sub-matrix with the highest ratio σ is determined to be the collection representative characteristic quantity of the sample collection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910713034.8A CN110600012B (en) | 2019-08-02 | 2019-08-02 | Fuzzy speech semantic recognition method and system for artificial intelligence learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110600012A true CN110600012A (en) | 2019-12-20 |
CN110600012B CN110600012B (en) | 2020-12-04 |
Family
ID=68853447
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910713034.8A Active CN110600012B (en) | 2019-08-02 | 2019-08-02 | Fuzzy speech semantic recognition method and system for artificial intelligence learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110600012B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113053360A (en) * | 2021-03-09 | 2021-06-29 | 南京师范大学 | High-precision software recognition method based on voice |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090271002A1 (en) * | 2008-04-29 | 2009-10-29 | David Asofsky | System and Method for Remotely Controlling Electronic Devices |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
US20130191129A1 (en) * | 2012-01-19 | 2013-07-25 | International Business Machines Corporation | Information Processing Device, Large Vocabulary Continuous Speech Recognition Method, and Program |
CN106448684A (en) * | 2016-11-16 | 2017-02-22 | 北京大学深圳研究生院 | Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system |
CN107945805A (en) * | 2017-12-19 | 2018-04-20 | 程海波 | A kind of intelligent across language voice identification method for transformation |
CN108766409A (en) * | 2018-05-25 | 2018-11-06 | 中国传媒大学 | A kind of opera synthetic method, device and computer readable storage medium |
US20190013012A1 (en) * | 2017-07-04 | 2019-01-10 | Minds Lab., Inc. | System and method for learning sentences |
CN109326283A (en) * | 2018-11-23 | 2019-02-12 | 南京邮电大学 | Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder |
CN109599091A (en) * | 2019-01-14 | 2019-04-09 | 南京邮电大学 | Multi-to-multi voice conversion method based on STARWGAN-GP and x vector |
CN110060691A (en) * | 2019-04-16 | 2019-07-26 | 南京邮电大学 | Multi-to-multi phonetics transfer method based on i vector sum VARSGAN |
Non-Patent Citations (2)
Title |
---|
Fan Zizhu: "Research on New Feature Extraction Algorithms", 31 December 2016 *
Han Zhiyan: "Research on Multimodal Emotion Recognition for Speech and Facial Expression Signals", 31 January 2017, Northeastern University Press *
Also Published As
Publication number | Publication date |
---|---|
CN110600012B (en) | 2020-12-04 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 20200914; Address after: 200232 floor 18, building 2, No. 277, Longlan Road, Xuhui District, Shanghai; Applicant after: LIGHT CONTROLS TESILIAN (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.; Address before: 100027 West Tower 11 floor, Kai Hao building, 8 Xinyuan South Road, Chaoyang District, Beijing; Applicant before: Terminus(Beijing) Technology Co.,Ltd.; Applicant before: LIGHT CONTROLS TESILIAN (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd. |
| GR01 | Patent grant | |