CN110600012A - Fuzzy speech semantic recognition method and system for artificial intelligence learning - Google Patents

Fuzzy speech semantic recognition method and system for artificial intelligence learning

Info

Publication number
CN110600012A
CN110600012A (application CN201910713034.8A); granted publication CN110600012B
Authority
CN
China
Prior art keywords
speech
fuzzy
characteristic quantity
sample
spectral envelope
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910713034.8A
Other languages
Chinese (zh)
Other versions
CN110600012B (en)
Inventor
孙斌
李东晓
Current Assignee
LIGHT CONTROLS TESILIAN (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
Optical Control Teslian (Shanghai) Information Technology Co Ltd
Terminus Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Optical Control Teslian (Shanghai) Information Technology Co Ltd and Terminus Beijing Technology Co Ltd
Priority: CN201910713034.8A
Publication of application: CN110600012A
Application granted
Publication of granted patent: CN110600012B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Abstract

The invention provides a fuzzy speech semantic recognition method and system based on artificial intelligence learning. For the fuzzy speech present in user-dictated voice commands, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs semantic conversion and recognition on the standard speech. During training of the GAN network, speech feature matching maps the input fuzzy speech to a broader sample collection, and that collection is used to train the GAN network.

Description

Fuzzy speech semantic recognition method and system for artificial intelligence learning
Technical Field
The application relates to the field of artificial intelligence control, in particular to a fuzzy speech semantic recognition method and system for artificial intelligence learning.
Background
As speech recognition and semantic conversion technologies mature, people increasingly control service facilities with voice commands, and such services see growing use in smart buildings, smart communities and smart homes.
For example, people can dictate voice commands to control various services in a smart building, smart community or smart home: a command of "please rise to the 15th floor" to a smart-building elevator; commands such as "please call room XXX", "please open the door, the door-opening password is xxxxxxxx" or "please lock the door" beside the access control system of a smart community; or commands such as "please turn on the air conditioner" or "please turn on the ceiling light" near the central control panel of a smart home. The service facility collects the voice command signal and, after the necessary enhancement processing, converts and recognizes it into semantic information; a control instruction in machine-code form is then generated from the semantic information in natural language text form, and the facility executes the required work according to that instruction. Compared with manual operation of buttons or push-button control panels, voice commands give users more convenience and greater freedom. In particular, for users who lack the use of both hands or are blind, or when a control panel cannot be touched because of environmental obstacles or distance, voice commands enhance the convenience and accessibility of smart buildings, smart communities and smart homes.
However, current conversion of voice commands into semantic information, i.e., recognition of natural language text from the acoustic signal, carries a high probability of mis-conversion. Clear speech can be recognized well, but correct semantic conversion is especially difficult for fuzzy speech. Fuzzy speech arises from attenuation of the sound during transmission to the service facility, from interference by ambient noise, and from factors such as the user's own unclear pronunciation or accent; such a voice command cannot be directly recognized as correct semantic information, so the service facility cannot be controlled.
In the prior art, semantic recognition of fuzzy speech mainly relies on preprocessing such as signal enhancement and on confidence evaluation, which does not effectively solve the problem of accurate semantic recognition from fuzzy speech.
With the development of artificial intelligence, recognition models such as SVMs (support vector machines) and neural networks have been applied to speech semantic recognition: feature quantities extracted from speech samples train the recognition model, and the feature quantities of the speech to be recognized are then input into the model to obtain semantic information. However, if such a model is used directly for semantic recognition of fuzzy speech, the highly varied forms of fuzzy speech make its feature quantities rich and diverse, while fuzzy speech samples often lack representativeness. The result is insufficient training of the recognition model and poor applicability of the trained model to other fuzzy speech.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. For the fuzzy speech present in user-dictated voice commands, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs semantic conversion and recognition on the standard speech. During training of the GAN network, speech feature matching maps the input fuzzy speech to a broader sample collection, and that collection is used to train the GAN network.
The invention provides a fuzzy speech semantic recognition method for artificial intelligence learning, which comprises the following steps:
step 1, acquiring a fuzzy voice signal input by a user, and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
step 2, determining a sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
step 3, constructing a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
step 4, constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
step 5, inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and inputting the fundamental frequency of the fuzzy speech into the converter to obtain the fundamental frequency of the reconstructed standard speech;
step 6, synthesizing the reconstructed standard speech from its spectral envelope characteristic quantity and fundamental frequency;
and 7, recognizing semantic information by using the reconstructed standard voice.
Preferably, a plurality of sample collections are established in step 2; each speech sample consists of a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples within a collection lies within a preset similarity range. The spectral envelope characteristic quantity of the fuzzy speech signal extracted in step 1 is matched against the collection-representative characteristic quantity of each sample collection, and the sample collection matching the spectral envelope characteristic quantity of the fuzzy speech signal is thereby selected.
It is further preferable that, in step 2, the sample collection has $n$ speech samples, and the spectral envelope characteristic quantities of the corresponding fuzzy speech samples are $X_{1s}, X_{2s}, \ldots, X_{ns}$. Each spectral envelope characteristic quantity is a $d$-dimensional feature vector, and together they form the characteristic quantity matrix of the sample collection $X_S = \{X_{1s}, X_{2s}, \ldots, X_{ns}\}$. For the $r$-th of the $d$ dimensions, the mean over the whole matrix $X_S$ is denoted $\bar{x}_{s,r}$. Sub-matrices of $n_k$ characteristic quantities are selected from $X_S$; sub-matrix $k$ is denoted $X_S^k$, so that every $n_k$ feature vectors of $X_S$ form one sub-matrix, $c$ sub-matrices in total, i.e. $k = 1, 2, \ldots, c$. The mean of sub-matrix $k$ in dimension $r$ is denoted $\bar{x}_{s,r}^k$. The inter-class distance of the $c$ sub-matrices is then

$$D_b = \sum_{k=1}^{c} \frac{n_k}{n} \sum_{r=1}^{d} \left(\bar{x}_{s,r}^k - \bar{x}_{s,r}\right)^2$$

and the intra-class distance of each sub-matrix $k$ of the $c$ sub-matrices is

$$D_w^k = \frac{1}{n_k} \sum_{r=1}^{d} \sum_{X \in X_S^k} \left(x_{s,r}^k - \bar{x}_{s,r}^k\right)^2$$

where $x_{s,r}^k$ is the value in dimension $r$ of each feature vector of $X_S^k$.

The intra-class/inter-class ratio is computed for each of the $c$ sub-matrices:

$$\sigma = D_b / D_w^k$$

and the sub-matrix with the highest ratio $\sigma$ is determined as the collection-representative characteristic quantity of the sample collection.
Preferably, the reconstruction model of the GAN architecture in step 3 comprises a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of standard speech from the spectral envelope characteristic quantity of the fuzzy speech input to it, and the discriminator judges the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
Preferably, the loss function $I_G(G)$ of the generator G in step 3 is expressed as:

$$I_G(G) = L_{adv}(G) + \lambda_c L_c(G) + \lambda_{id} L_{id}(G)$$

where $L_{adv}(G)$ is the adversarial loss of the generator G, $L_c(G)$ the cycle-consistency loss of the generator G, $\lambda_c$ the regularization parameter of the cycle-consistency loss, $L_{id}(G)$ the feature-mapping loss of the generator G, and $\lambda_{id}$ the regularization parameter of the feature-mapping loss.
Preferably, the loss function of the discriminator D in step 3 is expressed as:

$$I_D(D) = \mathbb{E}_{x_S \sim P_S}\left[\log D(x_S)\right] + \mathbb{E}_{x_t \sim P_t}\left[\log\left(1 - D(G(x_t))\right)\right]$$

where $D(x_S)$ is the discrimination value assigned by the discriminator D to the spectral envelope characteristic quantity of a standard speech sample in the input sample collection, and $\mathbb{E}_{x_S \sim P_S}$ the expectation over the probability distribution of the standard speech samples; $D(G(x_t))$ is the discrimination value assigned by the discriminator D to the spectral envelope characteristic quantity of standard speech generated by the generator G from the fuzzy speech feature $x_t$, and $\mathbb{E}_{x_t \sim P_t}$ the expectation over the probability distribution of the fuzzy speech features.
Preferably, the fundamental frequency conversion function constructed in step 4 is:

$$\log f_G = \frac{\sigma_G}{\sigma_t}\left(\log f_t - \mu_t\right) + \mu_G$$

where $\mu_G$ and $\sigma_G$ are the mean and variance in the log domain of the standard speech generated by the generator, $\mu_t$ and $\sigma_t$ the mean and variance in the log domain of the fuzzy speech, $f_t$ the fundamental frequency of the fuzzy speech, and $f_G$ the converted standard speech fundamental frequency.
Furthermore, the invention provides a fuzzy speech semantic recognition system for artificial intelligence learning, which comprises:
the fuzzy voice signal characteristic quantity extraction module is used for collecting a fuzzy voice signal input by a user and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
the sample selection matching module is used for determining a sample selection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
the GAN reconstruction model building and training module is used for building a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
the converter construction module is used for constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
the reconstruction conversion module is used for inputting the spectral envelope characteristic quantity of the fuzzy voice signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard voice output by the generator of the reconstruction model, inputting the spectral envelope characteristic quantity of the reconstructed standard voice into the converter and converting the fundamental frequency of the reconstructed standard voice;
and the standard voice synthesis module synthesizes and reconstructs the standard voice according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard voice.
And the semantic information recognition module is used for recognizing the semantic information by utilizing the reconstructed standard voice.
Preferably, the sample collection matching module holds a plurality of sample collections; each speech sample consists of a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples within a collection lies within a preset similarity range. The spectral envelope characteristic quantity of the fuzzy speech signal is matched against the collection-representative characteristic quantity of each sample collection, and the sample collection matching the spectral envelope characteristic quantity of the fuzzy speech signal is thereby selected.
It is further preferred that a sample collection in the sample collection matching module has $n$ speech samples, and the spectral envelope characteristic quantities of the corresponding fuzzy speech samples are $X_{1s}, X_{2s}, \ldots, X_{ns}$. Each spectral envelope characteristic quantity is a $d$-dimensional feature vector, and together they form the characteristic quantity matrix of the sample collection $X_S = \{X_{1s}, X_{2s}, \ldots, X_{ns}\}$. For the $r$-th of the $d$ dimensions, the mean over the whole matrix $X_S$ is denoted $\bar{x}_{s,r}$. Sub-matrices of $n_k$ characteristic quantities are selected from $X_S$; sub-matrix $k$ is denoted $X_S^k$, so that every $n_k$ feature vectors of $X_S$ form one sub-matrix, $c$ sub-matrices in total, i.e. $k = 1, 2, \ldots, c$. The mean of sub-matrix $k$ in dimension $r$ is denoted $\bar{x}_{s,r}^k$. The inter-class distance of the $c$ sub-matrices is then

$$D_b = \sum_{k=1}^{c} \frac{n_k}{n} \sum_{r=1}^{d} \left(\bar{x}_{s,r}^k - \bar{x}_{s,r}\right)^2$$

and the intra-class distance of each sub-matrix $k$ of the $c$ sub-matrices is

$$D_w^k = \frac{1}{n_k} \sum_{r=1}^{d} \sum_{X \in X_S^k} \left(x_{s,r}^k - \bar{x}_{s,r}^k\right)^2$$

where $x_{s,r}^k$ is the value in dimension $r$ of each feature vector of $X_S^k$.

The intra-class/inter-class ratio is computed for each of the $c$ sub-matrices:

$$\sigma = D_b / D_w^k$$

and the sub-matrix with the highest ratio $\sigma$ is determined as the collection-representative characteristic quantity of the sample collection.
Preferably, the GAN architecture reconstruction model constructed by the GAN reconstruction model construction and training module includes: a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of the standard voice according to the spectral envelope characteristic quantity of the fuzzy voice input into the generator; the discriminator is used for judging the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
Preferably, the loss function $I_G(G)$ of the generator G is expressed as:

$$I_G(G) = L_{adv}(G) + \lambda_c L_c(G) + \lambda_{id} L_{id}(G)$$

where $L_{adv}(G)$ is the adversarial loss of the generator G, $L_c(G)$ the cycle-consistency loss of the generator G, $\lambda_c$ the regularization parameter of the cycle-consistency loss, $L_{id}(G)$ the feature-mapping loss of the generator G, and $\lambda_{id}$ the regularization parameter of the feature-mapping loss.
Preferably, the loss function of the discriminator D is expressed as:

$$I_D(D) = \mathbb{E}_{x_S \sim P_S}\left[\log D(x_S)\right] + \mathbb{E}_{x_t \sim P_t}\left[\log\left(1 - D(G(x_t))\right)\right]$$

where $D(x_S)$ is the discrimination value assigned by the discriminator D to the spectral envelope characteristic quantity of a standard speech sample in the input sample collection, and $\mathbb{E}_{x_S \sim P_S}$ the expectation over the probability distribution of the standard speech samples; $D(G(x_t))$ is the discrimination value assigned by the discriminator D to the spectral envelope characteristic quantity of standard speech generated by the generator G from the fuzzy speech feature $x_t$, and $\mathbb{E}_{x_t \sim P_t}$ the expectation over the probability distribution of the fuzzy speech features.
Preferably, the fundamental frequency conversion function constructed by the converter construction module is:

$$\log f_G = \frac{\sigma_G}{\sigma_t}\left(\log f_t - \mu_t\right) + \mu_G$$

where $\mu_G$ and $\sigma_G$ are the mean and variance in the log domain of the standard speech generated by the generator, $\mu_t$ and $\sigma_t$ the mean and variance in the log domain of the fuzzy speech, $f_t$ the fundamental frequency of the fuzzy speech, and $f_G$ the converted standard speech fundamental frequency.
In summary, the invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. For the fuzzy speech present in user-dictated voice commands, the invention reconstructs the fuzzy speech into clear standard speech using a GAN network architecture, and then performs semantic conversion and recognition on the standard speech. By matching speech features, the input fuzzy speech is mapped to a broader sample collection, which is used to train the GAN network; the GAN network is therefore sufficiently trained and well adapted to the feature distribution of the current fuzzy speech, improving the accuracy and reliability of standard speech reconstruction and markedly raising the accuracy of speech-to-semantic recognition. Experimental verification shows a correct conversion rate above 95.6%.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flowchart of a fuzzy speech semantic recognition method for artificial intelligence learning according to an embodiment of the present application;
fig. 2 is a structural diagram of a fuzzy speech semantic recognition system for artificial intelligence learning according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
As shown in FIG. 1, the invention provides a fuzzy speech semantic recognition method for artificial intelligence learning, which comprises the following steps:
step 1, acquiring a fuzzy voice signal input by a user, and extracting high-dimensional characteristic quantity of the fuzzy voice signal.
The fuzzy speech semantic recognition method for artificial intelligence learning can be applied to voice control of service facilities in smart communities, smart buildings and smart homes. A user speaks a voice command to the service facility; the facility collects the speech signal with components such as a microphone and performs the necessary front-end enhancement (filtering, noise suppression, time-spectrum estimation) as well as windowing and framing of the speech signal, all of which belong to the prior art and are not detailed here. If the processed speech signal is clear speech, semantic information is recognized and converted directly; this is not an improvement of the invention and is not described further. The invention focuses on recognition of the fuzzy speech signal after acquisition and enhancement processing.
In this step, the high-dimensional characteristic quantity extracted from the fuzzy speech signal is the spectral envelope feature of each fuzzy speech signal frame. The extraction process applies a short-time FFT to each frame to obtain its spectrum, passes the spectrum through a Mel filter bank to obtain the Mel spectrum, then takes the logarithm and applies a DCT (discrete cosine transform) to obtain the MFCC coefficients, and keeps 12 to 16 MFCC coefficients as the spectral envelope characteristic quantity $X_t$ of the fuzzy speech signal frame.
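As an illustration of the extraction chain just described (short-time FFT, Mel filter bank, logarithm, DCT, truncation to 12-16 coefficients), the following is a minimal NumPy sketch for a single windowed frame. The sampling rate, filter-bank size and number of retained coefficients are illustrative assumptions, and the helper functions are not part of the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters evenly spaced on the Mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def dct_ii(x, n_out):
    # DCT-II applied to the log-Mel energies, keeping the first n_out coefficients.
    n = len(x)
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return basis @ x

def frame_spectral_envelope(frame, sr=16000, n_mels=26, n_keep=13):
    """Spectral envelope X_t of one windowed speech frame (MFCC-style)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2                 # short-time FFT power spectrum
    mel_energy = mel_filterbank(n_mels, len(frame), sr) @ spec
    log_mel = np.log(mel_energy + 1e-10)                   # log of the Mel spectrum
    return dct_ii(log_mel, n_keep)                         # keep 12-16 coefficients

# Example: one 32 ms frame of a synthetic vowel-like signal
t = np.arange(512) / 16000.0
frame = np.hanning(512) * np.sin(2 * np.pi * 220 * t)
X_t = frame_spectral_envelope(frame)
print(X_t.shape)  # (13,)
```

In practice a library MFCC routine would replace these helpers; the sketch only makes the order of operations in the text concrete.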
Step 2, determining a sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal.
In the subsequent steps, a GAN-based clear speech reconstruction model must be trained with a sample set of sufficient capacity. Fuzzy speech is highly diverse, and generic fuzzy speech samples are often insufficiently representative, leading to inadequate GAN training. The invention therefore establishes a plurality of sample collections, each containing about 1000 segments of speech samples; each speech sample consists of a fuzzy speech sample and a standard speech sample, and the similarity of the characteristic quantities of the fuzzy speech samples within a collection lies within a preset similarity range. In this step, the spectral envelope characteristic quantity of the fuzzy speech signal extracted in step 1 is matched against the collection-representative characteristic quantity of each sample collection, and the matching sample collection is selected.
For a sample collection, assume there are $n$ segments of speech samples, and the spectral envelope characteristic quantities of the corresponding fuzzy speech samples are $X_{1s}, X_{2s}, \ldots, X_{ns}$. Each spectral envelope characteristic quantity is a $d$-dimensional feature vector, and together they form the characteristic quantity matrix of the sample collection $X_S = \{X_{1s}, X_{2s}, \ldots, X_{ns}\}$. For the $r$-th of the $d$ dimensions, the mean over the whole matrix $X_S$ is denoted $\bar{x}_{s,r}$. Sub-matrices of $n_k$ characteristic quantities are selected from $X_S$; sub-matrix $k$ is denoted $X_S^k$, so that every $n_k$ feature vectors of $X_S$ form one sub-matrix, $c$ sub-matrices in total, i.e. $k = 1, 2, \ldots, c$. The mean of sub-matrix $k$ in dimension $r$ is denoted $\bar{x}_{s,r}^k$. The inter-class distance of the $c$ sub-matrices is then

$$D_b = \sum_{k=1}^{c} \frac{n_k}{n} \sum_{r=1}^{d} \left(\bar{x}_{s,r}^k - \bar{x}_{s,r}\right)^2$$

and the intra-class distance of each sub-matrix $k$ of the $c$ sub-matrices is

$$D_w^k = \frac{1}{n_k} \sum_{r=1}^{d} \sum_{X \in X_S^k} \left(x_{s,r}^k - \bar{x}_{s,r}^k\right)^2$$

where $x_{s,r}^k$ is the value in dimension $r$ of each feature vector of $X_S^k$.

The intra-class/inter-class ratio is computed for each of the $c$ sub-matrices:

$$\sigma = D_b / D_w^k$$

and the sub-matrix with the highest ratio $\sigma$ is determined as the collection-representative characteristic quantity of the sample collection.
Matching the spectral envelope characteristic quantity of the fuzzy speech signal against the collection-representative characteristic quantity of each sample collection means computing the average vector distance between the spectral envelope characteristic quantity of the fuzzy speech signal and the characteristic quantities in the sub-matrix serving as the collection-representative characteristic quantity, and selecting the sample collection with the minimum average vector distance, i.e. the collection matching the spectral envelope characteristic quantity of the fuzzy speech signal.
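One way to read the representative-sub-matrix selection and the average-vector-distance matching is the following NumPy sketch. The contiguous sub-matrix partitioning, the per-sub-matrix ratio $\sigma = D_b / D_w^k$, and the synthetic data are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def representative_submatrix(X_S, n_k):
    """Split feature matrix X_S (n x d) into sub-matrices of n_k rows and
    return the one with the highest ratio sigma = D_b / D_w^k."""
    n, d = X_S.shape
    grand_mean = X_S.mean(axis=0)                      # mean per dimension r over all of X_S
    subs = [X_S[i:i + n_k] for i in range(0, n - n_k + 1, n_k)]
    # Inter-class distance over all c sub-matrices
    D_b = sum(len(S) / n * np.sum((S.mean(axis=0) - grand_mean) ** 2) for S in subs)
    best, best_sigma = None, -np.inf
    for S in subs:
        # Intra-class distance of this sub-matrix
        D_w = np.sum((S - S.mean(axis=0)) ** 2) / len(S)
        sigma = D_b / (D_w + 1e-12)
        if sigma > best_sigma:
            best, best_sigma = S, sigma
    return best

def match_collection(x_t, representatives):
    """Pick the collection whose representative sub-matrix has the minimum
    average vector distance to the fuzzy-speech feature x_t."""
    dists = [np.mean(np.linalg.norm(R - x_t, axis=1)) for R in representatives]
    return int(np.argmin(dists))

# Toy data: two collections of 20 samples with 13-dimensional features
rng = np.random.default_rng(0)
collections = [rng.normal(m, 1.0, size=(20, 13)) for m in (0.0, 3.0)]
reps = [representative_submatrix(X, 5) for X in collections]
x_t = rng.normal(3.0, 0.5, size=13)   # feature close to the second collection
print(match_collection(x_t, reps))    # 1
```

Any partitioning of $X_S$ into sub-matrices would fit the text equally well; contiguous blocks are used here only for brevity.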
And 3, constructing a reconstruction model of the GAN architecture for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection.
The reconstruction model of the GAN architecture comprises a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of standard speech from the spectral envelope characteristic quantity of the fuzzy speech input to it, and the discriminator judges the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
The generator adopts a two-dimensional convolutional neural network composed of an encoding network and a decoding network. The encoding network comprises 5 convolutional layers and the decoding network 5 deconvolution layers; a ResNet-style connection is established between the encoding and decoding networks, and normalization is performed after each convolutional layer. The discriminator uses a two-dimensional convolutional neural network of 5 convolutional layers, with normalization after each convolutional layer.
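A PyTorch sketch of the described topology follows. Kernel sizes, channel widths, activations and the exact form of normalization are not specified in the text; the choices here (3x3 kernels, instance normalization, a single residual skip from input to output) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encoder-decoder 2-D CNN as described: 5 convolutional layers,
    5 deconvolution layers, normalization after each layer, residual link."""
    def __init__(self, ch=1, base=32):
        super().__init__()
        enc, dec, c_in = [], [], ch
        for i in range(5):  # encoding network: 5 conv layers
            c_out = base * (2 ** min(i, 3))
            enc += [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
                    nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True)]
            c_in = c_out
        for i in range(5):  # decoding network: 5 deconv layers
            c_out = ch if i == 4 else base * (2 ** min(3 - i, 3))
            dec += [nn.ConvTranspose2d(c_in, c_out, 3, stride=1, padding=1),
                    nn.InstanceNorm2d(c_out)]
            if i < 4:
                dec += [nn.ReLU(inplace=True)]
            c_in = c_out
        self.enc, self.dec = nn.Sequential(*enc), nn.Sequential(*dec)

    def forward(self, x):
        return self.dec(self.enc(x)) + x   # ResNet-style skip from input to output

class Discriminator(nn.Module):
    """5-layer 2-D CNN with normalization, ending in a real/fake score."""
    def __init__(self, ch=1, base=32):
        super().__init__()
        layers, c_in = [], ch
        for i in range(5):
            c_out = base * (2 ** min(i, 3))
            layers += [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
                       nn.InstanceNorm2d(c_out), nn.LeakyReLU(0.2, inplace=True)]
            c_in = c_out
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv2d(c_in, 1, 1)

    def forward(self, x):
        # Average the patch scores into one discrimination value per input
        return torch.sigmoid(self.head(self.body(x)).mean(dim=(2, 3)))

# Spectral envelope features treated as a (batch, 1, coefficients, frames) "image"
x = torch.randn(2, 1, 13, 64)
G, D = Generator(), Discriminator()
print(G(x).shape, D(G(x)).shape)  # torch.Size([2, 1, 13, 64]) torch.Size([2, 1])
```

Because all strides are 1, the generator preserves the feature-map shape, which keeps the residual addition and the reconstruction of per-frame envelopes straightforward.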
In the training process, for the fuzzy speech samples in the sample collection, the spectral envelope characteristic quantity $X_t$ of each fuzzy speech sample is input into the generator; the generator is trained to minimize its loss function and outputs the spectral envelope characteristic quantity of the reconstructed standard speech.
The loss function $I_G(G)$ of the generator G is expressed as:

$$I_G(G) = L_{adv}(G) + \lambda_c L_c(G) + \lambda_{id} L_{id}(G)$$

where $L_{adv}(G)$ is the adversarial loss of the generator G, $L_c(G)$ the cycle-consistency loss of the generator G, $\lambda_c$ the regularization parameter of the cycle-consistency loss, $L_{id}(G)$ the feature-mapping loss of the generator G, and $\lambda_{id}$ the regularization parameter of the feature-mapping loss.
In the training process, the spectral envelope characteristic quantity of the reconstructed standard speech and the spectral envelope characteristic quantity of the standard speech samples in the sample collection are input into the discriminator, and the discriminator is trained to minimize its loss function.
The loss function of the discriminator D is expressed as:

$$I_D(D) = \mathbb{E}_{x_S \sim P_S}\left[\log D(x_S)\right] + \mathbb{E}_{x_t \sim P_t}\left[\log\left(1 - D(G(x_t))\right)\right]$$

where $D(x_S)$ is the discrimination value assigned by the discriminator D to the spectral envelope characteristic quantity of a standard speech sample in the input sample collection, and $\mathbb{E}_{x_S \sim P_S}$ the expectation over the probability distribution of the standard speech samples; $D(G(x_t))$ is the discrimination value assigned by the discriminator D to the spectral envelope characteristic quantity of standard speech generated by the generator G from the fuzzy speech feature $x_t$, and $\mathbb{E}_{x_t \sim P_t}$ the expectation over the probability distribution of the fuzzy speech features.
Through this training the loss functions of the generator and the discriminator are minimized, and after a preset number of iterations the trained GAN-architecture reconstruction model for reconstructing fuzzy speech into standard speech is obtained.
Step 4, constructing a converter for converting the fuzzy speech fundamental frequency into the standard speech fundamental frequency. The fundamental frequency transfer function is:

$$\log f_G = \frac{\sigma_G}{\sigma_t}\left(\log f_t - \mu_t\right) + \mu_G$$

where $\mu_G$ and $\sigma_G$ are the mean and variance in the log domain of the standard speech generated by the generator, $\mu_t$ and $\sigma_t$ the mean and variance in the log domain of the fuzzy speech, $f_t$ the fundamental frequency of the fuzzy speech, and $f_G$ the converted standard speech fundamental frequency.
Step 5, inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and inputting the fundamental frequency of the fuzzy speech into a converter to convert the fundamental frequency of the reconstructed standard speech;
Step 6, synthesizing the reconstructed standard speech from the spectral envelope characteristic quantity and fundamental frequency of the reconstructed standard speech. Specifically, these can be fed into an existing speech synthesizer, such as the WORLD vocoder, to obtain the synthesized reconstructed standard speech.
And 7, recognizing semantic information by using the reconstructed standard voice.
Furthermore, as shown in fig. 2, the present invention provides a fuzzy speech semantic recognition system for artificial intelligence learning, comprising:
and the fuzzy voice signal characteristic quantity extraction module is used for collecting the fuzzy voice signal input by the user and extracting the high-dimensional characteristic quantity of the fuzzy voice signal.
The fuzzy speech semantic recognition system for artificial intelligence learning of the invention can be applied to speech control functions of service facilities in intelligent communities, intelligent buildings and intelligent families, a user speaks a speech command to the service facilities, the service facilities collect speech signals by using components such as a microphone and the like, necessary front-end enhancement processing such as filtering, noise suppression, time spectrum estimation and the like is carried out, and windowing and framing processing of the speech signals are carried out, so that the system is not described in detail in the prior art. If the processed speech signal belongs to clear speech, the semantic information is directly identified and converted, and the speech signal is not an improvement point of the invention and is not specifically described here.
The fuzzy speech signal characteristic quantity extraction module extracts high-dimensional characteristic quantity of a fuzzy speech signal, wherein the high-dimensional characteristic quantity is specifically the spectrum envelope characteristic of each fuzzy speech signal frame, the spectrum envelope characteristic extraction process is to perform short-time FTT conversion on each fuzzy speech signal frame to obtain the spectrum of each fuzzy speech signal frame, obtain the Mei spectrum of the spectrum of each fuzzy speech signal frame through an Mei filter, then perform logarithm taking and DCT (discrete cosine transform) on the basis of the Mei spectrum to obtain MFCC (Mel-based coefficient), intercept 12-16 MFCC coefficients and use the MFCC coefficients as the spectrum envelope characteristic quantity X of the fuzzy speech signal framet
And the sample selection matching module is used for determining a sample selection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal.
The method establishes a plurality of sample collections, wherein each sample collection can contain about 1000 sections of voice samples, each voice sample comprises a fuzzy voice sample and a standard voice sample, and the similarity of the characteristic quantity of the fuzzy voice sample is within a preset similarity range. The sample collection may be stored in a sample library of the sample collection matching module.
And for the extracted spectral envelope characteristic quantity of the fuzzy speech signal, the sample selection matching module matches the selection representative characteristic quantity of each sample selection, so that the sample selection matched with the spectral envelope characteristic quantity of the fuzzy speech signal is selected.
For a sample collection, assuming that n sections of voice samples exist, a sample collection matching module determines the spectral envelope characteristic quantity of a fuzzy voice sample corresponding to each voice sample to be X1s,X2s...XnsEach frequency spectrum envelope characteristic quantity is d-dimension characteristic vector, and a characteristic quantity matrix X of the sample collection is formedS={X1s,X2s...Xns}; for the r-th dimension in the d-dimension, its entire feature quantity matrix X is calculatedSIs expressed asAnd selecting a feature quantity matrix XSIn nkA submatrix composed of characteristic quantities, denoted as a submatrix kThereby the characteristic quantity matrix XSIn each nkForming a sub-matrix by the feature vectors, wherein the total number of the sub-matrices is c, namely k is 1,2.. c; the mean of the r-dimension of the sub-matrix k in the d-dimension is expressed asThen calculate the inter-class distance of the c sub-matrices:
and calculating the intra-class distance of each sub-matrix of the c sub-matrices:
wherein xk s,rIs Xk SThe value of each feature vector in r dimension.
Calculating the intra-class inter-class proportion of each submatrix of the c submatrixes:
σ=Db/Dw
and further determining the submatrix with the highest intra-class inter-class proportion value as the collection representative characteristic quantity of the sample collection.
The sample selection matching module matches the spectrum envelope characteristic quantity of the fuzzy speech signal with the selection representative characteristic quantity of each sample selection, namely calculates the spectrum envelope characteristic quantity of the fuzzy speech signal and the characteristic quantity in the submatrix as the selection representative characteristic quantity to calculate the average vector distance, selects the sample selection with the minimum average vector distance, and accordingly selects the sample selection matched with the spectrum envelope characteristic quantity of the fuzzy speech signal.
And the GAN reconstruction model building and training module is used for building a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection.
The reconstruction model of the GAN architecture comprises the following steps: a generator and a discriminator; the generator reconstructs the spectral envelope characteristic quantity of the standard voice according to the spectral envelope characteristic quantity of the fuzzy voice input into the generator; the discriminator is used for judging the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
The generator adopts a two-dimensional convolution neural network and consists of an encoding network and a decoding network. The coding network comprises 5 convolutional layers, the decoding network comprises 5 deconvolution layers, ResNet is established between the coding network and the decoding network, and standardization is carried out after each convolutional layer. The discriminator uses a two-dimensional convolutional neural network, comprising 5 convolutional layers, standardized after each convolutional layer.
In the training process, for the fuzzy voice samples in the sample selection set, the spectral envelope characteristic quantity of the fuzzy voice samplesXtThe input generator and the training generator minimize the loss function of the generator, and the generator outputs the spectral envelope characteristic quantity of the reconstructed standard voice.
Loss function I of generator GG(G) Expressed as:
whereinRepresenting the penalty, L, of the generator Gc(G) Representing the loss of cyclic agreement of the generator G,regularization parameter, L, representing cyclic consistency lossid(G) Representing the loss of the feature map of the generator G,a regularization parameter that represents a feature mapping penalty.
In the training process, the spectrum envelope characteristic quantity of the reconstructed standard voice and the spectrum envelope characteristic quantity of the standard voice sample in the sample selection are input into a discriminator, and the discriminator is trained to minimize the loss function of the discriminator.
The loss function of discriminator D is expressed as:
wherein D (x)S) A discrimination value representing a spectral envelope characteristic quantity of a standard speech sample in the input sample collection by discriminator D,representing expectations of probability distribution for standard speech samples
D(G(xt) Representation discriminator D to generator G based on the fuzzy speech feature xtGenerated standard speechThe discrimination value of the spectral envelope characteristic quantity of the sample,representing feature x of fuzzy speechtExpectation of probability distribution
Through the training, the loss functions of the generator and the discriminator are minimized, and the trained GAN framework reconstruction model for reconstructing the fuzzy speech into the standard speech is obtained through the preset iteration times.
The converter construction module is used for constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency; the fundamental transfer function is:
wherein muGAnd σGMean and variance, mu, in the log domain for the standard speech generated by the generatortAnd σtMean and variance in the log domain for blurred speech, ftFor blurring the fundamental frequency of speech, fGIs converted standard speech fundamental frequency.
The reconstruction conversion module is used for inputting the spectral envelope characteristic quantity of the fuzzy voice signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard voice output by the generator of the reconstruction model, inputting the spectral envelope characteristic quantity of the reconstructed standard voice into the converter and converting the fundamental frequency of the reconstructed standard voice;
and the standard voice synthesis module synthesizes and reconstructs the standard voice according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard voice.
And the semantic information recognition module is used for recognizing the semantic information by utilizing the reconstructed standard voice.
Therefore, the invention provides a fuzzy speech semantic recognition method and system for artificial intelligence learning. The invention aims at the fuzzy speech existing in the user dictation speech instruction, reconstructs the fuzzy speech into clear standard speech by utilizing a GAN network architecture, and further realizes the conversion and identification of semantic information based on the standard speech. According to the invention, the input fuzzy speech is corresponding to the sample collection with a larger range through the speech feature matching, and the training of the GAN network is realized by the sample collection, so that the GAN network training is sufficient and is fully adapted to the feature distribution of the current fuzzy speech, the accuracy and reliability of reconstructing the standard speech are further improved, the accuracy rate from speech to semantic information recognition is obviously improved, and the correct conversion rate can reach more than 95.6% through experimental verification.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A fuzzy speech semantic recognition method for artificial intelligence learning comprises the following steps:
step 1, acquiring a fuzzy voice signal input by a user, and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
step 2, determining a sample collection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
step 3, constructing a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
step 4, constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
step 5, inputting the spectral envelope characteristic quantity of the fuzzy speech signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard speech output by the generator of the reconstruction model, and inputting the fundamental frequency of the fuzzy speech into a converter to convert the fundamental frequency of the reconstructed standard speech;
step 6, synthesizing and reconstructing the standard voice according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard voice;
and 7, recognizing semantic information by using the reconstructed standard voice.
2. The fuzzy speech semantic recognition method according to claim 1, wherein a plurality of sample collections are established in step 2, each speech sample comprises a fuzzy speech sample and a standard speech sample, and the similarity of the feature quantity of the fuzzy speech sample is within a preset similarity range; and matching the spectral envelope characteristic quantity of the fuzzy speech signal extracted in the step 1 with the collection representative characteristic quantity of each sample collection, thereby selecting the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
3. The fuzzy speech semantic recognition method according to claim 2, wherein in step 2, the sample collection has n speech samples, and the spectral envelope characteristic quantity of the fuzzy speech sample corresponding to each speech sample is X1s,X2s…XnsEach frequency spectrum envelope characteristic quantity is d-dimension characteristic vector, and a characteristic quantity matrix X of the sample collection is formedS={X1s,X2s…Xns}; for the r-th dimension in the d-dimension, its entire feature quantity matrix X is calculatedSIs expressed asAnd selecting a feature quantity matrix XSIn nkA submatrix composed of characteristic quantities, denoted as a submatrix kThereby the characteristic quantity matrix XSIn each nkForming a sub-matrix by the feature vectors, wherein the total number of the sub-matrices is c, namely k is 1,2.. c; the mean of the r-dimension of the sub-matrix k in the d-dimension is expressed asThen calculate the inter-class distance of the c sub-matrices:
and calculating the intra-class distance of each sub-matrix of the c sub-matrices:
wherein xk s,rIs Xk SThe value of each feature vector in r dimension;
calculating the intra-class inter-class proportion of each submatrix of the c submatrixes:
σ=Db/Dw
and further determining the submatrix with the highest intra-class inter-class proportion value as the collection representative characteristic quantity of the sample collection.
4. The fuzzy speech semantic recognition method of claim 1, wherein the step 3 of reconstructing the GAN architecture comprises: a generator G and a discriminator D; the generator reconstructs the spectral envelope characteristic quantity of the standard voice according to the spectral envelope characteristic quantity of the fuzzy voice input into the generator; the discriminator is used for judging the authenticity of the spectral envelope characteristic quantity reconstructed by the generator.
5. The fuzzy speech semantic recognition method of claim 4, wherein the loss function I of the generator G in step 3G(G) Expressed as:
whereinRepresenting the penalty, L, of the generator Gc(G) Representing the loss of cyclic agreement of the generator G,regularization parameter, L, representing cyclic consistency lossid(G) Representing the loss of the feature map of the generator G,a regularization parameter that represents a feature mapping penalty.
6. The fuzzy speech semantic recognition method of claim 4, wherein the loss function of the discriminator D in step 3 is expressed as:
wherein D (x)S) A discrimination value representing a spectral envelope characteristic quantity of a standard speech sample in the input sample collection by discriminator D,representing an expectation of a probability distribution for a standard speech sample;
D(G(xt) Representation discriminator D to generator G based on the fuzzy speech feature xtThe discrimination value of the spectral envelope characteristic quantity of the generated standard voice sample,representing feature x of fuzzy speechtExpectation of probability distribution.
7. The fuzzy speech semantic recognition method according to claim 1, wherein the fundamental frequency conversion function constructed in step 4 is:
wherein muGAnd σGMean and variance, mu, in the log domain for the standard speech generated by the generatortAnd σtMean and variance in the log domain for blurred speech, ftFor blurring the fundamental frequency of speech, fGIs converted standard speech fundamental frequency.
8. An artificial intelligence learning fuzzy speech semantic recognition system, comprising:
the fuzzy voice signal characteristic quantity extraction module is used for collecting a fuzzy voice signal input by a user and extracting high-dimensional characteristic quantity of the fuzzy voice signal;
the sample selection matching module is used for determining a sample selection matched with the characteristics of the fuzzy speech signal according to the spectral envelope characteristic quantity of the fuzzy speech signal;
the GAN reconstruction model building and training module is used for building a reconstruction model of a GAN framework for reconstructing the fuzzy speech into the standard speech, and training the reconstruction model by utilizing the sample collection;
the converter construction module is used for constructing a converter for converting the fuzzy voice fundamental frequency into the standard voice fundamental frequency;
the reconstruction conversion module is used for inputting the spectral envelope characteristic quantity of the fuzzy voice signal input by the user into the trained reconstruction model to obtain the spectral envelope characteristic quantity of the reconstructed standard voice output by the generator of the reconstruction model, inputting the spectral envelope characteristic quantity of the reconstructed standard voice into the converter and converting the fundamental frequency of the reconstructed standard voice;
and the standard voice synthesis module synthesizes and reconstructs the standard voice according to the spectral envelope characteristic quantity and the fundamental frequency of the reconstructed standard voice.
And the semantic information recognition module is used for recognizing the semantic information by utilizing the reconstructed standard voice.
9. The system according to claim 8, wherein the sample collection matching module has a plurality of sample collections, each of the speech samples includes a fuzzy speech sample and a standard speech sample, and the similarity of the feature quantity of the fuzzy speech sample is within a preset similarity range; and matching the spectral envelope characteristic quantity of the fuzzy speech signal with the collection representative characteristic quantity of each sample collection based on the spectral envelope characteristic quantity of the fuzzy speech signal, so as to select the sample collection matched with the spectral envelope characteristic quantity of the fuzzy speech signal.
10. The system according to claim 9, wherein the sample collection in the sample collection matching module has n speech samples, and the spectral envelope characteristic quantity of the fuzzy speech sample corresponding to each speech sample is X1s,X2s…XnsEach frequency spectrum envelope characteristic quantity is d-dimension characteristic vector, and a characteristic quantity matrix X of the sample collection is formeds={X1s,X2s…Xns}; for the r-th dimension in the d-dimension, its entire feature quantity matrix X is calculatedSIs expressed asAnd selecting a feature quantity matrix XSIn nkA submatrix composed of characteristic quantities, denoted as a submatrix kThereby the characteristic quantity matrix XSIn each nkForming a sub-matrix by the feature vectors, wherein the total number of the sub-matrices is c, namely k is 1,2.. c; the mean of the r-dimension of the sub-matrix k in the d-dimension is expressed asThen calculate the inter-class distance of the c sub-matrices:
and calculating the intra-class distance of each sub-matrix of the c sub-matrices:
wherein xk s,rIs Xk sThe value of each feature vector in r dimension;
calculating the intra-class inter-class proportion of each submatrix of the c submatrixes:
σ=Db/Dw
and further determining the submatrix with the highest intra-class inter-class proportion value as the collection representative characteristic quantity of the sample collection.
CN201910713034.8A 2019-08-02 2019-08-02 Fuzzy speech semantic recognition method and system for artificial intelligence learning Active CN110600012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910713034.8A CN110600012B (en) 2019-08-02 2019-08-02 Fuzzy speech semantic recognition method and system for artificial intelligence learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910713034.8A CN110600012B (en) 2019-08-02 2019-08-02 Fuzzy speech semantic recognition method and system for artificial intelligence learning

Publications (2)

Publication Number Publication Date
CN110600012A true CN110600012A (en) 2019-12-20
CN110600012B CN110600012B (en) 2020-12-04

Family

ID=68853447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910713034.8A Active CN110600012B (en) 2019-08-02 2019-08-02 Fuzzy speech semantic recognition method and system for artificial intelligence learning

Country Status (1)

Country Link
CN (1) CN110600012B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053360A (en) * 2021-03-09 2021-06-29 南京师范大学 High-precision software recognition method based on voice

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090271002A1 (en) * 2008-04-29 2009-10-29 David Asofsky System and Method for Remotely Controlling Electronic Devices
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
US20130191129A1 (en) * 2012-01-19 2013-07-25 International Business Machines Corporation Information Processing Device, Large Vocabulary Continuous Speech Recognition Method, and Program
CN106448684A (en) * 2016-11-16 2017-02-22 北京大学深圳研究生院 Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system
CN107945805A (en) * 2017-12-19 2018-04-20 程海波 A kind of intelligent across language voice identification method for transformation
CN108766409A (en) * 2018-05-25 2018-11-06 中国传媒大学 A kind of opera synthetic method, device and computer readable storage medium
US20190013012A1 (en) * 2017-07-04 2019-01-10 Minds Lab., Inc. System and method for learning sentences
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN110060691A (en) * 2019-04-16 2019-07-26 南京邮电大学 Multi-to-multi phonetics transfer method based on i vector sum VARSGAN

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090271002A1 (en) * 2008-04-29 2009-10-29 David Asofsky System and Method for Remotely Controlling Electronic Devices
US20130191129A1 (en) * 2012-01-19 2013-07-25 International Business Machines Corporation Information Processing Device, Large Vocabulary Continuous Speech Recognition Method, and Program
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN106448684A (en) * 2016-11-16 2017-02-22 北京大学深圳研究生院 Deep-belief-network-characteristic-vector-based channel-robust voiceprint recognition system
US20190013012A1 (en) * 2017-07-04 2019-01-10 Minds Lab., Inc. System and method for learning sentences
CN107945805A (en) * 2017-12-19 2018-04-20 程海波 A kind of intelligent across language voice identification method for transformation
CN108766409A (en) * 2018-05-25 2018-11-06 中国传媒大学 A kind of opera synthetic method, device and computer readable storage medium
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109599091A (en) * 2019-01-14 2019-04-09 南京邮电大学 Multi-to-multi voice conversion method based on STARWGAN-GP and x vector
CN110060691A (en) * 2019-04-16 2019-07-26 南京邮电大学 Multi-to-multi phonetics transfer method based on i vector sum VARSGAN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
范自柱: "《新型特征抽取算法研究》", 31 December 2016 *
韩志艳: "《面向语音与面部表情信号的多模式情感识别技术研究》", 31 January 2017, 东北大学出版社 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053360A (en) * 2021-03-09 2021-06-29 南京师范大学 High-precision software recognition method based on voice

Also Published As

Publication number Publication date
CN110600012B (en) 2020-12-04

Similar Documents

Publication Publication Date Title
RU2373584C2 (en) Method and device for increasing speech intelligibility using several sensors
CN110120227A (en) A kind of depth stacks the speech separating method of residual error network
CN102509547A (en) Method and system for voiceprint recognition based on vector quantization based
CN112349297A (en) Depression detection method based on microphone array
CN113327626A (en) Voice noise reduction method, device, equipment and storage medium
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
CN111461173A (en) Attention mechanism-based multi-speaker clustering system and method
CN107358947A (en) Speaker recognition methods and system again
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN106971724A (en) A kind of anti-tampering method for recognizing sound-groove and system
JP2015022112A (en) Voice activity detection device and method
CN111489763B (en) GMM model-based speaker recognition self-adaption method in complex environment
CN111554279A (en) Multi-mode man-machine interaction system based on Kinect
CN110211609A (en) A method of promoting speech recognition accuracy
CN110600012B (en) Fuzzy speech semantic recognition method and system for artificial intelligence learning
CN110347426B (en) Intelligent release APP platform system and method thereof
CN105845131A (en) Far-talking voice recognition method and device
CN112002307B (en) Voice recognition method and device
KR101863098B1 (en) Apparatus and method for speech recognition
CN107945807B (en) Voice recognition method and system based on silence run
Bora et al. Speaker identification for biometric access control using hybrid features
CN110767238A (en) Blacklist identification method, apparatus, device and storage medium based on address information
CN110060692A (en) A kind of Voiceprint Recognition System and its recognition methods
CN114121004B (en) Voice recognition method, system, medium and equipment based on deep learning
CN117762372A (en) Multi-mode man-machine interaction system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200914

Address after: 200232 floor 18, building 2, No. 277, Longlan Road, Xuhui District, Shanghai

Applicant after: LIGHT CONTROLS TESILIAN (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 100027 West Tower 11 floor, Kai Hao building, 8 Xinyuan South Road, Chaoyang District, Beijing.

Applicant before: Terminus(Beijing) Technology Co.,Ltd.

Applicant before: LIGHT CONTROLS TESILIAN (SHANGHAI) INFORMATION TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant