CN117711413A - Voice recognition data processing method, system, device and storage medium


Info

Publication number
CN117711413A
CN117711413A
Authority
CN
China
Prior art keywords
matrix
feature
emotion
recognition data
low
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311454059.3A
Other languages
Chinese (zh)
Inventor
吴隶妍
陈章
林雄
林少穗
李耀坚
黎嘉宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Guangxin Communications Services Co Ltd
Original Assignee
Guangdong Guangxin Communications Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Guangxin Communications Services Co Ltd filed Critical Guangdong Guangxin Communications Services Co Ltd
Priority to CN202311454059.3A priority Critical patent/CN117711413A/en
Publication of CN117711413A publication Critical patent/CN117711413A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a voice recognition data processing method, system, device and storage medium. The method comprises the following steps: acquiring target voice recognition data, and performing feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix; performing dimension reduction processing on the three vector matrices to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix; performing feature fusion on the three low-dimensional matrices to obtain a fusion feature matrix; and inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result. The invention improves the comprehensiveness and accuracy of denoising voice recognition data and can be widely applied in the technical field of data processing.

Description

Voice recognition data processing method, system, device and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, a system, an apparatus, and a storage medium for processing voice recognition data.
Background
With the rise of electronic commerce, electronic government affairs and mobile internet applications, the channels for acquiring information and products have multiplied, consultation and consumption behaviors have become increasingly experience-driven, and citizens and consumers pay more and more attention to the quality of customer service. Compared with text-based customer service, voice customer service has better affinity, and call centers have therefore maintained steady growth. However, the operating cost of a traditional call center is high, labor cost being the main component, so intelligent automation is the key to a breakthrough. To improve intelligent recognition capability, speech quality must first be improved. Customers call from all kinds of open environments in which background noise and interference are unavoidable, and their pronunciation habits differ greatly, so the speech must be purified before recognition in order to remove irregular noise, the voices of non-customers, and the redundant data caused by customer accents, speaking habits or poor call quality. Existing voice denoising methods can only recognize and filter noise and interference; they cannot accurately recognize the redundant data in customer speech and denoise comprehensively, which reduces the comprehensiveness and accuracy of denoising the voice recognition data and, in turn, the efficiency and accuracy of subsequent speech recognition.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art to a certain extent.
Therefore, an object of the embodiments of the present invention is to provide a method for processing speech recognition data, which improves the comprehensiveness and accuracy of denoising the speech recognition data, thereby improving the efficiency and accuracy of speech recognition.
It is another object of an embodiment of the present invention to provide a speech recognition data processing system.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the invention comprises the following steps:
in a first aspect, an embodiment of the present invention provides a method for processing speech recognition data, including the following steps:
acquiring target voice recognition data, and performing feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix;
performing dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix;
performing feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result.
Further, in one embodiment of the present invention, the semantic feature vector matrix is obtained by:
extracting a voice characteristic sequence corresponding to the target voice recognition data through a preset filter bank;
acquiring acoustic characterization corresponding to the voice feature in the voice feature sequence through a pre-constructed first encoder;
and mapping the hidden vector of the acoustic characterization to a source language word list through a pre-constructed word embedding matrix to obtain the semantic feature vector matrix.
Further, in one embodiment of the present invention, the emotion feature vector matrix is obtained by:
sequentially performing pre-emphasis, framing, windowing, fast Fourier transformation, triangular window filtering, logarithmic operation and discrete cosine transformation on the target voice recognition data to obtain a Mel frequency cepstrum coefficient of the target voice recognition data;
determining a plurality of voice emotion characteristics according to the mel frequency cepstrum coefficient, and generating the emotion characteristic vector matrix according to the voice emotion characteristics;
the voice emotion characteristics comprise the mean, standard deviation, variance, median, maximum, minimum, quartiles, range, kurtosis and skewness of the mel frequency cepstrum coefficients.
Further, in one embodiment of the present invention, the speech rate feature vector matrix is obtained by:
extracting cepstrum features, fundamental frequency values and energy values of the target voice recognition data;
generating a voice sequence carrying syllable boundary information according to the cepstrum features, and extracting a fundamental frequency value and an energy value corresponding to each syllable according to the fundamental frequency value, the energy value and the syllable boundary information;
and calculating the rhythm characteristics of each syllable according to the fundamental frequency value and the energy value corresponding to each syllable, and further generating a speech speed characteristic vector matrix according to the rhythm characteristics.
Further, in one embodiment of the present invention, the step of performing a dimension reduction process on the semantic feature vector matrix, the emotion feature vector matrix, and the speech rate feature vector matrix to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix, and a speech rate feature low-dimensional matrix specifically includes:
and respectively carrying out dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix by a single-view semi-supervised dimension reduction method to obtain the semantic feature low-dimension matrix, the emotion feature low-dimension matrix and the speech speed feature low-dimension matrix.
Further, in an embodiment of the present invention, the step of performing feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix, and the speech speed feature low-dimensional matrix to obtain a fused feature matrix specifically includes:
normalizing the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a semantic feature normalization matrix, an emotion feature normalization matrix and a speech speed feature normalization matrix;
and carrying out matrix combination on the semantic feature normalization matrix, the emotion feature normalization matrix and the speech speed feature normalization matrix to obtain the fusion feature matrix.
Further, in one embodiment of the present invention, the voice recognition data processing method further includes a step of pre-training the noise detection model, which specifically includes:
acquiring a plurality of preset noise detection sample data, and determining noise label information of each noise detection sample data, wherein the noise detection sample data comprises a fusion characteristic sample matrix of a plurality of voice samples;
constructing a training data set according to the noise detection sample data and the corresponding noise label information;
and inputting the training data set into a pre-constructed convolutional neural network for training to obtain the trained noise detection model.
In a second aspect, an embodiment of the present invention provides a speech recognition data processing system, including:
the feature extraction module is used for obtaining target voice recognition data, and carrying out feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix;
the dimension reduction processing module is used for carrying out dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimension matrix, an emotion feature low-dimension matrix and a speech speed feature low-dimension matrix;
the feature fusion module is used for carrying out feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix;
the noise detection module is used for inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result.
In a third aspect, an embodiment of the present invention provides a voice recognition data processing apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a speech recognition data processing method as described above.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium in which a processor-executable program is stored, which when executed by a processor is configured to perform a speech recognition data processing method as described above.
The advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
According to the embodiment of the invention, target voice recognition data are acquired and feature extraction is performed on them to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix; dimension reduction processing is then performed on these three matrices to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix; feature fusion is performed on the three low-dimensional matrices to obtain a fusion feature matrix; and the fusion feature matrix is input into a pre-trained noise detection model to obtain a noise detection result, according to which the target voice recognition data are denoised. Because feature extraction covers the three dimensions of semantics, emotion and speech speed, and the extracted semantic, emotion and speech speed feature vector matrices undergo dimension reduction and feature fusion, the resulting fusion feature matrix contains semantic, emotional and speech-speed feature representations. Inputting it into the pre-trained noise detection model allows the noise data and redundant voice data in the target voice recognition data to be detected accurately, which improves the comprehensiveness and accuracy of denoising the voice recognition data and, in turn, the efficiency and accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required by the embodiments are described below. It should be understood that the following drawings only illustrate some embodiments of the technical solutions of the present invention and should not be regarded as limiting; those skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flowchart illustrating steps of a method for processing speech recognition data according to an embodiment of the present invention;
FIG. 2 is a block diagram of a speech recognition data processing system according to an embodiment of the present invention;
fig. 3 is a block diagram of a voice recognition data processing device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, "a plurality" means two or more; where "first" and "second" are used to distinguish technical features, they should not be construed as indicating or implying relative importance, the number of the indicated technical features, or the precedence of the indicated technical features. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art.
Referring to fig. 1, an embodiment of the present invention provides a method for processing voice recognition data, which specifically includes the following steps:
s101, acquiring target voice recognition data, and performing feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix.
Further as an alternative embodiment, the semantic feature vector matrix is obtained by:
s1011, extracting a voice characteristic sequence corresponding to the target voice recognition data through a preset filter bank;
s1012, acquiring acoustic characterization corresponding to the voice feature in the voice feature sequence through a pre-constructed first encoder;
s1013, mapping the hidden vector of the acoustic characterization to a source language word list through a pre-constructed word embedding matrix to obtain a semantic feature vector matrix.
Further as an alternative embodiment, the emotion feature vector matrix is obtained by:
s1014, sequentially performing pre-emphasis, framing, windowing, fast Fourier transformation, triangular window filtering, logarithmic operation and discrete cosine transformation on target voice recognition data to obtain a Mel frequency cepstrum coefficient of the target voice recognition data;
s1015, determining a plurality of voice emotion characteristics according to the Mel frequency cepstrum coefficients, and generating an emotion characteristic vector matrix according to the voice emotion characteristics;
the voice emotion characteristics comprise the mean, standard deviation, variance, median, maximum, minimum, quartiles, range, kurtosis and skewness of the mel frequency cepstrum coefficients.
Specifically, pre-emphasis passes the voice signal through a high-pass filter to boost the high-frequency part and flatten the spectrum of the signal, so that the spectrum can be computed with the same signal-to-noise ratio over the whole band from low frequency to high frequency. At the same time, pre-emphasis compensates the high-frequency part of the voice signal that is suppressed by the articulatory system, eliminating the effect of the vocal cords and lips during phonation and highlighting the high-frequency formants.
During framing, N sampling points are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering a period of about 20 to 30 ms. To avoid excessive variation between two adjacent frames, adjacent frames overlap by M sampling points, where M is usually about 1/2 or 1/3 of N. Speech recognition typically uses speech signals sampled at 8 kHz or 16 kHz; at 8 kHz, a frame length of 256 samples corresponds to 32 ms.
Windowing means multiplying each frame by a Hamming window to increase the continuity at the left and right ends of the frame. Assuming the framed signal is S(n), n = 0, 1, …, N−1, where N is the frame size, the windowed signal is S′(n) = S(n) × W(n), where W(n) is the Hamming window, conventionally W(n) = 0.54 − 0.46·cos(2πn/(N−1)).
Since the characteristics of a signal are usually hard to observe from its time-domain waveform, the signal is normally converted into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different voices. After multiplication by the Hamming window, each frame must therefore undergo a fast Fourier transform to obtain its energy distribution over the spectrum: the fast Fourier transform is applied to each framed, windowed signal to obtain the spectrum of each frame, and the squared magnitude of the spectrum gives the power spectrum of the voice signal.
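The following minimal Python sketch mirrors the pre-emphasis, framing, windowing and power-spectrum steps just described, using the 8 kHz, N = 256, roughly 1/2-overlap configuration from the text; the pre-emphasis coefficient of 0.97 is a conventional choice and an assumption here.

```python
# A minimal numpy sketch of pre-emphasis, framing, Hamming windowing and the
# per-frame FFT power spectrum. The coefficient 0.97 is assumed, not stated.
import numpy as np

def frame_power_spectrum(signal, n=256, overlap=128, pre_emph=0.97):
    # pre-emphasis: high-pass filtering that boosts the high-frequency part
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    step = n - overlap
    num_frames = 1 + (len(emphasized) - n) // step
    window = np.hamming(n)          # W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([emphasized[i * step : i * step + n] * window
                       for i in range(num_frames)])
    spectrum = np.fft.rfft(frames, n)       # fast Fourier transform per frame
    return (np.abs(spectrum) ** 2) / n      # power spectrum of each frame

sig = np.random.randn(8000)                 # stand-in for 1 s of 8 kHz speech
power = frame_power_spectrum(sig)           # shape: (num_frames, n//2 + 1)
```

At 8 kHz each 256-sample frame spans 32 ms, matching the figures in the passage above.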
Triangular window filtering passes the energy spectrum through a set of Mel-scale triangular filter banks, with two main purposes: first, smoothing the spectrum and eliminating harmonics to highlight the formants of the original voice, so that the tone or pitch of a segment of speech is not reflected in the MFCC coefficients; in other words, a voice recognition system characterized by MFCC coefficients is not affected by differences in the pitch of the input speech; second, reducing the amount of computation.
The logarithmic energy output by each filter bank is computed, and the MFCC coefficients of the audio sample data are then obtained via the discrete cosine transform (DCT).
A plurality of voice emotion features of the target voice recognition data are then determined from the MFCC coefficients, from which the emotion feature vector matrix is generated.
In some alternative embodiments, 10 statistics of the MFCC coefficients (mean, standard deviation, variance, median, maximum, minimum, quartiles, range, kurtosis and skewness) are used to generate a 240-dimensional column vector as the emotion feature vector matrix.
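As a hedged illustration of this embodiment, the following Python sketch computes the 10 statistics over each MFCC coefficient track; using 24 coefficients gives 24 × 10 = 240 dimensions, matching the 240-dimensional column vector above (the coefficient count of 24 is inferred from 240/10, not stated in the text).

```python
# A sketch of the emotion feature vector: 10 statistics per MFCC coefficient
# track, 24 coefficients x 10 statistics = 240 dimensions. The file name and
# the count of 24 coefficients are assumptions.
import librosa
import numpy as np
from scipy import stats

def emotion_feature_vector(path, n_mfcc=24):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    feats = [
        mfcc.mean(axis=1), mfcc.std(axis=1), mfcc.var(axis=1),
        np.median(mfcc, axis=1), mfcc.max(axis=1), mfcc.min(axis=1),
        np.percentile(mfcc, 75, axis=1) - np.percentile(mfcc, 25, axis=1),  # quartiles (IQR)
        mfcc.max(axis=1) - mfcc.min(axis=1),                 # range
        stats.kurtosis(mfcc, axis=1),                        # kurtosis
        stats.skew(mfcc, axis=1),                            # skewness
    ]
    return np.concatenate(feats)                             # 240-dim column vector

vec = emotion_feature_vector("target_speech.wav")
```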
Further as an alternative embodiment, the speech rate feature vector matrix is obtained by:
s1016, extracting cepstrum features, fundamental frequency values and energy values of target voice recognition data;
s1017, generating a voice sequence carrying syllable boundary information according to the cepstrum features, and extracting a fundamental frequency value and an energy value corresponding to each syllable according to the fundamental frequency value, the energy value and the syllable boundary information;
s1018, calculating the rhythm feature of each syllable according to the fundamental frequency value and the energy value corresponding to each syllable, and further generating a speech speed feature vector matrix according to the rhythm feature.
S102, performing dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix.
Further as an optional implementation manner, the step of performing dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimension matrix, an emotion feature low-dimension matrix and a speech speed feature low-dimension matrix specifically comprises the following steps:
and respectively carrying out dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix by a single-view semi-supervised dimension reduction method to obtain a semantic feature low-dimension matrix, an emotion feature low-dimension matrix and a speech speed feature low-dimension matrix.
Specifically, the voice data processed by the embodiment of the invention mostly comes from noisy, open environments and carries strong signal noise, so effective denoising is the key to accurate subsequent recognition. The single-view semi-supervised dimension reduction method is mainly aimed at denoising and reducing the dimension of single-view data that contains a large amount of unlabeled voice data and a limited number of pairwise constraints. Its objective function can be expressed as:
J(w) = (1/n^2) Σ_{i,j} (w^T x_i − w^T x_j)^2 + (α/n_C^2) Σ_{(x_i,x_j)∈C} (w^T x_i − w^T x_j)^2 − (β/n_M^2) Σ_{(x_i,x_j)∈M} (w^T x_i − w^T x_j)^2
wherein W = [w_1, w_2, …, w_d] denotes the set of mapping vectors, n is the number of samples, M denotes the set of paired samples with a must-link constraint (a pair of samples belonging to the same class), C denotes the set of paired samples with a cannot-link constraint (a pair of samples not belonging to the same class), n_C and n_M are respectively the numbers of cannot-link and must-link constraints, and α and β are contribution proportion parameters balancing the influence of the different constraints on the objective function.
After a series of simplifications and a Laplacian transformation of the above objective function, it can be rewritten as:
J(W) = W^T X L X^T W, where S is the pairwise weight matrix assembled from the terms above, D is the diagonal matrix with D_ii = Σ_j S_ij, and L = D − S is the Laplacian matrix.
The semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix can be subjected to dimension reduction processing respectively by the single-view semi-supervised dimension reduction method to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix.
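A Python sketch of this single-view semi-supervised dimension reduction step is given below; it follows the Laplacian form J(W) = W^T X L X^T W derived above. The particular constraint weighting (the standard SSDR scheme) is an assumption.

```python
# A sketch of single-view semi-supervised dimension reduction using the
# Laplacian form above. The SSDR-style constraint weights are assumed.
import numpy as np

def ssdr(X, must_links, cannot_links, d, alpha=1.0, beta=1.0):
    """X: (features, n) matrix; *_links: lists of sample-index pairs."""
    n = X.shape[1]
    S = np.full((n, n), 1.0 / n**2)          # background pairwise weight
    for i, j in cannot_links:                # push cannot-link pairs apart
        S[i, j] = S[j, i] = 1.0 / n**2 + alpha / max(len(cannot_links), 1)**2
    for i, j in must_links:                  # pull must-link pairs together
        S[i, j] = S[j, i] = 1.0 / n**2 - beta / max(len(must_links), 1)**2
    D = np.diag(S.sum(axis=1))
    L = D - S                                # Laplacian matrix L = D - S
    M = X @ L @ X.T                          # J(w) = w^T (X L X^T) w
    eigvals, eigvecs = np.linalg.eigh(M)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # top-d directions maximize J
    return W.T @ X                           # low-dimensional matrix (d, n)

X = np.random.randn(240, 100)                # e.g. 100 emotion feature vectors
low_dim = ssdr(X, must_links=[(0, 1)], cannot_links=[(0, 2)], d=32)
```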
And S103, carrying out feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix.
Further as an optional implementation manner, the step of performing feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix specifically includes:
s1031, carrying out normalization processing on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a semantic feature normalization matrix, an emotion feature normalization matrix and a speech speed feature normalization matrix;
s1032, carrying out matrix combination on the semantic feature normalization matrix, the emotion feature normalization matrix and the speech speed feature normalization matrix to obtain a fusion feature matrix.
Specifically, in a completely open environment, the data submitted by different users may contain repetition, overlap and similar phenomena caused by accents, speaking habits and other unpredictable factors, so the data needs purification such as decomposition and redundancy removal. This goal can be achieved by extending the single-view semi-supervised dimension reduction method to a multi-view semi-supervised dimension reduction and learning method: single-view semi-supervised learning is performed on each view of the multi-view data to obtain a low-dimensional model of each data view on the basis of the limited prior knowledge, and the low-dimensional models are then fused to learn a consistent low-dimensional representation. The objective function of multi-view semi-supervised dimension reduction is defined over the following quantities: Y ∈ R^{d×n} is the consistent low-dimensional representation, d is the dimension of the consistent low-dimensional representation, P_v is the transformation matrix from the consistent low-dimensional representation to the v-th view model, d_v is the dimension of the v-th view model after single-view semi-supervised dimension reduction, λ is a balance parameter, and l is the number of views. The minimum of the objective function can then be found by alternating optimization.
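Since the exact objective function is not reproduced here, the following Python sketch assumes a least-squares fusion objective, Σ_v ||Y_v − P_v Y||_F^2 + λ(||Y||_F^2 + Σ_v ||P_v||_F^2), consistent with the quantities Y, P_v, d_v, λ and l defined above (the symbol P_v is part of this reconstruction), and minimizes it by the alternating optimization mentioned in the text.

```python
# A hedged sketch of the multi-view fusion step. The least-squares objective
# and the ridge-style alternating updates are assumptions, not the patent's
# stated formula.
import numpy as np

def multiview_fuse(views, d, lam=0.1, iters=50):
    n = views[0].shape[1]
    Y = np.random.randn(d, n)                       # consistent representation
    for _ in range(iters):
        # update each view's transformation matrix P_v with Y fixed
        G = Y @ Y.T + lam * np.eye(d)
        Ps = [Yv @ Y.T @ np.linalg.inv(G) for Yv in views]
        # update Y with all P_v fixed
        A = sum(P.T @ P for P in Ps) + lam * np.eye(d)
        B = sum(P.T @ Yv for P, Yv in zip(Ps, views))
        Y = np.linalg.solve(A, B)
    return Y                                        # (d, n) consistent low-dim model

views = [np.random.randn(32, 100) for _ in range(3)]  # semantic / emotion / speed
Y = multiview_fuse(views, d=24)
```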
After noise and redundancy are removed from the voice data, the data quality is greatly improved, and the voice data is very important for improving the accuracy of subsequent voice recognition and matching.
S104, inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result.
Specifically, the noise detection model of the embodiment of the invention is obtained through convolutional neural network training. The fusion feature matrix obtained above is input into the trained noise detection model to obtain the noise detection result corresponding to the target voice recognition data, and the target voice recognition data can then be accurately and comprehensively denoised based on this result.
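As a hedged illustration (the text does not state the output granularity of the noise detection model), the following Python sketch assumes the model emits a per-frame noise/redundancy probability over the fusion feature matrix and drops the flagged frames; a model such as the convolutional network sketched in the training step below would fit this interface.

```python
# A hedged sketch of S104: per-frame noise probabilities from the trained CNN,
# then removal of the flagged frames. The per-frame output granularity and the
# 0.5 threshold are assumptions.
import numpy as np
import torch

def denoise(fused, model, threshold=0.5):
    """fused: (feat_dim, frames) fusion feature matrix; model: trained CNN."""
    x = torch.from_numpy(fused[None]).float()       # (1, feat_dim, frames)
    with torch.no_grad():
        probs = torch.sigmoid(model(x))[0]          # per-frame noise probability
    keep = (probs < threshold).numpy()              # frames judged clean
    return fused[:, keep]                           # denoised recognition data
```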
Further as an optional implementation manner, the voice recognition data processing method further includes a step of pre-training a noise detection model, which specifically includes:
s201, acquiring a plurality of preset noise detection sample data, and determining noise label information of each noise detection sample data, wherein the noise detection sample data comprises a fusion characteristic sample matrix of a plurality of voice samples;
s202, constructing a training data set according to noise detection sample data and corresponding noise label information;
s203, inputting the training data set into a pre-constructed convolutional neural network for training, and obtaining a trained noise detection model.
Specifically, when the training data set is constructed, the fusion feature sample matrices of a plurality of noise detection sample data are obtained in a test environment, in the same way as the fusion feature matrix described above, which is not repeated here; meanwhile, the noise label information corresponding to each noise detection sample data is determined by manual labeling, and the training data set can then be generated from the noise detection sample data and the corresponding noise label information.
Further as an optional implementation manner, the step of inputting the training data set into a pre-constructed convolutional neural network to perform training to obtain a trained noise detection model specifically includes:
s2031, inputting a training data set into a convolutional neural network to obtain a noise identification result;
s2032, determining a loss value of the convolutional neural network according to the noise identification result and the noise label information;
s2033, updating model parameters of the convolutional neural network through a back propagation algorithm according to the loss value, and returning to the step of inputting the training data set into the convolutional neural network;
and S2034, stopping training when the loss value reaches a preset first threshold value or the iteration number reaches a preset second threshold value, and obtaining a trained noise detection model.
Specifically, after the data in the training data set is input into the initialized convolutional neural network model, the recognition result output by the model, namely the noise recognition result, is obtained, and the accuracy of the model's predictions can be evaluated from the noise recognition result and the noise label information in order to update the parameters of the model. For the noise detection model, the accuracy of a prediction can be measured by a loss function, which is defined on a single training sample and measures its prediction error; concretely, the loss value is determined from the label of the single training sample and the model's prediction for that sample. In actual training, a training data set contains many training samples, so a cost function is generally adopted to measure the overall error of the training data set: the cost function is defined on the whole training data set and computes the average of the prediction errors of all training samples, which better measures the prediction performance of the model. For a general machine learning model, the cost function plus a regularization term measuring the complexity of the model can be used as the training objective function, from which the loss value over the whole training data set is obtained. There are many kinds of common loss functions, such as the 0-1 loss function, squared loss function, absolute loss function, logarithmic loss function and cross-entropy loss function, all of which can be used as the loss function of a machine learning model and are not detailed here. In the embodiment of the invention, one of these loss functions can be selected to determine the training loss value. Based on the training loss value, the parameters of the model are updated by a back-propagation algorithm, and after several rounds of iteration the trained noise detection model is obtained. Specifically, the number of iteration rounds may be preset, or training may be considered complete when the model meets the accuracy requirement on a test set.
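A Python sketch of this training procedure follows; the network layout, the binary cross-entropy loss (one of the loss functions listed above), the Adam optimizer, the stand-in data and both stopping thresholds are illustrative assumptions.

```python
# A sketch of S2031-S2034. Architecture, loss, optimizer, data shapes and the
# stopping thresholds are assumptions chosen for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class NoiseCNN(nn.Module):
    """1-D CNN over the fusion feature matrix; one noise logit per frame."""
    def __init__(self, feat_dim=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=1),
        )

    def forward(self, x):               # x: (batch, feat_dim, frames)
        return self.net(x).squeeze(1)   # (batch, frames) noise logits

# stand-in training set: 64 fusion sample matrices with per-frame noise labels
X = torch.randn(64, 96, 200)
Yl = (torch.rand(64, 200) > 0.8).float()            # manual labels (stand-in)
train_loader = DataLoader(TensorDataset(X, Yl), batch_size=8)

model = NoiseCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()                  # cross-entropy for binary labels
loss_threshold, max_epochs = 0.05, 100              # preset first / second thresholds

for epoch in range(max_epochs):                     # iterate (S2033)
    for fused, labels in train_loader:
        logits = model(fused)                       # S2031: noise recognition result
        loss = criterion(logits, labels)            # S2032: loss value
        optimizer.zero_grad()
        loss.backward()                             # back-propagation update
        optimizer.step()
    if loss.item() < loss_threshold:                # S2034: stop on loss threshold
        break
```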
The method steps of the embodiments of the present invention are described above. It can be understood that the embodiment of the invention performs feature extraction on the target voice recognition data along the three dimensions of semantics, emotion and speech speed, applies dimension reduction and feature fusion to the extracted semantic, emotion and speech speed feature vector matrices to obtain a fusion feature matrix containing semantic, emotional and speech-speed feature representations, and inputs the fusion feature matrix into a pre-trained noise detection model to accurately detect the noise data and redundant voice data in the target voice recognition data, thereby improving the comprehensiveness and accuracy of denoising the voice recognition data and, in turn, the efficiency and accuracy of speech recognition.
Referring to fig. 2, an embodiment of the present invention provides a voice recognition data processing system, including:
the feature extraction module is used for acquiring target voice recognition data, and carrying out feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix;
the dimension reduction processing module is used for carrying out dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix;
the feature fusion module is used for carrying out feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix;
the noise detection module is used for inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
Referring to fig. 3, an embodiment of the present invention provides a voice recognition data processing apparatus, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a speech recognition data processing method as described above.
The content in the method embodiment is applicable to the embodiment of the device, and the functions specifically realized by the embodiment of the device are the same as those of the method embodiment, and the obtained beneficial effects are the same as those of the method embodiment.
The embodiment of the present invention also provides a computer-readable storage medium in which a processor-executable program is stored, which when executed by a processor is for performing a voice recognition data processing method as described above.
The computer readable storage medium of the embodiment of the invention can execute the voice recognition data processing method provided by the embodiment of the method of the invention, and can execute the steps of any combination implementation of the embodiment of the method, thereby having the corresponding functions and beneficial effects of the method.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the present invention has been described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features described above may be integrated in a single physical device and/or software module or one or more of the functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence the part contributing to the prior art or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch and execute the instructions from the instruction execution system, apparatus, or device. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium upon which the program is printed, since the program may be electronically captured, for instance by optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the foregoing description of the present specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A method of processing speech recognition data, comprising the steps of:
acquiring target voice recognition data, and performing feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix;
performing dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix;
performing feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix;
and inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result.
2. The method for processing voice recognition data according to claim 1, wherein the semantic feature vector matrix is obtained by:
extracting a voice characteristic sequence corresponding to the target voice recognition data through a preset filter bank;
acquiring acoustic characterization corresponding to the voice feature in the voice feature sequence through a pre-constructed first encoder;
and mapping the hidden vector of the acoustic characterization to a source language word list through a pre-constructed word embedding matrix to obtain the semantic feature vector matrix.
3. The method for processing speech recognition data according to claim 1, wherein the emotion feature vector matrix is obtained by:
sequentially performing pre-emphasis, framing, windowing, fast Fourier transformation, triangular window filtering, logarithmic operation and discrete cosine transformation on the target voice recognition data to obtain a Mel frequency cepstrum coefficient of the target voice recognition data;
determining a plurality of voice emotion characteristics according to the mel frequency cepstrum coefficient, and generating the emotion characteristic vector matrix according to the voice emotion characteristics;
the voice emotion characteristics comprise the mean, standard deviation, variance, median, maximum, minimum, quartiles, range, kurtosis and skewness of the mel frequency cepstrum coefficients.
4. The method for processing speech recognition data according to claim 1, wherein the speech rate feature vector matrix is obtained by:
extracting cepstrum features, fundamental frequency values and energy values of the target voice recognition data;
generating a voice sequence carrying syllable boundary information according to the cepstrum features, and extracting a fundamental frequency value and an energy value corresponding to each syllable according to the fundamental frequency value, the energy value and the syllable boundary information;
and calculating the rhythm characteristics of each syllable according to the fundamental frequency value and the energy value corresponding to each syllable, and further generating a speech speed characteristic vector matrix according to the rhythm characteristics.
5. The method for processing voice recognition data according to claim 1, wherein the step of performing dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimensional matrix, an emotion feature low-dimensional matrix and a speech speed feature low-dimensional matrix comprises the following steps:
and respectively carrying out dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix by a single-view semi-supervised dimension reduction method to obtain the semantic feature low-dimension matrix, the emotion feature low-dimension matrix and the speech speed feature low-dimension matrix.
6. The method for processing speech recognition data according to claim 1, wherein the step of feature-fusing the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix, and the speech rate feature low-dimensional matrix to obtain a fused feature matrix specifically comprises:
normalizing the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a semantic feature normalization matrix, an emotion feature normalization matrix and a speech speed feature normalization matrix;
and carrying out matrix combination on the semantic feature normalization matrix, the emotion feature normalization matrix and the speech speed feature normalization matrix to obtain the fusion feature matrix.
7. The method according to any one of claims 1 to 6, characterized in that the method further comprises the step of pre-training the noise detection model, which specifically comprises:
acquiring a plurality of preset noise detection sample data, and determining noise label information of each noise detection sample data, wherein the noise detection sample data comprises a fusion characteristic sample matrix of a plurality of voice samples;
constructing a training data set according to the noise detection sample data and the corresponding noise label information;
and inputting the training data set into a pre-constructed convolutional neural network for training to obtain the trained noise detection model.
8. A speech recognition data processing system, comprising:
the feature extraction module is used for obtaining target voice recognition data, and carrying out feature extraction on the target voice recognition data to obtain a semantic feature vector matrix, an emotion feature vector matrix and a speech speed feature vector matrix;
the dimension reduction processing module is used for carrying out dimension reduction processing on the semantic feature vector matrix, the emotion feature vector matrix and the speech speed feature vector matrix to obtain a semantic feature low-dimension matrix, an emotion feature low-dimension matrix and a speech speed feature low-dimension matrix;
the feature fusion module is used for carrying out feature fusion on the semantic feature low-dimensional matrix, the emotion feature low-dimensional matrix and the speech speed feature low-dimensional matrix to obtain a fusion feature matrix;
the noise detection module is used for inputting the fusion feature matrix into a pre-trained noise detection model to obtain a noise detection result, and denoising the target voice recognition data according to the noise detection result.
9. A speech recognition data processing apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
when said at least one program is executed by said at least one processor, said at least one processor is caused to implement a speech recognition data processing method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium in which a processor-executable program is stored, characterized in that the processor-executable program is for performing a speech recognition data processing method according to any one of claims 1 to 7 when being executed by a processor.
CN202311454059.3A 2023-11-02 2023-11-02 Voice recognition data processing method, system, device and storage medium Pending CN117711413A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311454059.3A CN117711413A (en) 2023-11-02 2023-11-02 Voice recognition data processing method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311454059.3A CN117711413A (en) 2023-11-02 2023-11-02 Voice recognition data processing method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN117711413A 2024-03-15

Family

ID=90155938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311454059.3A Pending CN117711413A (en) 2023-11-02 2023-11-02 Voice recognition data processing method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117711413A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210341989A1 (en) * 2018-09-28 2021-11-04 Shanghai Cambricon Information Technology Co., Ltd Signal processing device and related products
WO2022027423A1 (en) * 2020-08-06 2022-02-10 大象声科(深圳)科技有限公司 Deep learning noise reduction method and system fusing signal of bone vibration sensor with signals of two microphones
CN112201249A (en) * 2020-09-29 2021-01-08 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN113257225A (en) * 2021-05-31 2021-08-13 之江实验室 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN113436608A (en) * 2021-06-25 2021-09-24 平安科技(深圳)有限公司 Double-stream voice conversion method, device, equipment and storage medium
CN113807249A (en) * 2021-09-17 2021-12-17 广州大学 Multi-mode feature fusion based emotion recognition method, system, device and medium

Similar Documents

Publication Publication Date Title
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
McLoughlin Line spectral pairs
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Eskimez et al. Front-end speech enhancement for commercial speaker verification systems
Bulut et al. Low-latency single channel speech enhancement using u-net convolutional neural networks
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN113807249B (en) Emotion recognition method, system, device and medium based on multi-mode feature fusion
Yuan A time–frequency smoothing neural network for speech enhancement
Gu et al. Waveform Modeling Using Stacked Dilated Convolutional Neural Networks for Speech Bandwidth Extension.
CN108682432B (en) Speech emotion recognition device
Wang et al. Adversarially learning disentangled speech representations for robust multi-factor voice conversion
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
Li et al. Deep neural network‐based linear predictive parameter estimations for speech enhancement
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
CN113782032B (en) Voiceprint recognition method and related device
Zheng et al. Effects of skip connections in CNN-based architectures for speech enhancement
Goh et al. Robust speech recognition using harmonic features
Gupta et al. High‐band feature extraction for artificial bandwidth extension using deep neural network and H∞ optimisation
CN114495969A (en) Voice recognition method integrating voice enhancement
CN116691699A (en) Driving mode adjusting method, system, device and medium based on emotion recognition
Tai et al. Idanet: An information distillation and aggregation network for speech enhancement
CN116312617A (en) Voice conversion method, device, electronic equipment and storage medium
CN117711413A (en) Voice recognition data processing method, system, device and storage medium
Bouchakour et al. Noise-robust speech recognition in mobile network based on convolution neural networks
Arun Sankar et al. Design of MELPe-based variable-bit-rate speech coding with mel scale approach using low-order linear prediction filter and representing excitation signal using glottal closure instants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination