CN114283791A - Speech recognition method based on high-dimensional acoustic features and model training method

Info

Publication number: CN114283791A
Application number: CN202111443194.9A
Authority: CN (China)
Prior art keywords: audio, recognition, model, dimensional, dimensional acoustic
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 郑颖龙, 赖蔚蔚, 吴广财, 郑杰生, 周昉昉, 林嘉鑫, 陈颖璇, 叶杭, 梁运德, 黄宏恩
Current Assignee: Guangdong Electric Power Information Technology Co Ltd
Original Assignee: Guangdong Electric Power Information Technology Co Ltd
Application filed by Guangdong Electric Power Information Technology Co Ltd

Classifications

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a speech recognition method and a model training method based on high-dimensional acoustic features, and relates to the technical field of speech recognition. The method comprises the following steps: acquiring audio to be recognized; acquiring high-dimensional features corresponding to the audio to be recognized based on a pre-trained acoustic feature extraction model, and taking the high-dimensional features as the high-dimensional acoustic features of the audio to be recognized; acquiring the recognition scene corresponding to the audio to be recognized as a target recognition scene; and inputting the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene to obtain a text recognition result corresponding to the audio to be recognized. Because the extracted features carry information that is more useful for speech recognition, the text recognition result obtained from the high-dimensional acoustic features is more accurate, which improves the accuracy of speech recognition; moreover, invoking the speech recognition model that corresponds to the recognition scene of the audio to be recognized makes the recognition more targeted, which further improves the accuracy of the text recognition result.

Description

Speech recognition method based on high-dimensional acoustic features and model training method
Technical Field
The present application relates to the field of speech recognition technology, and more particularly, to a speech recognition method and a model training method based on high-dimensional acoustic features.
Background
Speech recognition is a technology spanning acoustics, linguistics, mathematics and statistics, computer science, artificial intelligence and other disciplines, and is a key link in natural human-machine interaction. Speech recognition techniques recognize the text content of speech spoken by a speaker. Speech recognition technology is applied in many scenarios, such as telephones, mobile phones, applications, access control systems, smart speakers, robots and the like.
In the related art, a speech recognition model for speech recognition is generally trained in advance by means of model training. However, for specific recognition domains such as vertical domains and distinctive accents, the recognition accuracy of such a speech recognition model drops, and the text content of speech in these specific domains cannot be recognized accurately.
Disclosure of Invention
In view of this, the present application provides a speech recognition method and a model training method based on high-dimensional acoustic features.
In a first aspect, an embodiment of the present application provides a speech recognition method based on high-dimensional acoustic features, where the method includes: acquiring audio to be recognized; acquiring a high-dimensional feature corresponding to the audio to be recognized based on a pre-trained acoustic feature extraction model, and taking the high-dimensional feature as the high-dimensional acoustic feature of the audio to be recognized; acquiring the recognition scene corresponding to the audio to be recognized as a target recognition scene; and inputting the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene to obtain a text recognition result corresponding to the audio to be recognized.
In a second aspect, an embodiment of the present application provides a method for training a speech recognition model based on high-dimensional acoustic features, where the method includes: acquiring a first audio sample set, where the first audio samples contained in the first audio sample set are all under the same recognition scene; acquiring high-dimensional features corresponding to the first audio sample set based on a pre-trained voiceprint feature extraction model to obtain a plurality of high-dimensional acoustic features, where each high-dimensional acoustic feature in the plurality of high-dimensional acoustic features corresponds to each first audio sample in the first audio sample set in a one-to-one manner; and training a first initial model based on the plurality of high-dimensional acoustic features until the first initial model meets a first preset condition, so as to obtain a speech recognition model under the recognition scene corresponding to the first audio sample set.
According to the scheme provided by the application, the audio to be recognized is obtained; the high-dimensional features corresponding to the audio to be recognized are acquired based on a pre-trained acoustic feature extraction model and taken as the high-dimensional acoustic features of the audio to be recognized; the recognition scene corresponding to the audio to be recognized is acquired as the target recognition scene; and the high-dimensional acoustic features are input into a pre-trained speech recognition model corresponding to the target recognition scene to obtain the text recognition result corresponding to the audio to be recognized. The high-dimensional acoustic features extracted by the pre-trained acoustic feature extraction model contain more feature information useful for speech recognition, i.e., they characterize the audio to be recognized better, so the text recognition result recognized from them is more accurate and the accuracy of speech recognition is improved; moreover, the speech recognition model corresponding to the recognition scene of the audio to be recognized is invoked to perform the recognition, i.e., a speech recognition model better suited to the audio to be recognized is selected, which makes the recognition more targeted and further improves the accuracy of the text recognition result.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 shows a flowchart of a speech recognition method based on high-dimensional acoustic features according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a speech recognition method based on high-dimensional acoustic features according to another embodiment of the present application.
Fig. 3 shows a network structure diagram of an acoustic feature extraction model in the present application.
Fig. 4 shows a schematic network structure diagram of a transformer module in the present application.
Fig. 5 is a flowchart illustrating a method for training a speech recognition model based on high-dimensional acoustic features according to another embodiment of the present application.
Fig. 6 is a flow chart illustrating the sub-steps of step S330 in fig. 5 in one embodiment.
Fig. 7 is a block diagram of a speech recognition apparatus based on high-dimensional acoustic features according to an embodiment of the present application.
Fig. 8 is a block diagram of a training apparatus for a speech recognition model based on high-dimensional acoustic features according to an embodiment of the present application.
Fig. 9 is a block diagram of a computer device for executing a speech recognition method based on high-dimensional acoustic features according to an embodiment of the present application.
Fig. 10 shows a storage unit for storing or carrying program code implementing a speech recognition method based on high-dimensional acoustic features according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Speech recognition is a technology spanning acoustics, linguistics, mathematics and statistics, computer science, artificial intelligence and other disciplines, and is a key link in natural human-machine interaction. Speech recognition techniques recognize the text content of speech spoken by a speaker. Speech recognition technology is applied in many scenarios, such as telephones, mobile phones, applications, access control systems, smart speakers, robots and the like.
In the related art, a speech recognition model for speech recognition is generally trained in advance by means of model training. However, for specific recognition domains such as vertical domains and distinctive accents, the recognition accuracy of such a speech recognition model drops, and the text content of speech in these specific domains cannot be recognized accurately.
Moreover, because model training depends on massive training data samples, when the speech recognition model needs to be optimized for a narrow range of data, such as a vertical domain or a special accent, the model must be retrained and the parameters of the previously trained model cannot be reused, so the cost of optimization is high.
In order to solve the above problems, the inventors propose a speech recognition method and a model training method based on high-dimensional acoustic features: the high-dimensional acoustic features of the audio to be recognized are extracted based on a pre-trained acoustic feature extraction model, and the high-dimensional acoustic features are then input into the speech recognition model corresponding to the target recognition scene of the audio to be recognized, so as to obtain the text recognition result corresponding to the audio to be recognized. This is described in detail below.
Referring to fig. 1, fig. 1 is a flowchart illustrating a speech recognition method based on high-dimensional acoustic features according to an embodiment of the present application. The speech recognition method based on high-dimensional acoustic features provided by the embodiment of the present application will be described in detail below with reference to fig. 1. The speech recognition method based on the high-dimensional acoustic features can comprise the following steps:
step S110: and acquiring the audio to be identified.
In this embodiment, the audio to be recognized may be captured by an audio acquisition device built into the computer device; it may be audio received from an external audio acquisition device; or it may be audio downloaded over a network, which is not limited in this embodiment.
Step S120: acquiring the high-dimensional features corresponding to the audio to be recognized based on a pre-trained acoustic feature extraction model, and taking the high-dimensional features as the high-dimensional acoustic features of the audio to be recognized.
In this embodiment, after the audio to be recognized is obtained, the audio features of the audio to be recognized may be extracted, the extracted audio features may be input into the pre-trained acoustic feature extraction model, and the high-dimensional features output by the model may be taken as the high-dimensional acoustic features of the audio to be recognized. Here the audio features are two-dimensional spectral features extracted through signal processing, for example Mel-Frequency Cepstral Coefficient (MFCC) features or filter bank (Fbank) features. Understandably, when the two-dimensional spectral features obtained through simple signal processing are input into the acoustic feature extraction model, higher-dimensional acoustic features can be obtained, for example 512-dimensional feature vectors; these contain more feature information than the two-dimensional spectral features and therefore characterize the audio to be recognized better. The high-dimensional acoustic features obtained from the acoustic feature extraction model thus carry more, and more comprehensive, feature information, which can improve the accuracy of subsequent speech recognition based on those features.
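For illustration only, the following is a minimal sketch of this front end, assuming a PyTorch/torchaudio environment; the file name, the hypothetical checkpoint acoustic_feature_extractor.pt, the 80 Mel bins and the 512-dimensional output are assumptions for the sketch, not values fixed by the application.

```python
import torch
import torchaudio

# Load the audio to be recognized (path is illustrative)
waveform, sample_rate = torchaudio.load("utterance.wav")

# Two-dimensional spectral (Fbank) features obtained by ordinary signal processing
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=sample_rate
)  # shape: (num_frames, 80)

# The pre-trained acoustic feature extraction model maps each frame to a
# higher-dimensional acoustic feature, e.g. a 512-dim vector per frame.
feature_model = torch.load("acoustic_feature_extractor.pt")  # hypothetical checkpoint
feature_model.eval()
with torch.no_grad():
    high_dim_features = feature_model(fbank.unsqueeze(0))  # (1, num_frames, 512)
```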
The acoustic feature extraction model may be obtained by modifying and training a transformer model. Specifically, the feature vector output by the decoder of the transformer model may be taken as the high-dimensional acoustic feature, or the feature vector output by the decoder may further undergo encoding, alignment and decoding operations to obtain the high-dimensional acoustic feature, which is not limited in this embodiment.
Step S130: acquiring the recognition scene corresponding to the audio to be recognized as a target recognition scene.
Further, in order to recognize the audio to be recognized in a more targeted way, the recognition scene corresponding to the audio to be recognized may be obtained as the target recognition scene, and the pre-trained speech recognition model corresponding to that target recognition scene is then obtained.
In some embodiments, the recognition scene corresponding to a scene identifier carried by the audio to be recognized may be determined as the target recognition scene, where the scene identifier may be added by the personnel who collect the audio to be recognized. Adding the scene identifier when the audio is collected speeds up obtaining the target recognition scene during speech recognition, so that the speech recognition model suited to the audio to be recognized is obtained more quickly, improving both the efficiency and the accuracy of speech recognition.
In other embodiments, for example when the audio to be recognized is obtained through a network download, it may not carry a scene identifier. In that case, the environmental voiceprint feature of the audio to be recognized may be obtained and matched against a preset voiceprint feature library, and the recognition scene corresponding to the preset voiceprint feature that matches the environmental voiceprint feature is taken as the target recognition scene. Determining the target recognition scene by recognizing the environmental voiceprint features of the audio improves the accuracy of scene recognition, and hence the accuracy of the speech recognition model obtained for the target recognition scene and of the speech recognition itself.
In still other embodiments, an input scene selection instruction may be received, where the scene selection instruction carries a scene identifier, and the recognition scene corresponding to that scene identifier is acquired as the target recognition scene. Thus, for speech captured in complex environments, the user can select the target recognition scene directly, so that a speech recognition model better suited to the audio to be recognized can be obtained and the accuracy of speech recognition is maintained. A sketch combining these three strategies follows.
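The sketch below combines the three scene-determination strategies above into one dispatch function. All names (SCENE_TABLE, match_voiceprint, and so on) are hypothetical placeholders; the application does not prescribe an API.

```python
# Hedged sketch of target-scene determination; all helper names are assumptions.
SCENE_TABLE = {"vertical_domain": "vertical-domain model", "special_accent": "accent model"}

def extract_environmental_voiceprint(audio):
    ...  # assumed front end returning an environmental voiceprint feature

def match_voiceprint(env_print, library):
    ...  # assumed matcher returning a scene key from the preset library, or None

def determine_target_scene(audio, scene_id=None, user_selection=None,
                           voiceprint_library=None, default_scene=None):
    if scene_id is not None:            # 1. identifier attached at collection time
        return SCENE_TABLE[scene_id]
    if user_selection is not None:      # 2. explicit user scene selection
        return SCENE_TABLE[user_selection]
    env_print = extract_environmental_voiceprint(audio)   # 3. voiceprint matching
    matched = match_voiceprint(env_print, voiceprint_library)
    return SCENE_TABLE[matched] if matched is not None else default_scene
```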
Step S140: inputting the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene to obtain a text recognition result corresponding to the audio to be recognized.
Since the high-dimensional acoustic features extracted by the acoustic feature extraction model carry speech context, and the features of the current speech frame should not be affected by other speech frames, the high-dimensional acoustic features need to be normalized within each frame.
In some embodiments, each frame of the multi-frame high-dimensional acoustic features is normalized, where the high-dimensional acoustic features can be normalized by the following formula:

A[i]' = (A[i] - u) / (M - N)

wherein A[i] is the ith feature value in the current audio frame, A[i]' is the normalized value of the ith feature value in the current audio frame, M is the maximum value in the high-dimensional acoustic features of the current audio frame, N is the minimum value in the high-dimensional acoustic features of the current audio frame, and u is the mean of the feature values in the high-dimensional acoustic features of the current audio frame.
Further, the normalized multi-frame high-dimensional acoustic features are input into the pre-trained speech recognition model corresponding to the target recognition scene for speech recognition, and the text recognition result corresponding to the audio to be recognized is obtained. Normalizing each frame of high-dimensional acoustic features before they are input into the speech recognition model ensures that the high-dimensional acoustic features of each audio frame are not influenced by other audio frames, which improves the accuracy of speech recognition; normalization also speeds up the speech recognition model, improving the efficiency of speech recognition.
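As a concrete reading of the formula above, here is a minimal per-frame normalization sketch; the (num_frames, dim) array layout and the epsilon guard against constant frames are assumptions for illustration.

```python
import numpy as np

def normalize_frames(features: np.ndarray) -> np.ndarray:
    """Normalize each frame of high-dimensional acoustic features independently."""
    out = np.empty_like(features, dtype=np.float64)
    for t, frame in enumerate(features):
        m, n, u = frame.max(), frame.min(), frame.mean()
        # A[i]' = (A[i] - u) / (M - N), computed within the frame only,
        # so other audio frames cannot influence the current frame.
        out[t] = (frame - u) / max(m - n, 1e-8)
    return out
```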
The speech recognition model may be a DNN-HMM model, an LSTM-HMM model, a CNN-HMM model, or the like, which is not limited in this embodiment.
In this embodiment, the pre-trained acoustic feature extraction model extracts the high-dimensional acoustic features of the audio to be recognized. Compared with other audio features such as MFCC and Fbank features in the prior art, these contain more feature information useful for speech recognition, i.e., they characterize the audio to be recognized better, so the text recognition result recognized from the high-dimensional acoustic features is more accurate and the accuracy of speech recognition is improved. Moreover, the speech recognition model corresponding to the recognition scene of the audio to be recognized is invoked to perform the recognition, i.e., a speech recognition model better suited to the audio to be recognized is selected, which makes the recognition more targeted and further improves the accuracy of the text recognition result.
Referring to fig. 2, fig. 2 is a flowchart illustrating a speech recognition method based on high-dimensional acoustic features according to another embodiment of the present application. The speech recognition method based on high-dimensional acoustic features provided by the embodiment of the present application will be described in detail below with reference to fig. 2. The speech recognition method based on the high-dimensional acoustic features can comprise the following steps:
step S210: and acquiring the audio to be identified.
In this embodiment, the specific implementation manner of step S210 may refer to the contents in the foregoing embodiments, and is not described herein again.
Step S220: acquiring the high-dimensional features of the audio to be recognized through the feature extraction module of an acoustic feature extraction model, where the acoustic feature extraction model includes a feature extraction module, an encoding module, an alignment module and a decoding module.
In this embodiment, referring to fig. 3, which shows the network structure of the acoustic feature extraction model, the acoustic feature extraction model includes a feature extraction module, an encoding module, an alignment module and a decoding module. The feature extraction module is obtained by modifying a transformer model: referring to fig. 4, the output module of the original transformer model is removed, and only the input module, the encoder and the decoder of the original transformer model are retained. It can be understood that after the audio features of the audio to be recognized are input into the acoustic feature extraction model, the high-dimensional features output by the decoder of the transformer are the high-dimensional features acquired by the feature extraction module. A structural sketch follows.
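As a structural sketch of this four-module pipeline, under the assumption of a PyTorch implementation: the layer types and the 512 dimension are placeholders, since the application does not specify them, and the alignment module is shown as an identity placeholder (an alignment sketch appears after step S240).

```python
import torch.nn as nn

class AcousticFeatureExtractor(nn.Module):
    """Sketch of the model in fig. 3: feature extraction, encoding,
    alignment and decoding modules chained in order."""

    def __init__(self, transformer_backbone: nn.Module, d_model: int = 512):
        super().__init__()
        # Transformer with its output module removed (fig. 4), keeping
        # the input module, encoder and decoder.
        self.feature_extraction = transformer_backbone
        self.encoding = nn.Linear(d_model, d_model)  # encoding module (step S230)
        self.alignment = nn.Identity()               # alignment placeholder (step S240)
        self.decoding = nn.Linear(d_model, d_model)  # decoding module (step S250)

    def forward(self, audio_features):
        high_dim = self.feature_extraction(audio_features)  # step S220
        first_code = self.encoding(high_dim)
        second_code = self.alignment(first_code)
        return self.decoding(second_code)  # high-dimensional acoustic features
```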
Step S230: encoding the high-dimensional features through the encoding module to obtain a first encoding result.
Further, the features output by the decoder of the transformer model are vector features, i.e., features in the form of vectors, while the features input to the speech recognition model need to be matrix features, i.e., features in the form of matrices. If the vector features were input directly into the speech recognition model, recognition could degrade and accuracy would drop; therefore the vector features output by the decoder of the transformer model are converted into matrix features through encoding and decoding, so as to improve the accuracy of speech recognition.
Step S240: aligning the first encoding result through the alignment module to obtain a second encoding result, where the timestamps of the high-dimensional features in the second encoding result are consistent with the timestamps of the audio features corresponding to the audio to be recognized.
In this embodiment, when the audio features of the audio to be recognized are input into the acoustic feature extraction model to extract high-dimensional features, the transformer model may change the timestamp corresponding to each frame of the original audio features; that is, the timestamps of the high-dimensional features output by the feature extraction module may no longer match the timestamps of the audio features as they were input. In some application scenarios the text corresponding to a specific moment of the audio to be recognized could then not be located accurately, which affects the speech recognition result. Therefore, the first encoding result is aligned to obtain the second encoding result, so that the timestamp of each frame of high-dimensional features in the second encoding result is consistent with the timestamp of the corresponding input audio feature; the text corresponding to a given moment can then be located accurately in the speech recognition result, improving the accuracy and pertinence of speech recognition.
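The application does not specify the alignment algorithm; the following sketch uses per-dimension linear interpolation back onto the input frame timestamps as one illustrative way to restore the original time axis.

```python
import numpy as np

def align_to_input_timestamps(first_code: np.ndarray,
                              feature_times: np.ndarray,
                              input_times: np.ndarray) -> np.ndarray:
    """Interpolate encoded features (num_feature_frames, dim) onto the
    timestamps of the original input audio frames."""
    aligned = np.empty((len(input_times), first_code.shape[1]))
    for d in range(first_code.shape[1]):
        aligned[:, d] = np.interp(input_times, feature_times, first_code[:, d])
    return aligned  # second encoding result: one frame per input timestamp
```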
Step S250: decoding the second encoding result through the decoding module to obtain the decoded high-dimensional features, which are used as the high-dimensional acoustic features.
Further, after the aligned second encoding result is obtained, it may be decoded by the decoding module to obtain the decoded high-dimensional features (i.e., matrix features) as the high-dimensional acoustic features.
Step S260: acquiring the recognition scene corresponding to the audio to be recognized as the target recognition scene.
Step S270: inputting the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene to obtain a text recognition result corresponding to the audio to be recognized.
In this embodiment, the detailed implementation of steps S260 to S270 may refer to the content in the foregoing embodiments, and will not be described herein again.
In this embodiment, the high-dimensional features extracted by the feature extraction module of the acoustic feature extraction model undergo encoding and decoding operations, so that the high-dimensional acoustic features output by the acoustic feature extraction model meet the input requirements of the speech recognition model, improving the recognition accuracy of the speech recognition model; moreover, the high-dimensional features are aligned so that the timestamp of each frame of high-dimensional features is consistent with the timestamp of the corresponding input audio feature, which allows the text corresponding to a given moment to be located accurately in the speech recognition result, improving the accuracy and pertinence of speech recognition.
Referring to fig. 5, fig. 5 is a flowchart illustrating a method for training a speech recognition model based on high-dimensional acoustic features according to still another embodiment of the present application. The method for training the speech recognition model based on the high-dimensional acoustic features provided by the embodiment of the present application will be described in detail below with reference to fig. 5. The training method of the speech recognition model based on the high-dimensional acoustic features can comprise the following steps:
step S310: obtaining a first audio sample set, wherein first audio samples contained in the first audio sample set are all under the same identification scene.
In this embodiment, different speech recognition models can be trained for different recognition scenes, so that the speech recognition model corresponding to the recognition scene of the audio to be recognized can be used to recognize it, improving the pertinence and accuracy of speech recognition. It can be understood that if a speech recognition model for a certain recognition scene is to be trained, a large number of audio samples in that recognition scene need to be acquired as training samples; the speech recognition models used in different recognition scenes can therefore be trained separately as required. Specifically, a first audio sample set may be obtained, where the first audio samples contained in the set all belong to the same recognition scene, for example a vertical domain or a special accent, which is not limited in this embodiment.
Step S320: acquiring high-dimensional features corresponding to the first audio sample set based on a pre-trained voiceprint feature extraction model to obtain a plurality of high-dimensional acoustic features, where each high-dimensional acoustic feature in the plurality of high-dimensional acoustic features corresponds to each first audio sample in the first audio sample set in a one-to-one manner.
In this embodiment, after the first audio sample set is obtained, each first audio sample in the first audio sample set is input into a pre-trained voiceprint feature extraction model, so that a plurality of high-dimensional acoustic features can be obtained.
Training the voiceprint feature extraction model may include: obtaining a second audio sample set; and training a second initial model based on the second audio sample set until the second initial model meets a second preset condition, so as to obtain the voiceprint feature extraction model. The second audio sample set should include audio samples from as many recognition scenes as possible, so that the voiceprint feature extraction model trained on it is general and robust, that is, it can accurately extract high-dimensional acoustic features from speech to be recognized in any recognition scene.
In some embodiments, the second initial model may be a transformer model. The second audio sample set is used as input data of the transformer model, and a second text recognition result output by the transformer model for each second audio sample is obtained; the difference between each second text recognition result and the second text information labeled on the corresponding second audio sample is obtained to determine a second recognition loss value; the parameters of the transformer model are adjusted according to the second recognition loss value, and speech recognition is performed on the second audio sample set with the adjusted model, i.e., the transformer model is iteratively trained based on the second recognition loss value until it meets the second preset condition, yielding a target speech recognition model. The linear layer and the softmax classifier of the target speech recognition model in fig. 4 are then removed, and the output of the stripped model is combined with the encoding module, the alignment module and the decoding module to obtain the voiceprint feature extraction model. Understandably, the high recognition rate of the transformer model is exploited to extract the high-dimensional acoustic features of the audio to be recognized, which improves the accuracy of subsequent speech recognition based on those features. A sketch of this construction follows.
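A minimal sketch of this construction, under the assumption that the trained transformer exposes its output head as its last two child modules (a linear layer followed by a softmax classifier); that layout, like the module objects passed in, is illustrative only.

```python
import torch.nn as nn

def build_voiceprint_extractor(trained_transformer: nn.Module,
                               encoding: nn.Module,
                               alignment: nn.Module,
                               decoding: nn.Module) -> nn.Module:
    # Drop the linear layer and softmax classifier (the last two children,
    # by assumption), keeping the transformer backbone as feature extractor.
    backbone = nn.Sequential(*list(trained_transformer.children())[:-2])
    # Chain the retained backbone with the encoding, alignment and decoding
    # modules; a simplification that assumes each takes one tensor input.
    return nn.Sequential(backbone, encoding, alignment, decoding)
```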
The second preset condition may be that the second recognition loss value is smaller than a preset value, that the second recognition loss value no longer changes, that the number of training iterations reaches a preset number, or the like. It can be understood that after the second initial model is iteratively trained over a number of training periods on the second audio sample set, each period containing multiple iterations, the parameters of the second initial model are continuously optimized, so that the second recognition loss value decreases until it stabilizes at a fixed value or falls below the preset value, indicating that the second initial model has converged; of course, the model may also be deemed converged once the number of training iterations reaches the preset number, at which point the output module of the second initial model can be modified to obtain the voiceprint feature extraction model. The preset value and the preset number are set in advance and can be adjusted for different application scenarios, which is not limited in this embodiment.
Step S330: training a first initial model based on the plurality of high-dimensional acoustic features until the first initial model meets a first preset condition, so as to obtain a speech recognition model under the recognition scene corresponding to the first audio sample set.
In some embodiments, referring to fig. 6, step S330 may include the following steps:
step S331: and inputting the plurality of high-dimensional acoustic features into the first initial model, and obtaining a text recognition result corresponding to each high-dimensional acoustic feature in the plurality of high-dimensional acoustic features to obtain a plurality of text recognition results.
Step S332: determining a first recognition loss value of the first initial model based on the plurality of text recognition results and the labeled text information set corresponding to the first audio sample set.
Step S333: performing iterative training on the first initial model according to the first recognition loss value until the first recognition loss value meets the first preset condition, so as to obtain a speech recognition model under the recognition scene corresponding to the first audio sample set.
In this embodiment, after the high-dimensional acoustic features are obtained, they may be normalized, which prevents problems such as slow model convergence caused by large differences in the ranges of the features output by the network, and reduces the time needed to train the model. Further, the normalized high-dimensional acoustic features may be input into the first initial model, whose output is taken as the text recognition results; the difference between each text recognition result and the text information labeled on the corresponding first audio sample in the first audio sample set is obtained to determine the recognition loss value of the first initial model; the parameters of the first initial model are adjusted according to the recognition loss value, and speech recognition is performed on the first audio sample set with the adjusted model, i.e., the first initial model is iteratively trained based on the recognition loss value until the recognition loss value meets the first preset condition, so as to obtain the speech recognition model under the recognition scene corresponding to the first audio sample set.
The first preset condition may be that the first recognition loss value is smaller than a preset value, that the first recognition loss value no longer changes, or that the number of training iterations reaches a preset number. It can be understood that after the first initial model is iteratively trained over a number of training periods on the first audio sample set, each period containing multiple iterations, the parameters of the first initial model are continuously optimized, so that the first recognition loss value decreases until it stabilizes at a fixed value or falls below the preset value, indicating that the first initial model has converged; of course, the model may also be deemed converged once the number of training iterations reaches the preset number, at which point the first initial model can be used as the speech recognition model. The preset value and the preset number are set in advance and can be adjusted for different application scenarios, which is not limited in this embodiment. A training-loop sketch follows.
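A minimal training-loop sketch for steps S331 to S333, assuming a PyTorch model over normalized high-dimensional acoustic features and integer-encoded label sequences; the CTC loss, the Adam optimizer and the loss-threshold stopping rule are illustrative choices, not prescribed by the application.

```python
import torch

def train_scene_model(model, features, feature_lens, labels, label_lens,
                      max_epochs=100, loss_threshold=1e-3):
    """Iteratively train the first initial model until the first recognition
    loss value meets the (assumed) first preset condition."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    ctc = torch.nn.CTCLoss(blank=0)
    for epoch in range(max_epochs):
        log_probs = model(features).log_softmax(-1)    # step S331: recognition results
        loss = ctc(log_probs.transpose(0, 1), labels,  # step S332: first recognition
                   feature_lens, label_lens)           # loss value vs labeled text
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                               # step S333: adjust parameters
        if loss.item() < loss_threshold:               # first preset condition
            break
    return model
```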
In this embodiment, when the recognition rate for data in a specific recognition scene needs to be optimized, such as speech recognition for a specific accent or a specific field, the voiceprint feature extraction model does not need to be retrained; only the high-dimensional acoustic features of the first training sample set need to be extracted through the voiceprint feature extraction model, and the speech recognition model for that specific recognition scene is trained on the extracted high-dimensional acoustic features, which reduces training time and greatly reduces the cost of model optimization. The scheme can be understood as training the speech recognition model by combining a transformer model with a DNN-HMM model: the high-dimensional acoustic features are extracted by exploiting the high recognition rate of the transformer model, and the DNN-HMM model is trained on those high-dimensional acoustic features to obtain the final speech recognition model.
Referring to fig. 7, a block diagram of a speech recognition apparatus 400 based on high-dimensional acoustic features according to an embodiment of the present application is shown. The apparatus 400 may include: an audio acquisition module 410, a high-dimensional feature extraction module 420, a scene determination module 430, and an audio recognition module 440.
The audio obtaining module 410 is used for obtaining the audio to be recognized.
The high-dimensional feature extraction module 420 is configured to obtain, based on a pre-trained acoustic feature extraction model, a high-dimensional feature corresponding to the audio to be recognized, as the high-dimensional acoustic feature of the audio to be recognized.
The scene determining module 430 is configured to obtain the recognition scene corresponding to the audio to be recognized as the target recognition scene.
The audio recognition module 440 is configured to input the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene, so as to obtain a text recognition result corresponding to the audio to be recognized.
In some embodiments, the acoustic feature extraction model includes a feature extraction module, an encoding module, an alignment module, and a decoding module, and the high-dimensional feature extraction module 420 may include: a feature extraction unit, an encoding unit, an alignment unit and a decoding unit. The feature extraction unit may be configured to obtain, through the feature extraction module, the high-dimensional features of the audio to be recognized. The encoding unit may be configured to encode the high-dimensional features through the encoding module to obtain a first encoding result. The alignment unit may be configured to perform alignment processing on the first encoding result through the alignment module to obtain a second encoding result, where the timestamps of the high-dimensional features in the second encoding result are consistent with the timestamps of the audio features corresponding to the audio to be recognized. The decoding unit may be configured to decode the second encoding result through the decoding module to obtain the decoded high-dimensional features, which are used as the high-dimensional acoustic features.
In some embodiments, the high-dimensional acoustic features comprise multiple frames, and the audio recognition module 440 includes: a normalization unit and a recognition unit. The normalization unit may be configured to normalize each frame of high-dimensional acoustic features in the multi-frame high-dimensional acoustic features. The recognition unit may be configured to input the normalized multi-frame high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene for speech recognition, so as to obtain a text recognition result corresponding to the audio to be recognized.
In this manner, the normalization unit may be specifically configured to normalize each frame according to:

A[i]' = (A[i] - u) / (M - N)

wherein A[i] is the ith feature value in each frame of high-dimensional acoustic features, A[i]' is the normalized value of the ith feature value in each frame of high-dimensional acoustic features, M is the maximum value in each frame of high-dimensional acoustic features, N is the minimum value in each frame of high-dimensional acoustic features, and u is the mean of the feature values in each frame of high-dimensional acoustic features.
In some implementations, the scene determination module 430 may include: an environmental feature acquisition unit, a judgment unit and a scene determination unit. The environmental feature acquisition unit may be configured to obtain the environmental voiceprint feature of the audio to be recognized. The judgment unit may be configured to judge whether a preset voiceprint feature matching the environmental voiceprint feature exists in a preset voiceprint feature library. The scene determination unit may be configured to, if a preset voiceprint feature matching the environmental voiceprint feature exists in the preset voiceprint feature library, acquire the recognition scene corresponding to that preset voiceprint feature as the target recognition scene.
Referring to fig. 8, a block diagram of a training apparatus 500 for a speech recognition model based on high-dimensional acoustic features according to an embodiment of the present application is shown. The apparatus 500 may comprise: a training sample acquisition module 510, a high-dimensional feature extraction module 520, and a model training module 530.
The training sample obtaining module 510 is configured to obtain a first audio sample set, where the first audio samples contained in the first audio sample set are all under the same recognition scene.
The high-dimensional feature extraction module 520 is configured to obtain, based on a pre-trained voiceprint feature extraction model, a high-dimensional feature corresponding to the first audio sample set to obtain multiple high-dimensional acoustic features, where each high-dimensional acoustic feature in the multiple high-dimensional acoustic features corresponds to each first audio sample in the first audio sample set one to one.
The model training module 530 is configured to train a first initial model based on the multiple high-dimensional acoustic features until the first initial model meets a first preset condition, so as to obtain a speech recognition model in a recognition scene corresponding to the first audio sample set.
In some embodiments, the training apparatus 500 for a speech recognition model based on high-dimensional acoustic features may further include a feature extraction model training module. The feature extraction model training module may be configured to obtain a second audio sample set, and to train a second initial model based on the second audio sample set until the second initial model meets a second preset condition, so as to obtain the voiceprint feature extraction model.
In this manner, the second initial model is a transformer model, and the feature extraction model training module may be specifically configured to: train the transformer model based on the second audio sample set until the transformer model meets the second preset condition, and remove the linear layer and the classifier in the transformer model to obtain the feature extraction module; and generate the voiceprint feature extraction model based on the feature extraction module, the encoding module, the alignment module and the decoding module.
In some embodiments, the model training module 530 may include: a feature input unit, a loss value determination unit and an iterative training unit. The feature input unit may be configured to input the plurality of high-dimensional acoustic features into the first initial model and obtain a text recognition result corresponding to each of the high-dimensional acoustic features, yielding a plurality of text recognition results. The loss value determination unit may be configured to determine the recognition loss value of the first initial model based on the plurality of text recognition results and the labeled text information set corresponding to the first audio sample set. The iterative training unit may be configured to iteratively train the first initial model according to the recognition loss value until the recognition loss value meets the first preset condition, so as to obtain the speech recognition model under the recognition scene corresponding to the first audio sample set.
In this manner, the first preset condition includes that the first recognition loss value is smaller than a preset value, the first recognition loss value does not change any more, or the number of times of iterative training reaches a preset number of times.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or of another form.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
In summary, in the scheme provided by the embodiments of the present application, the audio to be recognized is obtained; the high-dimensional features corresponding to the audio to be recognized are acquired based on a pre-trained acoustic feature extraction model and taken as the high-dimensional acoustic features of the audio to be recognized; the recognition scene corresponding to the audio to be recognized is acquired as the target recognition scene; and the high-dimensional acoustic features are input into a pre-trained speech recognition model corresponding to the target recognition scene to obtain the text recognition result corresponding to the audio to be recognized. Compared with other audio features such as MFCC and Fbank features in the prior art, the high-dimensional acoustic features extracted by the pre-trained acoustic feature extraction model contain more feature information useful for speech recognition, i.e., they characterize the audio to be recognized better, so the text recognition result recognized from them is more accurate and the accuracy of speech recognition is improved; moreover, the speech recognition model corresponding to the recognition scene of the audio to be recognized is invoked to perform the recognition, i.e., a speech recognition model better suited to the audio to be recognized is selected, which makes the recognition more targeted and further improves the accuracy of the text recognition result.
A computer device provided by the present application will be described with reference to fig. 9.
Referring to fig. 9, fig. 9 shows a block diagram of a computer device 600 according to an embodiment of the present application, and the method according to the embodiment of the present application may be executed by the computer device 600. The computer device 600 may be a smart phone, a tablet computer, a smart watch, a notebook computer, a desktop computer, a server, a recording pen, or other devices capable of running an application program.
The computer device 600 in the embodiments of the present application may include one or more of the following components: a processor 601, a memory 602, and one or more application programs, where the one or more application programs may be stored in the memory 602 and configured to be executed by the one or more processors 601, and configured to perform the methods described in the foregoing method embodiments.
Processor 601 may include one or more processing cores. The processor 601 connects various parts throughout the computer device 600 using various interfaces and lines, and performs the various functions of the computer device 600 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 602 and calling data stored in the memory 602. Optionally, the processor 601 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 601 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 601 and may instead be implemented by a separate communication chip.
The Memory 602 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 602 may be used to store instructions, programs, code sets, or instruction sets. The memory 602 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the method embodiments described above, and the like. The data storage area may also store data created by the computer device 600 in use (such as the various correspondences described above), and so on.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Referring to fig. 10, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable medium 700 has stored therein program code that can be called by a processor to perform the methods described in the above-described method embodiments.
The computer-readable storage medium 700 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 700 includes a non-transitory computer-readable storage medium. The computer readable storage medium 700 has storage space for program code 710 to perform any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 710 may be compressed, for example, in a suitable form.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method for speech recognition based on high-dimensional acoustic features, the method comprising:
acquiring audio to be recognized;
acquiring a high-dimensional feature corresponding to the audio to be recognized based on a pre-trained acoustic feature extraction model, and taking the high-dimensional feature as the high-dimensional acoustic feature of the audio to be recognized;
acquiring a recognition scene corresponding to the audio to be recognized as a target recognition scene;
and inputting the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene to obtain a text recognition result corresponding to the audio to be recognized.
2. The method according to claim 1, wherein the acoustic feature extraction model includes a feature extraction module, an encoding module, an alignment module and a decoding module, and the obtaining, based on the acoustic feature extraction model trained in advance, the high-dimensional features corresponding to the audio to be recognized as the high-dimensional acoustic features of the audio to be recognized includes:
acquiring the high-dimensional features of the audio to be recognized through the feature extraction module;
encoding the high-dimensional features through the encoding module to obtain a first encoding result;
aligning the first encoding result through the alignment module to obtain a second encoding result, wherein the timestamps of the high-dimensional features in the second encoding result are consistent with the timestamps of the audio features corresponding to the audio to be recognized;
and decoding the second encoding result through the decoding module to obtain decoded high-dimensional features, which are used as the high-dimensional acoustic features.
3. The method according to claim 1 or 2, wherein the high-dimensional acoustic features comprise multiple frames, and the inputting the high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene to obtain a text recognition result corresponding to the audio to be recognized comprises:
normalizing each frame of high-dimensional acoustic features in the multi-frame high-dimensional acoustic features;
and inputting the normalized multi-frame high-dimensional acoustic features into a pre-trained speech recognition model corresponding to the target recognition scene for speech recognition, and obtaining a text recognition result corresponding to the audio to be recognized.
4. The method of claim 3, wherein normalizing each frame of high-dimensional acoustic features in the plurality of frames of high-dimensional acoustic features comprises:
A[i]' = (A[i] - u) / (M - N)

wherein A[i] is the ith feature value in each frame of high-dimensional acoustic features, A[i]' is the normalized value of the ith feature value in each frame of high-dimensional acoustic features, M is the maximum value in each frame of high-dimensional acoustic features, N is the minimum value in each frame of high-dimensional acoustic features, and u is the mean of the feature values in each frame of high-dimensional acoustic features.
5. The method according to claim 1, wherein the acquiring the recognition scene corresponding to the audio to be recognized as a target recognition scene includes:
acquiring an environmental voiceprint feature of the audio to be recognized;
judging whether a preset voiceprint feature matching the environmental voiceprint feature exists in a preset voiceprint feature library;
and if so, acquiring a recognition scene corresponding to the preset voiceprint feature matching the environmental voiceprint feature as the target recognition scene.
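For illustration, a minimal sketch of the claim-5 scene lookup; the claim does not fix the matching criterion, so the cosine-similarity measure and the threshold value here are assumptions:

```python
import numpy as np

def match_scene(env_voiceprint, voiceprint_library, threshold=0.8):
    """Return the recognition scene whose preset voiceprint matches the
    environmental voiceprint, or None when no preset feature matches."""
    best_scene, best_score = None, threshold
    for scene, preset in voiceprint_library.items():
        # Cosine similarity between the environmental and preset voiceprints.
        score = np.dot(env_voiceprint, preset) / (
            np.linalg.norm(env_voiceprint) * np.linalg.norm(preset))
        if score > best_score:
            best_scene, best_score = scene, score
    return best_scene
```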
6. A method for training a speech recognition model based on high-dimensional acoustic features, the method comprising:
acquiring a first audio sample set, wherein the first audio samples contained in the first audio sample set are all under the same recognition scene;
acquiring high-dimensional features corresponding to the first audio sample set based on a pre-trained voiceprint feature extraction model to obtain a plurality of high-dimensional acoustic features, wherein the plurality of high-dimensional acoustic features correspond one-to-one to the first audio samples in the first audio sample set;
training a first initial model based on the plurality of high-dimensional acoustic features until the first initial model meets a first preset condition, so as to obtain a speech recognition model for the recognition scene corresponding to the first audio sample set.
7. The method of claim 6, wherein the training process of the voiceprint feature extraction model comprises:
acquiring a second audio sample set;
and training a second initial model based on the second audio sample set until the second initial model meets a second preset condition to obtain the voiceprint feature extraction model.
8. The method of claim 7, wherein the second initial model is a transformer model, and the training the second initial model based on the second audio sample set until the second initial model satisfies a second preset condition to obtain the voiceprint feature extraction model comprises:
training the transformer model based on the second audio sample set until the transformer model meets the second preset condition, and removing a linear layer and a classifier in the transformer model to obtain a feature extraction module;
and generating the voiceprint feature extraction model based on the feature extraction module, the encoding module, the alignment module and the decoding module.
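For illustration, a minimal sketch of the claim-8 head removal, assuming a PyTorch transformer classifier; the layer sizes, class count, and attribute names are placeholder assumptions:

```python
import torch.nn as nn

class TransformerClassifier(nn.Module):
    """Stand-in for the trained transformer model of claim 8."""
    def __init__(self, d_model=256, n_heads=4, n_layers=6, n_classes=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.linear = nn.Linear(d_model, d_model)        # removed after training
        self.classifier = nn.Linear(d_model, n_classes)  # removed after training

    def forward(self, x):
        return self.classifier(self.linear(self.backbone(x)))

# After training meets the second preset condition, discard the linear
# layer and classifier and keep only the backbone as the feature
# extraction module of claim 8:
# feature_extractor = trained_model.backbone
```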
9. The method of claim 6, wherein the training the first initial model based on the plurality of high-dimensional acoustic features until the first initial model satisfies the first preset condition to obtain the speech recognition model for the recognition scene corresponding to the first audio sample set comprises:
inputting the plurality of high-dimensional acoustic features into the first initial model, and obtaining a text recognition result corresponding to each high-dimensional acoustic feature in the plurality of high-dimensional acoustic features, so as to obtain a plurality of text recognition results;
determining a first recognition loss value of the first initial model based on the plurality of text recognition results and the labeled text information set corresponding to the first audio sample set;
and performing iterative training on the first initial model according to the first recognition loss value until the first recognition loss value satisfies the first preset condition, so as to obtain the speech recognition model for the recognition scene corresponding to the first audio sample set.
10. The method of claim 9, wherein the first preset condition comprises the first recognition loss value being less than a predetermined value, the first recognition loss value no longer changing, or a predetermined number of training iterations being reached.
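For illustration, a minimal training loop covering the loss-driven iteration of claims 9 and 10; the cross-entropy loss, the optimizer, and all hyperparameter values are assumptions, since the claims fix only the stopping conditions:

```python
import torch

def train_scene_model(model, high_dim_feats, labels,
                      max_iters=10000, loss_threshold=0.01, lr=1e-3):
    """Iterate until the first recognition loss value meets the first
    preset condition (here: below a preset value, or an iteration cap)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()  # assumed recognition loss
    for step in range(max_iters):         # capped iteration count (claim 10)
        logits = model(high_dim_feats)    # text recognition results
        loss = criterion(logits, labels)  # first recognition loss value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:  # loss below the preset value
            break
    return model
```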
CN202111443194.9A 2021-11-30 2021-11-30 Speech recognition method based on high-dimensional acoustic features and model training method Pending CN114283791A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111443194.9A CN114283791A (en) 2021-11-30 2021-11-30 Speech recognition method based on high-dimensional acoustic features and model training method


Publications (1)

Publication Number Publication Date
CN114283791A true CN114283791A (en) 2022-04-05

Family

ID=80870575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111443194.9A Pending CN114283791A (en) 2021-11-30 2021-11-30 Speech recognition method based on high-dimensional acoustic features and model training method

Country Status (1)

Country Link
CN (1) CN114283791A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052725A (en) * 2023-03-31 2023-05-02 四川大学华西医院 Fine granularity borborygmus recognition method and device based on deep neural network
CN116052725B (en) * 2023-03-31 2023-06-23 四川大学华西医院 Fine granularity borborygmus recognition method and device based on deep neural network

Similar Documents

Publication Publication Date Title
CN108520741B (en) Method, device and equipment for restoring ear voice and readable storage medium
CN111667814B (en) Multilingual speech synthesis method and device
CN111009237B (en) Voice recognition method and device, electronic equipment and storage medium
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
CN111292764A (en) Identification system and identification method
CN113488024B (en) Telephone interrupt recognition method and system based on semantic recognition
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN115062143A (en) Voice recognition and classification method, device, equipment, refrigerator and storage medium
CN112837669A (en) Voice synthesis method and device and server
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN114387945A (en) Voice generation method and device, electronic equipment and storage medium
CN114283791A (en) Speech recognition method based on high-dimensional acoustic features and model training method
CN114360561A (en) Voice enhancement method based on deep neural network technology
CN111414959B (en) Image recognition method, device, computer readable medium and electronic equipment
CN112614482A (en) Mobile terminal foreign language translation method, system and storage medium
CN115104151A (en) Offline voice recognition method and device, electronic equipment and readable storage medium
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN117079671A (en) Audio processing method, device, computer equipment and storage medium
CN116741155A (en) Speech recognition method, training method, device and equipment of speech recognition model
CN111414748A (en) Traffic data processing method and device
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN116127966A (en) Text processing method, language model training method and electronic equipment
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN116186258A (en) Text classification method, equipment and storage medium based on multi-mode knowledge graph
CN113257238B (en) Training method of pre-training model, coding feature acquisition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination