CN113192497A - Speech recognition method, apparatus, device and medium based on natural language processing

Info

Publication number: CN113192497A
Authority: CN (China)
Prior art keywords: information, pinyin, text, audio, coding sequence
Legal status: Granted
Application number: CN202110467540.0A
Other languages: Chinese (zh)
Other versions: CN113192497B (en)
Inventors: Kang Haimei (康海梅), Wei Tao (魏韬), Ma Jun (马骏), Wang Shaojun (王少军)
Current assignee: Ping An Technology (Shenzhen) Co., Ltd.
Original assignee: Ping An Technology (Shenzhen) Co., Ltd.
Filing: application CN202110467540.0A filed by Ping An Technology (Shenzhen) Co., Ltd.
Events: publication of CN113192497A; application granted; publication of CN113192497B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/26 - Speech to text systems
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition method, apparatus, device and medium based on natural language processing. The method comprises: extracting audio feature information from voice information; analyzing the audio feature information through a confusion network to obtain pinyin information and initial text information; converting the pinyin information and the initial text information respectively to obtain a pinyin coding sequence and an initial character coding sequence; superposing and combining the two sequences to obtain a combined coding sequence; correcting the combined coding sequence according to a text error correction model to obtain an error correction coding sequence; and reversely converting the error correction coding sequence to obtain text recognition information. The invention belongs to the technical field of natural language processing. By analyzing the voice information with the confusion network and correcting the combined coding sequence with the text error correction model to obtain the final text recognition result, text errors in the initially recognized text information can be corrected, greatly improving the accuracy of recognizing the voice information.

Description

Speech recognition method, apparatus, device and medium based on natural language processing
Technical Field
The invention relates to the technical field of natural language processing, belongs to the application scenario of intelligent recognition of voice information based on natural language processing in smart cities, and particularly relates to a speech recognition method, apparatus, device and medium based on natural language processing.
Background
With the rapid development of speech recognition technology, speech recognition methods are widely applied in scenarios such as intelligent voice customer service replacing manual customer service and intelligent voice home devices. A recognition model can be established through speech recognition technology to recognize voice information input by a user and obtain a corresponding recognition result, and a corresponding program is then executed, or corresponding reply information is obtained, according to that result. However, the inventors found that the existing speech recognition technology usually recognizes voice information to obtain corresponding pinyin information and then semantically analyzes the pinyin information to obtain text information. Because of limited matching precision, the recognition result often contains text errors, typically insertion errors, substitution errors and deletion errors, making it difficult to obtain an accurate text recognition result. The existing speech recognition methods therefore struggle to recognize voice information accurately.
Disclosure of Invention
The embodiments of the invention provide a speech recognition method, apparatus, device and medium based on natural language processing, aiming to solve the problem that existing speech recognition methods have difficulty recognizing voice information accurately.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on natural language processing, where the method includes:
if voice information input by a user is received, extracting audio feature information from the voice information according to a preset audio feature extraction model;
analyzing the audio feature information according to a preset confusion network to obtain pinyin information and initial text information;
respectively converting the pinyin information and the initial text information according to a preset conversion dictionary to obtain a corresponding pinyin coding sequence and an initial character coding sequence;
superposing and combining the pinyin coding sequence and the initial character coding sequence to obtain a combined coding sequence of the voice information;
inputting the combined coding sequence into a preset text error correction model for error correction to obtain a corresponding error correction coding sequence;
and carrying out reverse conversion on the error correction coding sequence according to the conversion dictionary to obtain text recognition information corresponding to the voice information.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus based on natural language processing, where the speech recognition apparatus based on natural language processing includes:
the audio feature information extraction unit is used for extracting audio feature information from the voice information according to a preset audio feature extraction model if voice information input by a user is received;
the initial text information acquisition unit is used for analyzing the audio feature information according to a preset confusion network to obtain pinyin information and initial text information;
the coding sequence acquisition unit is used for respectively converting the pinyin information and the initial text information according to a preset conversion dictionary to obtain a corresponding pinyin coding sequence and an initial character coding sequence;
a combined coding sequence obtaining unit, configured to perform superposition combination on the pinyin coding sequence and the initial character coding sequence to obtain a combined coding sequence of the voice information;
the error correction coding sequence acquisition unit is used for inputting the combined coding sequence into a preset text error correction model for error correction to obtain a corresponding error correction coding sequence;
and the text recognition information acquisition unit is used for carrying out reverse conversion on the error correction coding sequence according to the conversion dictionary to obtain text recognition information corresponding to the voice information.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the natural language processing based speech recognition method according to the first aspect.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements the speech recognition method based on natural language processing according to the first aspect.
The embodiments of the invention provide a speech recognition method, apparatus, device and medium based on natural language processing. Audio feature information is extracted from the voice information; the audio feature information is analyzed through a confusion network to obtain pinyin information and initial text information; the pinyin information and the initial text information are converted respectively into a pinyin coding sequence and an initial character coding sequence, which are superposed and combined into a combined coding sequence; the combined coding sequence is corrected according to a text error correction model to obtain an error correction coding sequence; and the error correction coding sequence is reversely converted to obtain text recognition information. By this method, the voice information is analyzed based on the confusion network and the combined coding sequence is corrected through the text error correction model to obtain the final text recognition result, so text errors in the initially recognized text information can be corrected, greatly improving the accuracy of recognizing the voice information.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below cover only some embodiments of the present invention; other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 2 is a sub-flowchart of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 3 is a schematic view of another sub-flow of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 4 is a schematic view of another sub-flow of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 5 is a schematic view of another sub-flow of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 6 is a schematic view of another sub-flow of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 7 is a schematic view of another sub-flow of a speech recognition method based on natural language processing according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of a speech recognition apparatus based on natural language processing according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flowchart of a speech recognition method based on natural language processing according to an embodiment of the present invention. The method is applied to a user terminal or a management server and is executed through application software installed in the user terminal or the management server. The user terminal is a terminal device for intelligently recognizing voice information input by a user, such as a desktop computer, a notebook computer, a tablet computer, a mobile phone, an intelligent voice assistant or a smart speaker; the management server is a server that receives voice information input by a user for intelligent recognition, such as a server built by an enterprise or a government department. As shown in fig. 1, the method includes steps S110 to S160.
S110, if voice information input by a user is received, extracting audio feature information from the voice information according to a preset audio feature extraction model.

If voice information input by a user is received, audio feature information is extracted from it according to a preset audio feature extraction model. The voice information input by the user may be a sentence spoken by the user. The audio feature extraction model includes a spectrum transformation rule, a frequency transformation formula and an inverse transformation rule; the audio feature information may be Mel-frequency cepstral coefficients (MFCC) corresponding to the voice information and can be used to quantitatively represent the audio features of the voice information input by the user.
In an embodiment, as shown in fig. 2, step S110 includes sub-steps S111, S112, S113 and S114.
And S111, performing framing processing on the voice information to obtain corresponding multi-frame audio information.
The voice information is represented in a computer by a spectrogram containing the audio track. The spectrogram contains a plurality of frames, each frame corresponding to one time unit, so each frame of audio information can be obtained from the spectrogram of the voice information, with each frame corresponding to the audio information contained in one time unit.
S112, converting the audio information contained in each preset unit time into a corresponding audio spectrum according to the spectrum transformation rule.

The audio information can be segmented by unit time to obtain a plurality of audio information segments, each segment corresponding to the multi-frame audio information contained in one unit time. A Fast Fourier Transform (FFT) can be performed on each audio information segment according to the spectrum transformation rule, and the result is then rotated by 90 degrees to obtain the audio spectrum corresponding to that segment; the spectrum in the audio spectrum represents the relationship between frequency and energy. For example, the unit time may be set to 0.05 s.
S113, converting each audio spectrum into a corresponding nonlinear audio spectrum according to the frequency transformation formula.

In order to further highlight the sound characteristics in the voice information input by the user, each linearly expressed audio spectrum can be converted into a corresponding nonlinear audio spectrum according to the frequency transformation formula. Both the audio spectrum and the nonlinear audio spectrum can be represented by a spectral curve composed of a plurality of continuous spectral values.
Specifically, the frequency transformation formula can be represented by formula (1):

mel(f) = 2595 × log10(1 + f/700) (1);

where mel(f) is the spectral value of the transformed nonlinear audio spectrum, and f is the frequency value (in Hz) of the audio.
S114, performing inverse transformation on each nonlinear audio spectrum according to the inverse transformation rule to obtain a plurality of audio coefficients corresponding to each nonlinear audio spectrum as the audio feature information.

Each nonlinear audio spectrum can be inversely transformed according to the inverse transformation rule. Specifically, the logarithm of a nonlinear audio spectrum is taken and a Discrete Cosine Transform (DCT) is then performed; the 2nd to 13th DCT coefficients are taken and combined into the audio coefficients corresponding to that nonlinear audio spectrum, so each nonlinear audio spectrum yields a corresponding 12-dimensional audio coefficient vector. Obtaining the audio coefficients corresponding to every nonlinear audio spectrum yields the audio feature information corresponding to the voice information.
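Steps S111 to S114 together describe a standard MFCC-style feature extraction pipeline. The following sketch illustrates the described procedure; the sample rate, 512-point FFT, 26 mel filters and the conventional 2595 × log10 mel constant are illustrative assumptions, not values taken from the patent.

```python
# Minimal sketch of steps S111-S114: segment the signal, take an FFT per
# segment, warp to the nonlinear mel scale, then DCT and keep the
# 2nd-13th coefficients. Parameter values are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)  # formula (1)

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    # Triangular filters spaced uniformly on the mel scale.
    mel_points = np.linspace(0.0, mel(sample_rate / 2.0), n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def extract_audio_features(signal, sample_rate=16000, unit_time=0.05):
    seg_len = int(sample_rate * unit_time)       # one "unit time" segment
    fbank = mel_filterbank(sample_rate=sample_rate)
    features = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        segment = signal[start:start + seg_len]
        spectrum = np.abs(np.fft.rfft(segment, n=512)) ** 2   # audio spectrum
        nonlinear = np.log(fbank @ spectrum + 1e-10)          # mel-warped, log
        coeffs = dct(nonlinear, norm="ortho")[1:13]           # 2nd-13th DCT terms
        features.append(coeffs)                               # 12-dim per segment
    return np.array(features)                                 # audio feature info
```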
S120, analyzing the audio feature information according to a preset confusion network to obtain pinyin information and initial text information.

The audio feature information captures the sound characteristics of the voice information; in order to recognize the voice information, the audio feature information may be analyzed through a confusion network to obtain pinyin information and initial text information. Specifically, the confusion network is a network composed of a plurality of characters, each character carrying its corresponding standard pinyin information, and one piece of standard pinyin information can correspond to a plurality of characters in the confusion network; the standard pinyin "wei", for example, can correspond to several different characters, such as 位 ("position") and 胃 ("stomach"). Each piece of standard pinyin information in turn corresponds to a piece of standard audio feature information. Association relationships are established between characters, and each pair of associated characters carries an association coefficient, which quantifies the degree of association between the two characters.

For example, the characters associated upstream of the character 位 may be 地 and 座 (as in 地位 and 座位), and the characters associated downstream of 位 may be 置 and 于 (as in 位置 and 位于); the association coefficient between 位 and the downstream character 置 may be 1.2, and the association coefficient between 位 and the downstream character 于 may be 0.75.
In an embodiment, as shown in fig. 3, step S120 includes sub-steps S121, S122 and S123.
S121, acquiring a piece of pinyin information matched with the audio feature information according to the correspondence between standard pinyin information and standard audio feature information in the confusion network.

The confusion network contains the correspondence between standard audio features and character pinyins, with each character pinyin corresponding one-to-one to a standard audio feature. The similarity between one character pronunciation in the audio feature information and each standard audio feature can therefore be computed, and the character pinyin corresponding to the most similar standard audio feature is taken as the pinyin of that character pronunciation; combining the pinyins of all character pronunciations in the audio feature information yields a corresponding piece of pinyin information. Specifically, audio segments whose loudness is greater than a loudness threshold can be obtained from the voice information; for example, with the loudness threshold set to 45 dB, the portions of the voice information louder than 45 dB are the audio segments in which the user pronounces the corresponding characters, and each audio segment corresponds to one character pronunciation. The audio coefficients corresponding to each audio segment in the audio feature information are taken as the audio feature of the corresponding character pronunciation, and the similarity between the audio feature of the character pronunciation and a standard audio feature is then calculated, specifically with the following similarity calculation formula (2):
[Formula (2), rendered as an image in the original publication, computes the similarity from the per-dimension audio coefficients of the two features.]

where d^r_j is the j-th dimension audio coefficient of the standard audio feature, d^t_j is the j-th dimension audio coefficient of the character pronunciation's audio feature, and S_d is the calculated similarity.
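Since formula (2) survives only as an image, the sketch below assumes one plausible form, a similarity that decays with the Euclidean distance between the two coefficient vectors; only the variable roles (d^r, d^t, S_d) are taken from the text.

```python
# Hedged sketch of formula (2): the exact expression is an image in the
# source, so a Euclidean-distance-based similarity is assumed here.
import numpy as np

def pronunciation_similarity(d_r, d_t):
    # d_r: audio coefficients of the standard audio feature (12-dim).
    # d_t: audio coefficients of the character pronunciation (12-dim).
    distance = np.sqrt(np.sum((np.asarray(d_r) - np.asarray(d_t)) ** 2))
    return 1.0 / (1.0 + distance)  # S_d: larger means more similar
```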
S122, acquiring a plurality of pieces of selectable text information formed by connecting the characters corresponding to the pinyin information in series in the confusion network, according to the association relationships between standard pinyin information and characters in the confusion network.

One piece of standard pinyin information in the confusion network can correspond to a plurality of characters, so the characters corresponding to each character pinyin in the pinyin information can be obtained from the confusion network. According to the association relationships among the characters in the confusion network, a plurality of paths are then obtained by connecting these characters in series in the order of the character pinyins, and the characters contained in each path are combined into one piece of selectable text information.

For example, for the pinyin information "wèi, zhì, zài, nǎ, lǐ", the pieces of selectable text information matching it in the confusion network may be the homophone candidates rendered as "where is the position", "where is the position again" and "where not reached".
S123, calculating the path similarity of each piece of selectable text information in the confusion network, and taking the piece of selectable text information with the highest path similarity as the initial text information.

The path similarity of each piece of selectable text information is calculated from the association coefficients between the characters in the confusion network. Specifically, the association coefficients along the path corresponding to a piece of selectable text information are multiplied together, and the resulting product is that piece's path similarity. The path similarity quantitatively represents the strength of the association relationships between the characters in the path: the larger its value, the stronger the association between the characters in the path; conversely, the smaller its value, the weaker the association. The piece of selectable text information with the highest path similarity is taken as the initial text information corresponding to the voice information input by the user.

For example, if the path similarity of "where is the position" is 2.15, that of "where is the position again" is 0.77, and that of "where not reached" is 0.43, then "where is the position" is selected as the initial text information.
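The selection in steps S122 and S123 amounts to scoring each candidate path by the product of its association coefficients and keeping the best one. A sketch, assuming the network's coefficients are given as a toy lookup table:

```python
# Sketch of steps S122-S123: multiply the association coefficients along
# each candidate path and keep the highest-scoring candidate. The table
# is a toy stand-in for the confusion network, not the patent's data.
from math import prod

assoc = {("位", "置"): 1.2, ("置", "在"): 0.9, ("在", "哪"): 1.1,
         ("哪", "里"): 1.4}  # (upstream char, downstream char) -> coefficient

def path_similarity(chars, assoc):
    return prod(assoc[(a, b)] for a, b in zip(chars, chars[1:]))

def pick_initial_text(candidates, assoc):
    # candidates: list of character sequences sharing the same pinyin.
    return max(candidates, key=lambda c: path_similarity(c, assoc))

print(path_similarity(list("位置在哪里"), assoc))  # 1.2 * 0.9 * 1.1 * 1.4
```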
In an embodiment, as shown in fig. 5, the method further includes steps S1201, S1202 and S1203 before step S120.
S1201, respectively extracting corresponding standard audio feature information from the standard voice information contained in a pre-stored standard data set according to the audio feature extraction model.

Before the confusion network is used, the standard data set can be analyzed through the audio feature extraction model to construct the confusion network. Specifically, the standard data set is a data set pre-stored in the user terminal or the management server that contains a plurality of pieces of standard data, each comprising a piece of standard text information, a piece of standard pinyin information and a corresponding piece of standard voice information. The standard voice information can be obtained from broadcast and television news and presenters' speech, the standard text information and standard pinyin information can be obtained by manual annotation, and the number of characters contained in the standard text information equals the number of character pinyins contained in the standard pinyin information. The corresponding standard audio feature information can be extracted from the standard voice information according to the audio feature extraction model; the specific method is the same as that for extracting audio feature information from the voice information and is not repeated here.
S1202, acquiring from the standard audio feature information the standard audio feature corresponding to each character pinyin in the standard pinyin information of the standard data set.

Each piece of standard data contains standard pinyin information, and each character pinyin in the standard pinyin information corresponds to one voice segment in the standard voice information, so the audio features corresponding to each voice segment can be obtained from the standard audio feature information. Each character pinyin can correspond to one or more audio features in the standard audio feature information: if a character pinyin corresponds to a plurality of audio features, the average of those audio features in each dimension is calculated to obtain the standard audio feature of that character pinyin; if a character pinyin corresponds to only one audio feature, that audio feature is directly taken as its standard audio feature.
S1203, constructing the confusion network according to the standard text information in the standard data set and the association relationships between the standard audio features and each character pinyin.

Specifically, the standard data set contains a plurality of pieces of standard text information, each formed by combining a plurality of characters, and the confusion network can be constructed from the association relationships among the characters, the standard audio features and the character pinyins. First, the number of characters associated upstream and downstream of each character in the standard text information is counted, and the number ratio of each associated character is obtained. Inputting this number ratio into a preset association coefficient calculation formula gives the association coefficient between the characters, and the association relationships are constructed according to the association coefficients and the upstream-downstream relationships between the characters to form an initial confusion network. A character label is then added to each character in the initial confusion network according to the correspondence between each character in the standard text information and its character pinyin and standard audio feature, completing the confusion network; the character label contains the character pinyin and the standard audio feature. The association coefficient calculation formula can be expressed by formula (3):

[Formula (3), rendered as an image in the original publication, computes the association coefficient from the character number ratio.]

where x is the number ratio of a character associated with a given character, v is a preset parameter value in the formula, and G_x is the calculated association coefficient value.

For example, with v preset to 0.03, if the character 位 appears 65 times downstream of the character 地 in the standard data set and the total number of character occurrences downstream of 地 is 1134, the number ratio is x = 65/1134 ≈ 0.0573; inputting this number ratio into the association coefficient calculation formula yields the corresponding association coefficient G ≈ 1.3658.
S130, respectively converting the pinyin information and the initial text information according to a preset conversion dictionary to obtain a corresponding pinyin coding sequence and initial character coding sequence.

The conversion dictionary is a dictionary for converting character pinyins and characters: each character and each character pinyin can be matched in the conversion dictionary to obtain a corresponding code value. The pinyin information contains a plurality of character pinyins, so it can be converted into a pinyin coding sequence composed of a plurality of code values according to the correspondence between character pinyins and code values in the conversion dictionary; likewise, the initial text information contains a plurality of characters and can be converted into an initial character coding sequence composed of a plurality of code values according to the correspondence between characters and code values in the conversion dictionary.
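The conversion dictionary is effectively a pair of lookup tables. A minimal sketch with invented toy code values (the patent does not disclose the actual dictionary), covering both the forward conversion here and the reverse conversion of step S160:

```python
# Sketch of step S130 (and the reverse conversion of step S160).
# All code values are invented toy entries.
pinyin_to_code = {"wei": 101, "zhi": 102, "zai": 103, "na": 104, "li": 105}
char_to_code = {"位": 201, "置": 202, "在": 203, "哪": 204, "里": 205}
code_to_char = {v: k for k, v in char_to_code.items()}

def to_code_sequence(tokens, table):
    return [table[t] for t in tokens]

def from_code_sequence(codes):
    # Reverse conversion: code values back into characters (step S160).
    return "".join(code_to_char[c] for c in codes)

pinyin_codes = to_code_sequence(["wei", "zhi", "zai", "na", "li"], pinyin_to_code)
char_codes = to_code_sequence(list("位置在哪里"), char_to_code)
assert from_code_sequence(char_codes) == "位置在哪里"
```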
S140, superposing and combining the pinyin coding sequence and the initial character coding sequence to obtain a combined coding sequence of the voice information.

The obtained pinyin coding sequence and initial character coding sequence can be superposed and combined into a combined coding sequence, which simultaneously contains the pinyin code values and the character code values corresponding to the voice information.
In an embodiment, as shown in fig. 4, step S140 includes sub-steps S141, S142 and S143.
S141, adding each pinyin code value in the pinyin coding sequence to the corresponding character code value in the initial character coding sequence to obtain a corresponding first coding sequence.

Specifically, each pinyin code value in the pinyin coding sequence corresponds to a character code value in the initial character coding sequence, with both corresponding to the same character in the initial text information. For each character in the initial text information, the pinyin code value and the character code value are added together, giving the first coding sequence; the number of code values in the first coding sequence equals the number of characters in the initial text information.
S142, sequentially splicing each pinyin code value in the pinyin coding sequence with the corresponding character code value in the initial character coding sequence to obtain a corresponding second coding sequence.

For each character in the initial text information, the pinyin code value and character code value are obtained and spliced in sequence, giving the second coding sequence; the number of code values in the second coding sequence is twice the number of characters in the initial text information.
S143, combining the first coding sequence and the second coding sequence to be used as a corresponding combined coding sequence.
The obtained first coding sequence and second coding sequence are combined into the combined coding sequence; specifically, the second coding sequence can be spliced after the first coding sequence to form the combined coding sequence.
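Concretely, for equal-length code sequences the three sub-steps reduce to element-wise addition, interleaved splicing, and concatenation of the two results, as in this sketch:

```python
# Sketch of steps S141-S143 on two equal-length code sequences.
def combine(pinyin_codes, char_codes):
    first = [p + c for p, c in zip(pinyin_codes, char_codes)]             # S141
    second = [v for pair in zip(pinyin_codes, char_codes) for v in pair]  # S142
    return first + second                                                 # S143

combined = combine([101, 102], [201, 202])
# first = [302, 304]; second = [101, 201, 102, 202]
# combined = [302, 304, 101, 201, 102, 202]
```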
S150, inputting the combined coding sequence into a preset text error correction model for error correction to obtain a corresponding error correction coding sequence.

The text error correction model is a neural network model for correcting the combined coding sequence: it produces the error correction coding sequence obtained after correcting the combined coding sequence, and inversely converting the error correction coding sequence yields the text information obtained after correcting the initial text information. Specifically, the model may be a neural network constructed from a BERT (Bidirectional Encoder Representations from Transformers) network and a natural language processing (NLP) neural network, where the NLP neural network may be built on multi-head self-attention and is formed by combining a plurality of encoders and a plurality of decoders. The combined coding sequence is first input into the BERT network to compute a corresponding characterization vector, and the characterization vector is then input into the NLP neural network to compute the corresponding error correction coding sequence. The BERT network consists of an input layer, a plurality of intermediate layers and an output layer, with adjacent layers connected through association formulas; an association formula can be expressed as y = a × x + b, where a and b are parameter values of the formula, x is its input value and y is its output value. The characterization vector has size (N, M), that is, a vector matrix with N rows and M columns, where N is the number of code values in the combined coding sequence, and each value in the characterization vector lies in the range [0, 1]. The characterization vector is processed by the plurality of encoders and decoders in the NLP neural network to obtain the corresponding error correction coding sequence. The number of code values in the error correction coding sequence may be equal or unequal to the number of characters in the initial text information: if equal, the initial text information contains no error or only substitution errors; if unequal, the initial text information contains an insertion error or a deletion error.
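A minimal sketch of such a two-stage model is given below, using standard Transformer building blocks as stand-ins: the encoder plays the role of the BERT-style representation network and the encoder-decoder stack plays the role of the multi-head self-attention NLP network. Layer counts, dimensions and the PyTorch framing are all assumptions, not the patent's architecture.

```python
# Hedged sketch of the text error correction model described above:
# an encoder produces characterization vectors for the combined coding
# sequence, and a decoder stack emits logits over corrected code values.
import torch
import torch.nn as nn

class TextCorrectionModel(nn.Module):
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the BERT-style representation network.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Stand-in for the multi-head self-attention encoder/decoder stack.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, combined_codes, target_codes):
        memory = self.encoder(self.embed(combined_codes))  # characterization vectors
        hidden = self.decoder(self.embed(target_codes), memory)
        return self.out(hidden)  # logits over code values per position
```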
In one embodiment, as shown in fig. 6, the method further includes steps S151, S152, S153 and S154 before step S150.
Before the text error correction model is used, a model training data set may be constructed from a pre-stored training data set and used to train the initial text error correction model, resulting in a trained text error correction model.
S151, respectively extracting corresponding training audio feature information from the training voice information contained in a pre-stored training data set according to the audio feature extraction model.
The training data set is a data set pre-stored in the user terminal or the management server that contains a plurality of pieces of training data, each comprising a piece of training text information and a corresponding piece of training voice information. The specific method of extracting the corresponding training audio feature information from each piece of training voice information through the audio feature extraction model is the same as that for extracting audio feature information from the voice information and is not repeated here.
S152, analyzing the training audio feature information according to the confusion network to obtain training pinyin information and training prediction text information.
The obtained training audio feature information can be analyzed through the confusion network in the same manner as the audio feature information. Analyzing one piece of training audio feature information yields one corresponding piece of training pinyin information and a plurality of pieces of training prediction text information.
In one embodiment, as shown in fig. 7, step S152 includes sub-steps S1521, S1522 and S1523.
S1521, acquiring a piece of training pinyin information matched with each piece of training audio feature information according to the correspondence between standard pinyin information and standard audio feature information in the confusion network.

The specific method of obtaining the training pinyin information is the same as that of obtaining the pinyin information from the audio feature information and is not repeated here.
S1522, obtaining a plurality of candidate text messages corresponding to each training pinyin message in the confusion network according to the association relationship between the standard pinyin messages and the characters in the confusion network.
Because the characters corresponding to a piece of training pinyin information can be connected in series into a plurality of paths in the confusion network, one piece of training pinyin information yields a plurality of corresponding pieces of candidate text information from the confusion network. The specific method of obtaining the candidate text information is the same as that of obtaining the selectable text information and is not repeated here.
S1523, calculating the path similarity of each candidate text message in the confusion network, and screening the candidate text messages according to the path similarity to obtain a plurality of candidate text messages meeting preset screening conditions as corresponding training prediction text messages.
The path similarity of each piece of candidate text information corresponding to each piece of training pinyin information is calculated, and the pieces of candidate text information meeting a preset screening condition are selected as the training prediction text information corresponding to that piece of training pinyin information.

For example, if the preset screening condition is configured as 10, the 10 pieces of candidate text information with the highest path similarity are obtained from the candidate text information corresponding to each piece of training pinyin information and taken as the 10 pieces of training prediction text information for that piece of training pinyin information.
S153, combining the training pinyin information and the training predicted text information with corresponding training text information in the training data set to obtain a model training data set.
A piece of training text information can be converted into a corresponding piece of target training coding information, and the piece of training pinyin information is combined with each of the corresponding pieces of training prediction text information to obtain a plurality of corresponding pieces of training prediction coding information. Taking the target training coding information as the training target and one of its corresponding pieces of training prediction coding information as the input gives one piece of model training data.

Specifically, a piece of training text information is converted according to the conversion dictionary into a corresponding training character coding sequence, which serves as the target training coding information. The piece of training pinyin information corresponding to that training text information is converted according to the conversion dictionary into a corresponding training pinyin coding sequence, and the pieces of training prediction text information corresponding to that training text information are each converted according to the conversion dictionary into corresponding training prediction character coding sequences. Superposing and combining the training pinyin coding sequence with any one training prediction character coding sequence gives a corresponding piece of training prediction coding information, so processing all the training prediction character coding sequences yields a plurality of pieces of training prediction coding information. Combining one piece of training prediction coding information with the corresponding piece of target training coding information gives one piece of model training data, and a plurality of pieces of model training data are combined into the model training data set.
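Assembling the model training data described here can be sketched as follows, reusing the toy to_code_sequence and combine helpers from the earlier sketches:

```python
# Sketch of step S153: one (input, target) pair per piece of training
# prediction text information; to_code_sequence/combine are the toy
# helpers sketched earlier, not the patent's implementation.
def build_model_training_data(training_pinyin, predicted_texts, target_text):
    pinyin_codes = to_code_sequence(training_pinyin, pinyin_to_code)
    target_codes = to_code_sequence(list(target_text), char_to_code)
    samples = []
    for predicted in predicted_texts:
        pred_codes = to_code_sequence(list(predicted), char_to_code)
        prediction_coding = combine(pinyin_codes, pred_codes)  # model input
        samples.append((prediction_coding, target_codes))      # (input, target)
    return samples
```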
And S154, performing iterative training on the initial text error correction model according to the model training data set to obtain a trained text error correction model.
Specifically, the model training data set contains a plurality of pieces of model training data. Using the training prediction coding information of one piece of model training data as the input of the text error correction model and the corresponding target training coding information as its current training target trains the model once; the plurality of pieces of model training data thus enable repeated iterative training of the text error correction model.
The training of the text error correction model may be implemented based on gradient descent calculation. Specifically, based on the output result obtained by the text error correction model from the training prediction coding information, the loss value between the output result and the corresponding target training coding information can be calculated, for example as the mean squared difference between the two coding sequences:

loss = (1/H) × Σ_{i=1..H} (a_i − b_i)²

where H is the larger of the number of code values contained in the target training coding information and the number contained in the training prediction coding information, a_i is the i-th code value in the target training coding information, and b_i is the i-th code value in the training prediction coding information. Based on the loss value and a gradient calculation formula, an updated value of each parameter in the text error correction model is obtained, and the original value of each parameter is updated with the calculated updated value, completing one training pass of the text error correction model.

Specifically, the gradient calculation formula can be expressed as:

ω'_f = ω_f − η × (∂loss/∂ω_f)

where ω'_f is the calculated updated value of the parameter f, ω_f is the original value of the parameter f, η is the preset learning rate in the gradient calculation formula, and ∂loss/∂ω_f is the partial derivative of the loss value with respect to the parameter f (the calculated value corresponding to the parameter is used in this computation).
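One training pass matching this description can be sketched as follows; the manual parameter update mirrors the gradient formula above, while the differentiable stand-in for the squared-difference loss is an assumption (in practice a cross-entropy loss over code values is more common):

```python
# Sketch of one gradient-descent pass for the TextCorrectionModel
# sketched earlier. The expected-code-value trick makes the squared
# difference differentiable; it is an illustrative assumption.
import torch

def train_step(model, combined_codes, target_codes, lr=1e-3):
    logits = model(combined_codes, target_codes)  # (batch, H, vocab)
    vocab = torch.arange(logits.size(-1), dtype=logits.dtype)
    soft_codes = (logits.softmax(dim=-1) * vocab).sum(dim=-1)  # b_i estimates
    loss = ((target_codes.to(logits.dtype) - soft_codes) ** 2).mean()  # (1/H)·Σ(a_i-b_i)²
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():  # ω'_f = ω_f - η × ∂loss/∂ω_f
            p -= lr * p.grad
    model.zero_grad()
    return loss.item()
```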
S160, reversely converting the error correction coding sequence according to the conversion dictionary to obtain the text recognition information corresponding to the voice information.

The conversion dictionary contains the correspondence between characters and code values, so each code value contained in the error correction coding sequence can be reversely converted into its corresponding character; arranging the characters obtained by reverse conversion in sequence gives the text recognition information corresponding to the voice information.
The technical method can be applied in application scenarios involving intelligent recognition of voice information based on natural language processing, such as intelligent government affairs, intelligent city management, intelligent communities, intelligent security, intelligent logistics, intelligent medical treatment, intelligent education, intelligent environmental protection and intelligent transportation, thereby promoting the construction of smart cities.
In the speech recognition method based on natural language processing provided by the embodiment of the invention, audio feature information is extracted from the voice information; pinyin information and initial text information are obtained through confusion network analysis; the pinyin information and the initial text information are converted respectively into a pinyin coding sequence and an initial character coding sequence, which are superposed and combined into a combined coding sequence; the combined coding sequence is corrected according to the text error correction model to obtain an error correction coding sequence; and the error correction coding sequence is reversely converted to obtain the text recognition information. By this method, the voice information is analyzed based on the confusion network and the combined coding sequence is corrected through the text error correction model to obtain the final text recognition result, so text errors in the initially recognized text information can be corrected, greatly improving the accuracy of recognizing the voice information.
An embodiment of the present invention further provides a speech recognition apparatus 100 based on natural language processing, configured to execute any embodiment of the aforementioned speech recognition method based on natural language processing. Specifically, referring to fig. 8, fig. 8 is a schematic block diagram of the speech recognition apparatus based on natural language processing according to an embodiment of the present invention. The apparatus 100 includes an audio feature information extraction unit 110, an initial text information acquisition unit 120, a coding sequence acquisition unit 130, a combined coding sequence acquisition unit 140, an error correction coding sequence acquisition unit 150 and a text recognition information acquisition unit 160.
The audio feature information extracting unit 110 is configured to, if voice information input by a user is received, extract audio feature information from the voice information according to a preset audio feature extraction model.
In one embodiment, the audio feature information extraction unit 110 includes sub-units: the framing processing unit is used for framing the voice information to obtain corresponding multi-frame audio information; the audio spectrum acquisition unit is used for converting the audio information contained in each preset unit time into a corresponding audio spectrum according to the spectrum transformation rule; the frequency conversion unit is used for converting each audio spectrum into a corresponding nonlinear audio spectrum according to the frequency transformation formula; and the audio feature information acquisition unit is used for inversely transforming each nonlinear audio spectrum according to the inverse transformation rule to obtain a plurality of audio coefficients corresponding to each nonlinear audio spectrum as the audio feature information.
An initial text information obtaining unit 120, configured to analyze the audio feature information according to a preset confusion network to obtain pinyin information and initial text information.
In one embodiment, the initial text information obtaining unit 120 includes sub-units: the pinyin information acquisition unit is used for acquiring a piece of pinyin information matched with the audio feature information according to the correspondence between standard pinyin information and standard audio feature information in the confusion network; the selectable text information acquisition unit is used for acquiring a plurality of pieces of selectable text information formed by connecting the characters corresponding to the pinyin information in series in the confusion network, according to the association relationships between standard pinyin information and characters in the confusion network; and the selectable text information screening unit is used for calculating the path similarity of each piece of selectable text information in the confusion network and taking the piece of selectable text information with the highest path similarity as the initial text information.
In an embodiment, the speech recognition apparatus 100 based on natural language processing further includes the following units: the standard audio feature information acquisition unit is used for respectively extracting corresponding standard audio feature information from the standard voice information contained in a pre-stored standard data set according to the audio feature extraction model; the standard pinyin information acquisition unit is used for acquiring from the standard audio feature information the standard audio feature corresponding to each character pinyin in the standard pinyin information of the standard data set; and the confusion network construction unit is used for constructing the confusion network according to the standard text information in the standard data set and the association relationships between the standard audio features and each character pinyin.
And a code sequence obtaining unit 130, configured to convert the pinyin information and the initial text information according to a preset conversion dictionary to obtain corresponding pinyin code sequences and initial character code sequences.
And a combined coding sequence obtaining unit 140, configured to perform superposition combination on the pinyin coding sequence and the initial character coding sequence to obtain a combined coding sequence of the voice information.
In one embodiment, the combined code sequence obtaining unit 140 includes sub-units: the first coding sequence acquisition unit is used for adding each pinyin coding value in the pinyin coding sequence and a corresponding character coding value in the character coding sequence to obtain a corresponding first coding sequence; the second coding sequence acquisition unit is used for sequentially splicing each pinyin coding value in the pinyin coding sequence and a corresponding character coding value in the character coding sequence to obtain a corresponding second coding sequence; and the coding sequence combination unit is used for combining the first coding sequence and the second coding sequence to be used as a corresponding combined coding sequence.
An error correction coding sequence obtaining unit 150, configured to input the combined coding sequence into a preset text error correction model for error correction to obtain a corresponding error correction coding sequence.
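As a purely hypothetical sketch, the inference step could look like the following, assuming a PyTorch sequence model that maps a combined coding sequence to per-position logits over the character vocabulary.

```python
import torch

def correct(model, combined_codes):
    # Run the combined coding sequence through the trained error
    # correction model and take the most likely code at each position.
    with torch.no_grad():
        logits = model(torch.tensor([combined_codes]))  # add batch dimension
    return logits.argmax(dim=-1).squeeze(0).tolist()    # error correction coding sequence
```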
In an embodiment, the speech recognition device 100 based on natural language processing further includes the following sub-units: a training audio feature information acquisition unit, configured to extract corresponding training audio feature information from the training voice information contained in a pre-stored training data set according to the audio feature extraction model; a training audio feature information parsing unit, configured to parse the training audio feature information according to the confusion network to acquire training pinyin information and training prediction text information; a model training data set acquisition unit, configured to combine the training pinyin information and the training prediction text information with the corresponding training text information in the training data set to obtain a model training data set; and a model training unit, configured to iteratively train an initial text error correction model according to the model training data set to obtain the trained text error correction model.
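The iterative training described above might look like the following hypothetical PyTorch loop. The architecture of the text error correction model is not specified in the patent, so a generic sequence model with a padded cross-entropy objective is assumed.

```python
import torch
import torch.nn as nn

def train_error_correction(model, loader, epochs=10, lr=1e-3):
    # loader yields (combined coding sequence, target character codes) batches.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # assume code 0 is padding
    model.train()
    for _ in range(epochs):  # iterative training over the model training data set
        for combined_codes, target_codes in loader:
            logits = model(combined_codes)           # (batch, seq, vocab)
            loss = loss_fn(logits.transpose(1, 2),   # (batch, vocab, seq)
                           target_codes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```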
In one embodiment, the training audio feature information parsing unit includes: a training pinyin information acquisition unit, configured to acquire a piece of training pinyin information matching each piece of training audio feature information according to the correspondence between the standard pinyin information and the standard audio feature information in the confusion network; a candidate text information acquisition unit, configured to acquire, in the confusion network, a plurality of pieces of candidate text information corresponding to each piece of training pinyin information according to the association relationship between the standard pinyin information and the characters in the confusion network; and a training prediction text information acquisition unit, configured to calculate the path similarity of each piece of candidate text information in the confusion network and screen out, according to the path similarity, the pieces of candidate text information meeting a preset screening condition as the corresponding training prediction text information.
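A small sketch of the screening step follows; the preset screening condition is left open in the patent, so a hypothetical top-k cutoff on path similarity is assumed here.

```python
import heapq

def screen_candidates(scored_candidates, k=5):
    # scored_candidates: list of (path_similarity, candidate_text) pairs.
    best = heapq.nlargest(k, scored_candidates, key=lambda sc: sc[0])
    return [text for _, text in best]  # training prediction text information
```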
A text recognition information obtaining unit 160, configured to perform inverse conversion on the error correction coding sequence according to the conversion dictionary to obtain text recognition information corresponding to the voice information.
The speech recognition device based on natural language processing provided by the embodiment of the invention applies the above speech recognition method based on natural language processing. It extracts the audio feature information of the voice information; obtains pinyin information and initial text information through confusion network parsing; converts the pinyin information and the initial text information respectively to obtain a pinyin coding sequence and an initial character coding sequence; superposes and combines the two to obtain a combined coding sequence; corrects the combined coding sequence according to the text error correction model to obtain an error correction coding sequence; and inversely converts the error correction coding sequence to obtain the text recognition information. In this way, the voice information is parsed on the basis of the confusion network, and the combined coding sequence is corrected by the text error correction model before the final text recognition result is produced, so that text errors in the initially recognized text information can be corrected and the accuracy of recognizing the voice information is greatly improved.
The above-described speech recognition method based on natural language processing may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 9.
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device may be a user terminal or a management server configured to perform the speech recognition method based on natural language processing so as to intelligently recognize voice information.
Referring to fig. 9, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform the speech recognition method based on natural language processing. The storage medium 503 may be a volatile or a non-volatile storage medium.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 may be caused to execute a speech recognition method based on natural language processing.
The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 9 is a block diagram of only the part of the configuration relevant to the present invention and does not limit the computer device 500 to which the present invention is applied; a particular computer device 500 may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the corresponding functions of the speech recognition method based on natural language processing.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 9 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 9, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the processor 502 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, implements the natural language processing-based speech recognition method described above.
It is clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatuses, devices and units described above, which are not repeated here. Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate this interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division of the units is only a logical division; in actual implementation there may be other divisions: units having the same function may be grouped into one unit, a plurality of units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present invention that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a computer-readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned computer-readable storage media include various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for speech recognition based on natural language processing, the method comprising:
if voice information input by a user is received, extracting audio feature information from the voice information according to a preset audio feature extraction model;
analyzing the audio feature information according to a preset confusion network to obtain pinyin information and initial text information;
respectively converting the pinyin information and the initial text information according to a preset conversion dictionary to obtain a corresponding pinyin coding sequence and an initial character coding sequence;
superposing and combining the pinyin coding sequence and the initial character coding sequence to obtain a combined coding sequence of the voice information;
inputting the combined coding sequence into a preset text error correction model for error correction to obtain a corresponding error correction coding sequence;
and performing inverse conversion on the error correction coding sequence according to the conversion dictionary to obtain text recognition information corresponding to the voice information.
2. The speech recognition method based on natural language processing according to claim 1, wherein the audio feature extraction model includes a spectrum conversion rule, a frequency conversion formula and an inverse transformation rule, and the extracting audio feature information from the voice information according to the preset audio feature extraction model includes:
performing framing processing on the voice information to obtain corresponding multi-frame audio information;
converting the audio information contained in each unit time into a corresponding audio spectrum according to a preset unit time and the spectrum conversion rule;
converting each audio spectrum into a corresponding nonlinear audio spectrum according to the frequency conversion formula;
and performing inverse transformation on each nonlinear audio spectrum according to the inverse transformation rule to obtain a plurality of audio coefficients corresponding to each nonlinear audio spectrum as the audio feature information.
3. The speech recognition method based on natural language processing according to claim 1, wherein the analyzing the audio feature information according to the preset confusion network to obtain pinyin information and initial text information comprises:
acquiring a piece of pinyin information matching the audio feature information according to the correspondence between standard pinyin information and standard audio feature information in the confusion network;
acquiring a plurality of pieces of optional text information formed by concatenating the characters corresponding to the pinyin information according to the association relationship between the standard pinyin information and the characters in the confusion network;
and calculating the path similarity of each piece of optional text information in the confusion network, and taking the piece of optional text information with the highest path similarity among the plurality of pieces of optional text information as the initial text information.
4. The speech recognition method based on natural language processing according to claim 1, wherein the superposing and combining the pinyin coding sequence and the initial character coding sequence to obtain the combined coding sequence of the voice information comprises:
adding each pinyin coding value in the pinyin coding sequence to the corresponding character coding value in the initial character coding sequence to obtain a corresponding first coding sequence;
sequentially splicing each pinyin coding value in the pinyin coding sequence with the corresponding character coding value in the initial character coding sequence to obtain a corresponding second coding sequence;
and combining the first coding sequence with the second coding sequence to serve as the corresponding combined coding sequence.
5. The speech recognition method based on natural language processing according to claim 1, wherein before the analyzing the audio feature information according to the preset confusion network to obtain pinyin information and initial text information, the method further comprises:
extracting corresponding standard audio feature information from the standard voice information contained in a pre-stored standard data set according to the audio feature extraction model;
acquiring, from the standard audio feature information, the standard audio features corresponding to each character pinyin in the standard pinyin information of the standard data set;
and constructing the confusion network according to the standard text information in the standard data set and the association relationship between the standard audio features and each character pinyin.
6. The method of claim 5, wherein before inputting the combined code sequence into a preset text error correction model for error correction to obtain a corresponding error correction code sequence, the method further comprises:
extracting corresponding training audio feature information from the training voice information contained in a pre-stored training data set according to the audio feature extraction model;
analyzing the training audio feature information according to the confusion network to obtain training pinyin information and training prediction text information;
combining the training pinyin information and the training prediction text information with corresponding training text information in the training data set to obtain a model training data set;
and performing iterative training on the initial text error correction model according to the model training data set to obtain a trained text error correction model.
7. The method of claim 6, wherein the analyzing the training audio feature information according to the confusion network to obtain training pinyin information and training prediction text information comprises:
acquiring a piece of training pinyin information matching each piece of training audio feature information according to the correspondence between the standard pinyin information and the standard audio feature information in the confusion network;
acquiring a plurality of pieces of candidate text information corresponding to each piece of training pinyin information according to the association relationship between the standard pinyin information and the characters in the confusion network;
and calculating the path similarity of each piece of candidate text information in the confusion network, and screening the plurality of pieces of candidate text information according to the path similarity to obtain the pieces of candidate text information meeting a preset screening condition as the corresponding training prediction text information.
8. A speech recognition apparatus based on natural language processing, comprising:
an audio feature information extraction unit, configured to extract audio feature information from the voice information according to a preset audio feature extraction model if voice information input by a user is received;
an initial text information obtaining unit, configured to analyze the audio feature information according to a preset confusion network to obtain pinyin information and initial text information;
a coding sequence obtaining unit, configured to convert the pinyin information and the initial text information respectively according to a preset conversion dictionary to obtain a corresponding pinyin coding sequence and an initial character coding sequence;
a combined coding sequence obtaining unit, configured to superpose and combine the pinyin coding sequence and the initial character coding sequence to obtain a combined coding sequence of the voice information;
an error correction coding sequence obtaining unit, configured to input the combined coding sequence into a preset text error correction model for error correction to obtain a corresponding error correction coding sequence;
and a text recognition information obtaining unit, configured to perform inverse conversion on the error correction coding sequence according to the conversion dictionary to obtain text recognition information corresponding to the voice information.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech recognition method based on natural language processing according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the speech recognition method based on natural language processing according to any one of claims 1 to 7.
CN202110467540.0A 2021-04-28 2021-04-28 Speech recognition method, device, equipment and medium based on natural language processing Active CN113192497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467540.0A CN113192497B (en) 2021-04-28 2021-04-28 Speech recognition method, device, equipment and medium based on natural language processing

Publications (2)

Publication Number Publication Date
CN113192497A true CN113192497A (en) 2021-07-30
CN113192497B CN113192497B (en) 2024-03-01

Family

ID=76980308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467540.0A Active CN113192497B (en) 2021-04-28 2021-04-28 Speech recognition method, device, equipment and medium based on natural language processing

Country Status (1)

Country Link
CN (1) CN113192497B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040162730A1 (en) * 2003-02-13 2004-08-19 Microsoft Corporation Method and apparatus for predicting word error rates from text
US20170256262A1 (en) * 2016-03-02 2017-09-07 Wipro Limited System and Method for Speech-to-Text Conversion
WO2019085779A1 (en) * 2017-11-01 2019-05-09 阿里巴巴集团控股有限公司 Machine processing and text correction method and device, computing equipment and storage media
US20200082808A1 (en) * 2018-09-12 2020-03-12 Kika Tech (Cayman) Holdings Co., Limited Speech recognition error correction method and apparatus
CN111523306A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
CN112002311A (en) * 2019-05-10 2020-11-27 Tcl集团股份有限公司 Text error correction method and device, computer readable storage medium and terminal equipment
CN110765772A (en) * 2019-10-12 2020-02-07 北京工商大学 Text neural network error correction model after Chinese speech recognition with pinyin as characteristic
CN111859921A (en) * 2020-07-08 2020-10-30 金蝶软件(中国)有限公司 Text error correction method and device, computer equipment and storage medium
CN112000767A (en) * 2020-07-31 2020-11-27 深思考人工智能科技(上海)有限公司 Text-based information extraction method and electronic equipment
CN112115706A (en) * 2020-08-31 2020-12-22 北京字节跳动网络技术有限公司 Text processing method and device, electronic equipment and medium
CN112509581A (en) * 2020-11-20 2021-03-16 北京有竹居网络技术有限公司 Method and device for correcting text after speech recognition, readable medium and electronic equipment
CN112434131A (en) * 2020-11-24 2021-03-02 平安科技(深圳)有限公司 Text error detection method and device based on artificial intelligence, and computer equipment
CN112232062A (en) * 2020-12-11 2021-01-15 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and storage medium
CN112634858A (en) * 2020-12-16 2021-04-09 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium
CN112597753A (en) * 2020-12-22 2021-04-02 北京百度网讯科技有限公司 Text error correction processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高印权 (GAO Yinquan): "Research and Implementation of an Automatic Text Grammar Error Correction Model Based on Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology, no. 7, 15 July 2020 (2020-07-15), pages 138-1394 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836945A (en) * 2021-09-23 2021-12-24 平安科技(深圳)有限公司 Intention recognition method and device, electronic equipment and storage medium
WO2023045186A1 (en) * 2021-09-23 2023-03-30 平安科技(深圳)有限公司 Intention recognition method and apparatus, and electronic device and storage medium
CN113836945B (en) * 2021-09-23 2024-04-16 平安科技(深圳)有限公司 Intention recognition method, device, electronic equipment and storage medium
CN115440225A (en) * 2022-10-25 2022-12-06 仿脑科技(深圳)有限公司 Intelligent voice processing method and system
CN115440225B (en) * 2022-10-25 2023-01-24 仿脑科技(深圳)有限公司 Intelligent voice processing method and system
CN116266266A (en) * 2022-11-08 2023-06-20 美的集团(上海)有限公司 Multi-tone word disambiguation method, device, equipment and storage medium
CN116266266B (en) * 2022-11-08 2024-02-20 美的集团(上海)有限公司 Multi-tone word disambiguation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113192497B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN113192497B (en) Speech recognition method, device, equipment and medium based on natural language processing
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN113850162B (en) Video auditing method and device and electronic equipment
CN111198939B (en) Statement similarity analysis method and device and computer equipment
CN112634858B (en) Speech synthesis method, device, computer equipment and storage medium
CN110472548B (en) Video continuous sign language recognition method and system based on grammar classifier
CN111223481B (en) Information extraction method, information extraction device, computer readable storage medium and electronic equipment
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN112446211A (en) Text processing device, method, apparatus, and computer-readable storage medium
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN113837083A (en) Video segment segmentation method based on Transformer
CN115019773A (en) Voice recognition method and related device, electronic equipment and storage medium
CN112669810B (en) Speech synthesis effect evaluation method, device, computer equipment and storage medium
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
WO2022126969A1 (en) Service voice quality inspection method, apparatus and device, and storage medium
CN112036122B (en) Text recognition method, electronic device and computer readable medium
CN113593606A (en) Audio recognition method and device, computer equipment and computer-readable storage medium
CN113761895A (en) Text abstract generation method and device, electronic equipment and storage medium
CN111126056A (en) Method and device for identifying trigger words
CN113889085A (en) Speech recognition method, apparatus, device, storage medium and program product
CN114974219A (en) Speech recognition method, speech recognition device, electronic apparatus, and storage medium
CN111048065A (en) Text error correction data generation method and related device
CN113345413B (en) Voice synthesis method, device, equipment and medium based on audio feature extraction
CN116631379B (en) Speech recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant