CN113327586A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113327586A
CN113327586A
Authority
CN
China
Prior art keywords
audio
audio signal
phoneme sequence
data
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110610069.6A
Other languages
Chinese (zh)
Other versions
CN113327586B (en)
Inventor
汪雪
黄石磊
程刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd
Priority to CN202110610069.6A
Publication of CN113327586A
Application granted
Publication of CN113327586B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques where the extracted parameters are the cepstrum
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a voice recognition method, comprising the following steps: acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data; performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal; and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data. In addition, the application also provides a voice recognition apparatus, an electronic device, and a computer-readable storage medium. The method and apparatus can improve the accuracy of voice recognition.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
In recent years, machine learning has developed rapidly, and speech recognition has made major breakthroughs under deep learning. Although the traditional speech recognition framework already delivers stable industrial-grade recognition, with the introduction of deep learning, users in the era of intelligent big data are no longer satisfied with limited model accuracy and expect speech recognition to handle more complex data.
At present, speech recognition is usually realized with a speech recognition model based on an attention mechanism. Such a model places extremely high demands on the quality of the speech to be recognized, yet actual service scenes produce speech data recorded in varied noise environments, such as data with accents and dialects, noisy scenes, and far-field recordings. This degrades the recognition capability of the attention-based model and thus the accuracy of speech recognition.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, the present application provides a speech recognition method, apparatus, electronic device, and computer-readable storage medium, which can improve the accuracy of speech recognition.
In a first aspect, the present application provides a speech recognition method, including:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
According to the method, the characteristic data of the audio data can first be extracted through spectrum analysis of the audio data, which reduces the complexity of the audio data and improves the accuracy of subsequent analysis. Secondly, the method performs feature extraction and phoneme recognition on the Mel cepstrum of the audio data with a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition, which strengthens the audio recognition model's resistance to interference from complex audio data and further improves analysis accuracy. Therefore, compared with the prior art, the method can enhance the model's anti-interference capability on audio data and improve the accuracy of speech recognition.
In one possible implementation manner of the first aspect, the performing a spectrum analysis on the audio data to generate a mel-frequency cepstrum of the audio data includes:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing Mel spectrum filtering on the spectrogram, and performing cepstrum analysis on the Mel spectrum filtered spectrogram to obtain an initial Mel cepstrum of the audio data;
and performing discrete transformation on the initial Mel cepstrum to obtain a Mel cepstrum of the audio data.
In a possible implementation manner of the first aspect, before performing the feature extraction on the Mel cepstrum by using the pre-trained audio recognition model, the method further includes:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting the parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets a preset condition, obtaining a trained audio recognition model.
In one possible implementation manner of the first aspect, the inputting the model training data into a convolution module of the audio recognition model to output a second feature audio signal of the model training data includes:
carrying out convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to carry out linear adjustment on the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the initial characteristic audio signal after the dimension reduction by using a full connection layer in the convolution module to obtain the second characteristic audio signal.
In one possible implementation manner of the first aspect, the recognizing, by the phoneme recognition module of the audio recognition model, a second phoneme sequence of the second characteristic audio signal includes:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
In one possible implementation manner of the first aspect, the calculating a training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence includes:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In a possible implementation manner of the first aspect, the performing text extraction on the phoneme sequence includes:
calculating a text generation probability from the phoneme sequence;
and identifying a text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
In a second aspect, the present application provides a speech recognition apparatus comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a Mel cepstrum of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data.
In a third aspect, the present application provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method according to any of the first aspects above.
The beneficial effects of the second to fourth aspects can be referred to the related description of the first aspect, and are not repeated herein.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; for those of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a detailed flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a step of the speech recognition method provided in FIG. 1 according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another step of the speech recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating further steps of a speech recognition method provided in FIG. 1 according to an embodiment of the present application;
fig. 5 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic internal structural diagram of an electronic device implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A speech recognition method provided by an embodiment of the present application is described with reference to a flowchart shown in fig. 1.
The method shown in fig. 1 comprises:
s1, obtaining audio data, carrying out spectrum analysis on the audio data, and generating a Mel-cepstrum of the audio data.
In the embodiment of the present application, the audio data refers to digitized sound data, including voice, music, and other sounds, such as speech uttered by a user, music played on a piano, or the sound produced by objects colliding.
As an embodiment of the present application, referring to fig. 2, the performing a spectrum analysis on the audio data to generate a mel-frequency cepstrum of the audio data includes:
S20, preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
S21, performing Mel spectrum filtering on the spectrogram, and performing cepstrum analysis on the filtered spectrogram to obtain an initial Mel cepstrum of the audio data;
and S22, performing discrete transformation on the initial Mel cepstrum to obtain a Mel cepstrum of the audio data.
In one embodiment of the present application, the preprocessing of the audio data includes framing the audio data and windowing the framed audio data. Framing divides the signal of the audio data into successive frames, turning a long audio signal into short audio signals, usually with 10-30 ms per frame; windowing eliminates the discontinuities at both ends of each framed signal.
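As an illustrative sketch of this preprocessing (not the patent's own implementation), the framing and windowing can be done with NumPy; the 25 ms frame length, 10 ms hop, and Hamming window are assumed values consistent with the 10-30 ms range mentioned above:

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D audio signal into overlapping frames and apply a
    Hamming window to remove discontinuities at the frame edges."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len)
```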
In one embodiment of the present application, the short-time Fourier transform refers to performing a Fourier transform on each short-time signal, and is used to convert the signal of the audio data from the time domain to the frequency domain so that changes in the signal of the audio data can be analyzed. Optionally, the short-time Fourier transform is performed on the preprocessed audio data using the following formula:

F(ω) = ∫ f(t) e^{-jωt} dt

where F(ω) represents the spectrogram, f(t) represents the signal of the preprocessed audio data, and e represents the base of the natural logarithm (an infinite non-repeating decimal).
In one embodiment of the present application, the Mel-spectrum filtering is used to mask sound signals in the spectrogram that fall outside a preset frequency range, yielding a spectrogram that matches the hearing characteristics of the human ear; the cepstrum analysis performs a secondary spectrum analysis on the spectrogram of the audio data to extract its contour information, thereby obtaining the characteristic data of the audio data. Optionally, the Mel-spectrum filtering of the spectrogram is performed with a Mel filter bank, the preset frequency range is 200 Hz to 500 Hz, and the cepstrum analysis is realized by taking the logarithm of the Mel-filtered spectrogram.
In one embodiment of the present application, the discrete transformation compresses the initial Mel cepstrum to reduce its dimensionality, thereby improving the speed of subsequent processing.
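Putting S20-S22 together, the following is a minimal sketch of the Mel cepstrum computation using librosa and SciPy; the sampling rate, FFT size, number of Mel filters, and number of retained coefficients are illustrative assumptions, since the patent does not fix them:

```python
import numpy as np
import scipy.fftpack
import librosa

def mel_cepstrum(audio_path, n_mels=40, n_coeffs=13):
    """Mel cepstrum: STFT -> Mel-spectrum filtering -> log (cepstrum
    analysis) -> discrete cosine transform (dimension reduction)."""
    y, sr = librosa.load(audio_path, sr=16000)
    # S20: short-time Fourier transform of the framed, windowed signal.
    power = np.abs(librosa.stft(y, n_fft=512, hop_length=160, win_length=400)) ** 2
    # S21: Mel-spectrum filtering, then a logarithm as the cepstrum analysis.
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)
    # S22: discrete (cosine) transform compresses the initial Mel cepstrum.
    return scipy.fftpack.dct(log_mel, axis=0, norm='ortho')[:n_coeffs]
```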
S2, performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal.
In an embodiment of the present application, the audio recognition model includes a convolution module and a phoneme recognition module, the convolution module is configured to extract the characteristic audio signal in the mel-frequency cepstrum, and the phoneme recognition module is configured to recognize a phoneme sequence of the extracted characteristic audio signal.
According to one embodiment of the application, the convolution module is built on a convolutional neural network and comprises a convolution layer, a linear rectification layer, a pooling layer, and a fully connected layer. The convolution layer extracts different characteristic audio signals from the input Mel cepstrum, such as hierarchical features like edges, lines, and angles. The linear rectification layer uses linear rectification (ReLU) as the activation function, which strengthens the nonlinear properties of the decision function and of the whole network and thus speeds up training. The pooling layer reduces the dimensionality of the characteristic audio signals extracted by the convolution layer: it cuts the extracted features into several regions and takes the maximum or average of each region, producing new, lower-dimensional characteristic audio signals. The fully connected layer combines all local features into a global feature and outputs the characteristic audio signal.
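A minimal PyTorch sketch of such a convolution module follows; the channel counts, kernel sizes, and output dimension are assumptions for illustration, not values given by the patent:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution layer -> linear rectification (ReLU) -> pooling
    (dimension reduction) -> fully connected output."""
    def __init__(self, n_coeffs=13, out_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # convolution layer
        self.relu = nn.ReLU()                                   # linear rectification layer
        self.pool = nn.MaxPool2d(2)                             # pooling layer
        self.fc = nn.Linear(32 * (n_coeffs // 2), out_dim)      # fully connected layer

    def forward(self, x):
        # x: (batch, 1, n_coeffs, n_frames), the Mel cepstrum as a 2-D input
        h = self.pool(self.relu(self.conv(x)))   # (batch, 32, n_coeffs//2, n_frames//2)
        h = h.permute(0, 3, 1, 2).flatten(2)     # one feature vector per pooled frame
        return self.fc(h)                        # characteristic audio signal
```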
According to one embodiment of the application, the phoneme recognition module is built on a time-delay neural network and comprises an input layer, a hidden layer, and an output layer. The input layer receives the characteristic audio signals passed on by the convolution module; the hidden layer, by setting time delays, weighs each input characteristic audio signal together with its delayed context so as to extract the phoneme sequences of the characteristic audio signals that meet the conditions; and the output layer outputs the phoneme sequences extracted by the hidden layer.
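The time-delay behaviour can be sketched with 1-D convolutions over the frame axis, a common way of realising a time-delay neural network; the context widths (standing in for the "delay data") and the phoneme inventory size below are assumptions:

```python
import torch
import torch.nn as nn

class PhonemeModule(nn.Module):
    """Input layer -> hidden layer (weighing each frame together with its
    delayed neighbours) -> output layer producing per-frame phoneme scores."""
    def __init__(self, in_dim=256, hidden=512, n_phonemes=100):
        super().__init__()
        self.input_layer = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.hidden_layer = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2)
        self.output_layer = nn.Conv1d(hidden, n_phonemes, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, n_frames, in_dim) from the convolution module
        h = feats.transpose(1, 2)                    # (batch, in_dim, n_frames)
        h = torch.relu(self.input_layer(h))          # context of +-1 frame
        h = torch.relu(self.hidden_layer(h))         # context of +-2 frames (delay)
        return self.output_layer(h).transpose(1, 2)  # per-frame phoneme scores
```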
Further, in this embodiment of the application, before the feature extraction is performed on the Mel cepstrum by using the pre-trained audio recognition model, the method further includes: training the audio recognition model.
Specifically, referring to fig. 3, the training of the audio recognition model includes:
S30, acquiring a training cepstrum and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
S31, performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
S32, inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
S33, calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
S34, if the training loss does not meet the preset condition, adjusting the parameters of the audio recognition model, and returning to the step of inputting the model training data into the convolution module of the audio recognition model;
and S35, if the training loss meets the preset condition, obtaining the trained audio recognition model.
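Assuming the ConvModule and PhonemeModule sketches above, steps S32-S35 might be looped as follows; model_training_data, m_g, and alpha_g stand for the tensors prepared in S30-S31, and compute_training_loss is the loss function sketched after the loss formulas below. All of these names are illustrative, not taken from the patent:

```python
import torch

conv, tdnn = ConvModule(), PhonemeModule()
optimizer = torch.optim.SGD(
    list(conv.parameters()) + list(tdnn.parameters()), lr=0.01)

for step in range(10000):
    m_p = conv(model_training_data)    # S32: second characteristic audio signal
    alpha_p = tdnn(m_p)                # S32: second phoneme sequence (scores)
    loss = compute_training_loss(m_g, m_p, alpha_g, alpha_p)  # S33
    if loss.item() < 0.1:              # S35: preset condition met, model trained
        break
    optimizer.zero_grad()              # S34: adjust the parameters and feed the
    loss.backward()                    # training data through the convolution
    optimizer.step()                   # module again
```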
In an optional embodiment, the first characteristic audio signal is used as a real label of a characteristic audio signal obtained by subsequent model training, the first phoneme sequence is used as a real label of a phoneme sequence obtained by subsequent model training, and the learning effect of a subsequent model can be supervised based on the first characteristic audio signal and the first phoneme sequence, so that the overall recognition capability of the model is improved.
In an optional embodiment, the spectrum enhancement is used to enlarge the set of training cepstrums, providing more data for subsequent model training and thereby improving the overall robustness of the model.
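The patent does not specify how the spectrum enhancement is performed; a common choice, sketched here purely as an assumption, is SpecAugment-style masking of random frequency bands and time spans of the training cepstrum:

```python
import numpy as np

def spectrum_enhance(cepstrum, n_freq_masks=2, n_time_masks=2, max_width=8, rng=None):
    """Return an augmented copy of a (n_coeffs, n_frames) training cepstrum
    with random frequency bands and time spans zeroed out."""
    rng = rng or np.random.default_rng()
    aug = cepstrum.copy()
    n_coeffs, n_frames = aug.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(1, max_width + 1))
        f0 = int(rng.integers(0, max(1, n_coeffs - w)))
        aug[f0:f0 + w, :] = 0.0
    for _ in range(n_time_masks):
        w = int(rng.integers(1, max_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        aug[:, t0:t0 + w] = 0.0
    return aug

# Both the enhanced and the original cepstrum become model training data.
# model_training_data = [cepstrum, spectrum_enhance(cepstrum)]
```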
In an optional embodiment, the inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data includes: performing a convolution operation on the model training data by using a convolution layer in the convolution module to obtain an initial characteristic audio signal, performing linear adjustment on the initial characteristic audio signal by using a linear rectification layer in the convolution module, performing dimensionality reduction on the linearly adjusted initial characteristic audio signal by using a pooling layer in the convolution module, and outputting the dimension-reduced initial characteristic audio signal through a fully connected layer in the convolution module to obtain the second characteristic audio signal.
In an optional embodiment, the recognizing a second phoneme sequence of the second characteristic audio signal by the phoneme recognition module of the audio recognition model includes: and receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, setting delay data of the second characteristic audio signal, extracting a phoneme sequence of the second characteristic audio signal by using a hidden layer in the phoneme recognition module according to the delay data, and outputting the extracted phoneme sequence by using an output layer in the phoneme recognition module to obtain a second phoneme sequence.
In an alternative embodiment, referring to fig. 4, S33 includes:
S40, calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
S41, calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and S42, calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In an alternative embodiment of the present application, the first training loss is calculated according to the following formula:

L_C = m_g·log(m_p) + (1 - m_g)·log(1 - m_p)

where L_C represents the first training loss, m_g represents the first characteristic audio signal, and m_p represents the second characteristic audio signal.
In an alternative embodiment of the present application, the second training loss is calculated according to the following formula:

L_1 = |α_p - α_g|

where L_1 denotes the second training loss, α_g represents the first phoneme sequence, and α_p represents the second phoneme sequence.
In an optional embodiment of the present application, the first training loss and the second training loss are added to obtain the training loss of the audio recognition model, i.e., L = L_1 + L_C.
In an optional embodiment of the present application, the preset condition comprises the training loss being less than a loss threshold. That is, when the training loss is smaller than the loss threshold, the training loss satisfies the preset condition; when the training loss is greater than or equal to the loss threshold, it does not. The loss threshold may be set to 0.1, or set according to the actual scene. Further, the parameter adjustment of the audio recognition model may be implemented by a gradient descent algorithm, such as stochastic gradient descent.
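A sketch of the combined loss and stopping test follows the formulas above literally; note that the printed L_C is the negative of the usual binary cross-entropy, and treating m_g, m_p as per-frame probabilities and α_g, α_p as numeric phoneme encodings is an interpretation for illustration only:

```python
import torch

def compute_training_loss(m_g, m_p, alpha_g, alpha_p, eps=1e-8):
    """Training loss L = L_1 + L_C from the formulas above."""
    # First training loss: L_C = m_g*log(m_p) + (1 - m_g)*log(1 - m_p),
    # written exactly as in the description (standard BCE is its negative).
    l_c = (m_g * torch.log(m_p + eps)
           + (1 - m_g) * torch.log(1 - m_p + eps)).mean()
    # Second training loss: L_1 = |alpha_p - alpha_g|.
    l_1 = torch.abs(alpha_p - alpha_g).mean()
    return l_1 + l_c

LOSS_THRESHOLD = 0.1  # example value from the description; training stops
                      # once the loss drops below this threshold
```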
And S3, performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
The embodiment of the application obtains the recognition result of the audio data by performing text extraction on the phoneme sequence. In one embodiment, the text extraction is carried out with a language model, which is an abstract mathematical model of language built from objective linguistic facts and is used to identify the text information relations corresponding to the phoneme sequence.
In detail, the extracting text from the phoneme sequence by using the language model includes: calculating the text generation probability of the phoneme sequences with the language model, identifying the text information relations among the phoneme sequences according to the text generation probability, and generating the corresponding text according to the text information relations. The text generation probability refers to the distribution probability with which a phoneme sequence generates text, and the text information relation refers to the relation by which any two or more phoneme sequences can form text.
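As a toy sketch of this step: the phoneme-to-word probabilities ("text generation probability") and the word-pair scores ("text information relation") below are hypothetical stand-ins for a real language model:

```python
# Hypothetical language-model tables for illustration only.
word_probs = {("n", "i", "h", "ao"): {"你好": 0.9, "拟好": 0.1}}
pair_scores = {("你好", "吗"): 0.8}

def extract_text(phoneme_groups, probs, pairs):
    """For each phoneme group, pick the word with the highest generation
    probability plus its relation score with the previously chosen word."""
    out = []
    for group in phoneme_groups:
        candidates = probs.get(tuple(group), {})
        best = max(
            candidates,
            key=lambda w: candidates[w] + (pairs.get((out[-1], w), 0.0) if out else 0.0),
            default="",
        )
        if best:
            out.append(best)
    return "".join(out)

print(extract_text([["n", "i", "h", "ao"]], word_probs, pair_scores))  # -> 你好
```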
According to the method, the characteristic data of the audio data is first extracted through spectrum analysis of the audio data, which reduces the complexity of the audio data and improves the accuracy of its subsequent analysis. Secondly, the method performs feature extraction and phoneme recognition on the Mel cepstrum of the audio data with a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition of the audio data, which strengthens the audio recognition model's resistance to interference from complex audio data and further improves the analysis accuracy of the audio data.
Fig. 5 is a functional block diagram of the speech recognition apparatus according to the present application.
The speech recognition apparatus 500 may be installed in an electronic device. Depending on the implemented functions, the speech recognition apparatus may comprise a spectrum analysis module 501, a phoneme sequence recognition module 502, and a text extraction module 503. A module according to the present application, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the spectrum analysis module 501 is configured to acquire audio data, perform spectrum analysis on the audio data, and generate a Mel cepstrum of the audio data;
the phoneme sequence recognition module 502 is configured to perform feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and to recognize a phoneme sequence of the characteristic audio signal;
the text extraction module 503 is configured to perform text extraction on the phoneme sequence, and to use the result of the text extraction as the recognition result of the audio data.
In detail, in the embodiment of the present application, when the modules in the speech recognition apparatus 500 are used, the same technical means as the speech recognition method described in fig. 1 to fig. 4 are adopted, and the same technical effects can be produced, and are not described again here.
Fig. 6 is a schematic structural diagram of an electronic device implementing the speech recognition method according to the present application.
The electronic device 6 may comprise a processor 60, a memory 61 and a bus, and may further comprise a computer program, such as a speech recognition program 62, stored in the memory 61 and operable on the processor 60.
The memory 61 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 61 may in some embodiments be an internal storage unit of the electronic device 6, for example a hard disk of the electronic device 6. The memory 61 may also be an external storage device of the electronic device 6 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device 6. Further, the memory 61 may also include both an internal storage unit of the electronic device 6 and an external storage device. The memory 61 may be used not only to store application software installed in the electronic device 6 and various types of data, such as the code of the voice recognition program 62, but also to temporarily store data that has been output or is to be output.
The processor 60 may be formed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 60 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 6 by running or executing programs or modules (e.g., executing the voice recognition program 62, etc.) stored in the memory 61 and calling data stored in the memory 61.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 61 and at least one processor 60 or the like.
Fig. 6 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 6, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 6 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 60 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 6 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 6 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device 6 and other electronic devices.
Optionally, the electronic device 6 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 6 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech recognition program 62 stored in the memory 61 of the electronic device 6 is a combination of computer programs which, when run on the processor 60, can implement:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
Specifically, the processor 60 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated modules/units of the electronic device 6, if implemented in the form of software functional units and sold or used as separate products, may be stored in a non-volatile computer-readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present application also provides a computer-readable storage medium, storing a computer program that, when executed by a processor of an electronic device, may implement:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
2. The speech recognition method of claim 1, wherein the performing spectral analysis on the audio data to generate a mel-frequency cepstrum of the audio data comprises:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing Mel spectrum filtering on the spectrogram, and performing cepstrum analysis on the Mel spectrum filtered spectrogram to obtain an initial Mel cepstrum of the audio data;
and performing discrete transformation on the initial Mel cepstrum to obtain a Mel cepstrum of the audio data.
3. The speech recognition method of claim 1, wherein before the feature extraction of the mel-frequency cepstrum by using the pre-trained audio recognition model, the method further comprises:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting the parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets a preset condition, obtaining a trained audio recognition model.
4. The speech recognition method of claim 3 wherein inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data comprises:
carrying out convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to carry out linear adjustment on the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the initial characteristic audio signal after the dimension reduction by using a full connection layer in the convolution module to obtain the second characteristic audio signal.
5. The speech recognition method of claim 3 wherein said recognizing a second sequence of phonemes for the second characteristic audio signal using the phoneme recognition module of the audio recognition model comprises:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
6. The speech recognition method of claim 3 wherein calculating a training loss for the audio recognition model based on the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence, and the second phoneme sequence comprises:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
7. The speech recognition method according to any one of claims 1 to 6, wherein the performing text extraction on the phoneme sequence comprises:
calculating a text generation probability from the phoneme sequence;
and identifying a text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
8. A speech recognition apparatus, comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a Mel cepstrum of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a speech recognition method according to any one of claims 1 to 7.
CN202110610069.6A 2021-06-01 2021-06-01 Voice recognition method, device, electronic equipment and storage medium Active CN113327586B (en)

Priority Applications (1)

Application Number: CN202110610069.6A (granted as CN113327586B)
Priority Date: 2021-06-01
Filing Date: 2021-06-01
Title: Voice recognition method, device, electronic equipment and storage medium

Publications (2)

CN113327586A (published 2021-08-31)
CN113327586B (published 2023-11-28)

Family

Family ID: 77423260

Family Applications (1)

Application Number: CN202110610069.6A (Active; granted as CN113327586B)
Priority Date: 2021-06-01
Filing Date: 2021-06-01
Title: Voice recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113327586B (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050014183A (en) * 2003-07-30 2005-02-07 주식회사 팬택 Method for modificating state
CN101447185A (en) * 2008-12-08 2009-06-03 深圳市北科瑞声科技有限公司 Audio frequency rapid classification method based on content
US20110224979A1 (en) * 2010-03-09 2011-09-15 Honda Motor Co., Ltd. Enhancing Speech Recognition Using Visual Information
CN106486119A (en) * 2016-10-20 2017-03-08 海信集团有限公司 A kind of method and apparatus of identification voice messaging
JP2019020598A (en) * 2017-07-18 2019-02-07 国立研究開発法人情報通信研究機構 Learning method of neural network
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
KR20190110939A (en) * 2018-03-21 2019-10-01 한국과학기술원 Environment sound recognition method based on convolutional neural networks, and system thereof
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
US20200175961A1 (en) * 2018-12-04 2020-06-04 Sorenson Ip Holdings, Llc Training of speech recognition systems
CN111489745A (en) * 2019-01-28 2020-08-04 上海菲碧文化传媒有限公司 Chinese speech recognition system applied to artificial intelligence
CN110827801A (en) * 2020-01-09 2020-02-21 成都无糖信息技术有限公司 Automatic voice recognition method and system based on artificial intelligence
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111681637A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium
CN111862962A (en) * 2020-07-20 2020-10-30 汪秀英 Voice recognition method and system
CN112116903A (en) * 2020-08-17 2020-12-22 北京大米科技有限公司 Method and device for generating speech synthesis model, storage medium and electronic equipment
CN112133289A (en) * 2020-11-24 2020-12-25 北京远鉴信息技术有限公司 Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium
CN112735371A (en) * 2020-12-28 2021-04-30 出门问问(苏州)信息科技有限公司 Method and device for generating speaker video based on text information
CN112466279A (en) * 2021-02-02 2021-03-09 深圳市阿卡索资讯股份有限公司 Automatic correction method and device for spoken English pronunciation
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763952A (en) * 2021-09-03 2021-12-07 深圳市北科瑞声科技股份有限公司 Dynamic voice recognition method and device, electronic equipment and storage medium
CN113763952B (en) * 2021-09-03 2022-07-26 深圳市北科瑞声科技股份有限公司 Dynamic voice recognition method and device, electronic equipment and storage medium
CN113808577A (en) * 2021-09-18 2021-12-17 平安银行股份有限公司 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN114743554A (en) * 2022-06-09 2022-07-12 武汉工商学院 Intelligent household interaction method and device based on Internet of things

Also Published As

Publication number Publication date
CN113327586B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN105976812B (en) A kind of audio recognition method and its equipment
CN113327586B (en) Voice recognition method, device, electronic equipment and storage medium
CN110265040A (en) Training method, device, storage medium and the electronic equipment of sound-groove model
CN109785820A (en) A kind of processing method, device and equipment
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN110738998A (en) Voice-based personal credit evaluation method, device, terminal and storage medium
CN111243569A (en) Emotional voice automatic generation method and device based on generation type confrontation network
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN116665669A (en) Voice interaction method and system based on artificial intelligence
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN112735371A (en) Method and device for generating speaker video based on text information
CN110264993A (en) Phoneme synthesizing method, device, equipment and computer readable storage medium
CN111429914B (en) Microphone control method, electronic device and computer readable storage medium
CN113807103A (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN114999533A (en) Intelligent question-answering method, device, equipment and storage medium based on emotion recognition
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN117033556A (en) Memory preservation and memory extraction method based on artificial intelligence and related equipment
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium
CN113053409B (en) Audio evaluation method and device
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant