CN113327586A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN113327586A
CN113327586A
Authority
CN
China
Prior art keywords
audio
audio signal
phoneme sequence
data
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110610069.6A
Other languages
Chinese (zh)
Other versions
CN113327586B (en)
Inventor
汪雪
黄石磊
程刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd
Priority to CN202110610069.6A
Publication of CN113327586A
Application granted
Publication of CN113327586B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques where the extracted parameters are the cepstrum
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a voice recognition method, comprising the following steps: acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data; performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal; and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data. In addition, the application also provides a voice recognition apparatus, an electronic device, and a computer-readable storage medium. The method and apparatus can improve the accuracy of voice recognition.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
In recent years, machine learning has developed rapidly, and speech recognition has made major breakthroughs under deep learning. Although the traditional speech recognition framework already delivers stable industrial-grade recognition, with the introduction of deep learning, users in the era of intelligent big data are no longer satisfied with limited model accuracy and expect speech recognition to handle more complex data.
At present, speech recognition is usually realized with a speech recognition model based on an attention mechanism. Such a model places extremely high demands on the quality of the speech to be recognized, yet actual service scenes produce speech data recorded in varied noise environments, such as data with accents and dialects, noisy scenes, and far-field recordings. This degrades the recognition capability of the attention-based model and thus the accuracy of speech recognition.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, the present application provides a speech recognition method, apparatus, electronic device, and computer-readable storage medium, which can improve the accuracy of speech recognition.
In a first aspect, the present application provides a speech recognition method, including:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
According to the method, the characteristic data of the audio data can first be extracted through spectrum analysis of the audio data, which reduces the complexity of the audio data and improves the accuracy of subsequent analysis. Secondly, the method performs feature extraction and phoneme recognition on the Mel cepstrum of the audio data with a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition, which strengthens the audio recognition model's resistance to interference from complex audio data and further improves analysis accuracy. Therefore, compared with the prior art, the method can enhance the model's anti-interference capability on audio data and improve the accuracy of speech recognition.
In one possible implementation manner of the first aspect, the performing a spectrum analysis on the audio data to generate a mel-frequency cepstrum of the audio data includes:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing Mel spectrum filtering on the spectrogram, and performing cepstrum analysis on the Mel spectrum filtered spectrogram to obtain an initial Mel cepstrum of the audio data;
and performing discrete transformation on the initial Mel cepstrum to obtain a Mel cepstrum of the audio data.
In a possible implementation manner of the first aspect, before performing the feature extraction on the Mel cepstrum by using the pre-trained audio recognition model, the method further includes:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting the parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets a preset condition, obtaining a trained audio recognition model.
In one possible implementation manner of the first aspect, the inputting the model training data into a convolution module of the audio recognition model to output a second feature audio signal of the model training data includes:
carrying out convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to carry out linear adjustment on the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the initial characteristic audio signal after the dimension reduction by using a full connection layer in the convolution module to obtain the second characteristic audio signal.
In one possible implementation manner of the first aspect, the recognizing, by the phoneme recognition module of the audio recognition model, a second phoneme sequence of the second characteristic audio signal includes:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
In one possible implementation manner of the first aspect, the calculating a training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence includes:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In a possible implementation manner of the first aspect, the performing text extraction on the phoneme sequence includes:
calculating a text generation probability from the phoneme sequence;
and identifying a text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
In a second aspect, the present application provides a speech recognition apparatus comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a Mel cepstrum of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data.
In a third aspect, the present application provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method according to any of the first aspects above.
The beneficial effects of the second to fourth aspects can be referred to the related description of the first aspect, and are not repeated herein.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; for those of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a detailed flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a step of the speech recognition method provided in FIG. 1 according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another step of the speech recognition method of FIG. 1 according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating further steps of a speech recognition method provided in FIG. 1 according to an embodiment of the present application;
fig. 5 is a block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic internal structural diagram of an electronic device implementing a speech recognition method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
A speech recognition method provided by an embodiment of the present application is described with reference to a flowchart shown in fig. 1.
The method shown in fig. 1 comprises:
s1, obtaining audio data, carrying out spectrum analysis on the audio data, and generating a Mel-cepstrum of the audio data.
In the embodiment of the present application, the audio data refers to digitized sound data, including voice, music, and other sounds, such as speech uttered by a user, music played on a piano, or the sound produced by objects colliding.
As an embodiment of the present application, referring to fig. 2, the performing a spectrum analysis on the audio data to generate a mel-frequency cepstrum of the audio data includes:
S20, preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
S21, performing Mel spectrum filtering on the spectrogram, and performing cepstrum analysis on the filtered spectrogram to obtain an initial Mel cepstrum of the audio data;
and S22, performing discrete transformation on the initial Mel cepstrum to obtain a Mel cepstrum of the audio data.
In one embodiment of the present application, the preprocessing of the audio data includes framing the audio data and windowing the framed audio data. Framing divides the signal of the audio data into successive frames, turning a long audio signal into short audio signals, usually with 10-30 ms per frame; windowing eliminates the discontinuities at both ends of each framed signal.
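As an illustrative sketch of this preprocessing (not the patent's own implementation), the framing and windowing can be done with NumPy; the 25 ms frame length, 10 ms hop, and Hamming window are assumed values consistent with the 10-30 ms range mentioned above:

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D audio signal into overlapping frames and apply a
    Hamming window to remove discontinuities at the frame edges."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len)
```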
In one embodiment of the present application, the short-time Fourier transform refers to performing a Fourier transform on each short-time signal, and is used to convert the signal of the audio data from the time domain to the frequency domain so that changes in the signal of the audio data can be analyzed. Optionally, the short-time Fourier transform is performed on the preprocessed audio data using the following formula:

F(ω) = ∫ f(t) e^{-jωt} dt

where F(ω) represents the spectrogram, f(t) represents the signal of the preprocessed audio data, and e represents the base of the natural logarithm (an infinite non-repeating decimal).
In one embodiment of the present application, the Mel-spectrum filtering is used to mask sound signals in the spectrogram that fall outside a preset frequency range, yielding a spectrogram that matches the hearing characteristics of the human ear; the cepstrum analysis performs a secondary spectrum analysis on the spectrogram of the audio data to extract its contour information, thereby obtaining the characteristic data of the audio data. Optionally, the Mel-spectrum filtering of the spectrogram is performed with a Mel filter bank, the preset frequency range is 200 Hz to 500 Hz, and the cepstrum analysis is realized by taking the logarithm of the Mel-filtered spectrogram.
In one embodiment of the present application, the discrete transformation compresses the initial Mel cepstrum to reduce its dimensionality, thereby improving the speed of subsequent processing.
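Putting S20-S22 together, the following is a minimal sketch of the Mel cepstrum computation using librosa and SciPy; the sampling rate, FFT size, number of Mel filters, and number of retained coefficients are illustrative assumptions, since the patent does not fix them:

```python
import numpy as np
import scipy.fftpack
import librosa

def mel_cepstrum(audio_path, n_mels=40, n_coeffs=13):
    """Mel cepstrum: STFT -> Mel-spectrum filtering -> log (cepstrum
    analysis) -> discrete cosine transform (dimension reduction)."""
    y, sr = librosa.load(audio_path, sr=16000)
    # S20: short-time Fourier transform of the framed, windowed signal.
    power = np.abs(librosa.stft(y, n_fft=512, hop_length=160, win_length=400)) ** 2
    # S21: Mel-spectrum filtering, then a logarithm as the cepstrum analysis.
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)
    # S22: discrete (cosine) transform compresses the initial Mel cepstrum.
    return scipy.fftpack.dct(log_mel, axis=0, norm='ortho')[:n_coeffs]
```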
S2, performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal.
In an embodiment of the present application, the audio recognition model includes a convolution module and a phoneme recognition module, the convolution module is configured to extract the characteristic audio signal in the mel-frequency cepstrum, and the phoneme recognition module is configured to recognize a phoneme sequence of the extracted characteristic audio signal.
According to one embodiment of the application, the convolution module is built on a convolutional neural network and comprises a convolution layer, a linear rectification layer, a pooling layer, and a fully connected layer. The convolution layer extracts different characteristic audio signals from the input Mel cepstrum, such as hierarchical features like edges, lines, and angles. The linear rectification layer uses linear rectification (ReLU) as the activation function, which strengthens the nonlinear properties of the decision function and of the whole network and thus speeds up training. The pooling layer reduces the dimensionality of the characteristic audio signals extracted by the convolution layer: it cuts the extracted features into several regions and takes the maximum or average of each region, producing new, lower-dimensional characteristic audio signals. The fully connected layer combines all local features into a global feature and outputs the characteristic audio signal.
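A minimal PyTorch sketch of such a convolution module follows; the channel counts, kernel sizes, and output dimension are assumptions for illustration, not values given by the patent:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution layer -> linear rectification (ReLU) -> pooling
    (dimension reduction) -> fully connected output."""
    def __init__(self, n_coeffs=13, out_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)  # convolution layer
        self.relu = nn.ReLU()                                   # linear rectification layer
        self.pool = nn.MaxPool2d(2)                             # pooling layer
        self.fc = nn.Linear(32 * (n_coeffs // 2), out_dim)      # fully connected layer

    def forward(self, x):
        # x: (batch, 1, n_coeffs, n_frames), the Mel cepstrum as a 2-D input
        h = self.pool(self.relu(self.conv(x)))   # (batch, 32, n_coeffs//2, n_frames//2)
        h = h.permute(0, 3, 1, 2).flatten(2)     # one feature vector per pooled frame
        return self.fc(h)                        # characteristic audio signal
```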
According to one embodiment of the application, the phoneme recognition module is built on a time-delay neural network and comprises an input layer, a hidden layer, and an output layer. The input layer receives the characteristic audio signals passed on by the convolution module; the hidden layer, by setting time delays, weighs each input characteristic audio signal together with its delayed context so as to extract the phoneme sequences of the characteristic audio signals that meet the conditions; and the output layer outputs the phoneme sequences extracted by the hidden layer.
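The time-delay behaviour can be sketched with 1-D convolutions over the frame axis, a common way of realising a time-delay neural network; the context widths (standing in for the "delay data") and the phoneme inventory size below are assumptions:

```python
import torch
import torch.nn as nn

class PhonemeModule(nn.Module):
    """Input layer -> hidden layer (weighing each frame together with its
    delayed neighbours) -> output layer producing per-frame phoneme scores."""
    def __init__(self, in_dim=256, hidden=512, n_phonemes=100):
        super().__init__()
        self.input_layer = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.hidden_layer = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2)
        self.output_layer = nn.Conv1d(hidden, n_phonemes, kernel_size=1)

    def forward(self, feats):
        # feats: (batch, n_frames, in_dim) from the convolution module
        h = feats.transpose(1, 2)                    # (batch, in_dim, n_frames)
        h = torch.relu(self.input_layer(h))          # context of +-1 frame
        h = torch.relu(self.hidden_layer(h))         # context of +-2 frames (delay)
        return self.output_layer(h).transpose(1, 2)  # per-frame phoneme scores
```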
Further, in this embodiment of the application, before the feature extraction is performed on the Mel cepstrum by using the pre-trained audio recognition model, the method further includes: training the audio recognition model.
Specifically, referring to fig. 3, the training of the audio recognition model includes:
S30, acquiring a training cepstrum and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
S31, performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
S32, inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
S33, calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
S34, if the training loss does not meet the preset condition, adjusting the parameters of the audio recognition model, and returning to the step of inputting the model training data into the convolution module of the audio recognition model;
and S35, if the training loss meets the preset condition, obtaining the trained audio recognition model.
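Assuming the ConvModule and PhonemeModule sketches above, steps S32-S35 might be looped as follows; model_training_data, m_g, and alpha_g stand for the tensors prepared in S30-S31, and compute_training_loss is the loss function sketched after the loss formulas below. All of these names are illustrative, not taken from the patent:

```python
import torch

conv, tdnn = ConvModule(), PhonemeModule()
optimizer = torch.optim.SGD(
    list(conv.parameters()) + list(tdnn.parameters()), lr=0.01)

for step in range(10000):
    m_p = conv(model_training_data)    # S32: second characteristic audio signal
    alpha_p = tdnn(m_p)                # S32: second phoneme sequence (scores)
    loss = compute_training_loss(m_g, m_p, alpha_g, alpha_p)  # S33
    if loss.item() < 0.1:              # S35: preset condition met, model trained
        break
    optimizer.zero_grad()              # S34: adjust the parameters and feed the
    loss.backward()                    # training data through the convolution
    optimizer.step()                   # module again
```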
In an optional embodiment, the first characteristic audio signal is used as a real label of a characteristic audio signal obtained by subsequent model training, the first phoneme sequence is used as a real label of a phoneme sequence obtained by subsequent model training, and the learning effect of a subsequent model can be supervised based on the first characteristic audio signal and the first phoneme sequence, so that the overall recognition capability of the model is improved.
In an optional embodiment, the spectrum enhancement is used to enlarge the set of training cepstrums, providing more data for subsequent model training and thereby improving the overall robustness of the model.
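The patent does not specify how the spectrum enhancement is performed; a common choice, sketched here purely as an assumption, is SpecAugment-style masking of random frequency bands and time spans of the training cepstrum:

```python
import numpy as np

def spectrum_enhance(cepstrum, n_freq_masks=2, n_time_masks=2, max_width=8, rng=None):
    """Return an augmented copy of a (n_coeffs, n_frames) training cepstrum
    with random frequency bands and time spans zeroed out."""
    rng = rng or np.random.default_rng()
    aug = cepstrum.copy()
    n_coeffs, n_frames = aug.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(1, max_width + 1))
        f0 = int(rng.integers(0, max(1, n_coeffs - w)))
        aug[f0:f0 + w, :] = 0.0
    for _ in range(n_time_masks):
        w = int(rng.integers(1, max_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - w)))
        aug[:, t0:t0 + w] = 0.0
    return aug

# Both the enhanced and the original cepstrum become model training data.
# model_training_data = [cepstrum, spectrum_enhance(cepstrum)]
```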
In an optional embodiment, the inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data includes: performing a convolution operation on the model training data by using a convolution layer in the convolution module to obtain an initial characteristic audio signal, performing linear adjustment on the initial characteristic audio signal by using a linear rectification layer in the convolution module, performing dimensionality reduction on the linearly adjusted initial characteristic audio signal by using a pooling layer in the convolution module, and outputting the dimension-reduced initial characteristic audio signal through a fully connected layer in the convolution module to obtain the second characteristic audio signal.
In an optional embodiment, the recognizing a second phoneme sequence of the second characteristic audio signal by the phoneme recognition module of the audio recognition model includes: and receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, setting delay data of the second characteristic audio signal, extracting a phoneme sequence of the second characteristic audio signal by using a hidden layer in the phoneme recognition module according to the delay data, and outputting the extracted phoneme sequence by using an output layer in the phoneme recognition module to obtain a second phoneme sequence.
In an alternative embodiment, referring to fig. 4, S33 includes:
S40, calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
S41, calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and S42, calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
In an alternative embodiment of the present application, the first training loss is calculated according to the following formula:

L_C = m_g·log(m_p) + (1 - m_g)·log(1 - m_p)

where L_C represents the first training loss, m_g represents the first characteristic audio signal, and m_p represents the second characteristic audio signal.
In an alternative embodiment of the present application, the second training loss is calculated according to the following formula:

L_1 = |α_p - α_g|

where L_1 denotes the second training loss, α_g represents the first phoneme sequence, and α_p represents the second phoneme sequence.
In an optional embodiment of the present application, the first training loss and the second training loss are added to obtain the training loss of the audio recognition model, i.e., L = L_1 + L_C.
In an optional embodiment of the present application, the preset condition comprises the training loss being less than a loss threshold. That is, when the training loss is smaller than the loss threshold, the training loss satisfies the preset condition; when the training loss is greater than or equal to the loss threshold, it does not. The loss threshold may be set to 0.1, or set according to the actual scene. Further, the parameter adjustment of the audio recognition model may be implemented by a gradient descent algorithm, such as stochastic gradient descent.
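A sketch of the combined loss and stopping test follows the formulas above literally; note that the printed L_C is the negative of the usual binary cross-entropy, and treating m_g, m_p as per-frame probabilities and α_g, α_p as numeric phoneme encodings is an interpretation for illustration only:

```python
import torch

def compute_training_loss(m_g, m_p, alpha_g, alpha_p, eps=1e-8):
    """Training loss L = L_1 + L_C from the formulas above."""
    # First training loss: L_C = m_g*log(m_p) + (1 - m_g)*log(1 - m_p),
    # written exactly as in the description (standard BCE is its negative).
    l_c = (m_g * torch.log(m_p + eps)
           + (1 - m_g) * torch.log(1 - m_p + eps)).mean()
    # Second training loss: L_1 = |alpha_p - alpha_g|.
    l_1 = torch.abs(alpha_p - alpha_g).mean()
    return l_1 + l_c

LOSS_THRESHOLD = 0.1  # example value from the description; training stops
                      # once the loss drops below this threshold
```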
And S3, performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
The embodiment of the application obtains the recognition result of the audio data by performing text extraction on the phoneme sequence. In one embodiment, the text extraction is carried out with a language model, which is an abstract mathematical model of language built from objective linguistic facts and is used to identify the text information relations corresponding to the phoneme sequence.
In detail, the extracting text from the phoneme sequence by using the language model includes: calculating the text generation probability of the phoneme sequences with the language model, identifying the text information relations among the phoneme sequences according to the text generation probability, and generating the corresponding text according to the text information relations. The text generation probability refers to the distribution probability with which a phoneme sequence generates text, and the text information relation refers to the relation by which any two or more phoneme sequences can form text.
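As a toy sketch of this step: the phoneme-to-word probabilities ("text generation probability") and the word-pair scores ("text information relation") below are hypothetical stand-ins for a real language model:

```python
# Hypothetical language-model tables for illustration only.
word_probs = {("n", "i", "h", "ao"): {"你好": 0.9, "拟好": 0.1}}
pair_scores = {("你好", "吗"): 0.8}

def extract_text(phoneme_groups, probs, pairs):
    """For each phoneme group, pick the word with the highest generation
    probability plus its relation score with the previously chosen word."""
    out = []
    for group in phoneme_groups:
        candidates = probs.get(tuple(group), {})
        best = max(
            candidates,
            key=lambda w: candidates[w] + (pairs.get((out[-1], w), 0.0) if out else 0.0),
            default="",
        )
        if best:
            out.append(best)
    return "".join(out)

print(extract_text([["n", "i", "h", "ao"]], word_probs, pair_scores))  # -> 你好
```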
According to the method, the characteristic data of the audio data is first extracted through spectrum analysis of the audio data, which reduces the complexity of the audio data and improves the accuracy of its subsequent analysis. Secondly, the method performs feature extraction and phoneme recognition on the Mel cepstrum of the audio data with a pre-trained audio recognition model, that is, end-to-end phoneme sequence recognition of the audio data, which strengthens the audio recognition model's resistance to interference from complex audio data and further improves the analysis accuracy of the audio data.
Fig. 5 is a functional block diagram of the speech recognition apparatus according to the present application.
The speech recognition apparatus 500 may be installed in an electronic device. Depending on the implemented functions, the speech recognition apparatus may comprise a spectrum analysis module 501, a phoneme sequence recognition module 502, and a text extraction module 503. A module according to the present application, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the spectrum analysis module 501 is configured to acquire audio data, perform spectrum analysis on the audio data, and generate a Mel cepstrum of the audio data;
the phoneme sequence recognition module 502 is configured to perform feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and to recognize a phoneme sequence of the characteristic audio signal;
the text extraction module 503 is configured to perform text extraction on the phoneme sequence, and to use the result of the text extraction as the recognition result of the audio data.
In detail, in the embodiment of the present application, when the modules in the speech recognition apparatus 500 are used, the same technical means as the speech recognition method described in fig. 1 to fig. 4 are adopted, and the same technical effects can be produced, and are not described again here.
Fig. 6 is a schematic structural diagram of an electronic device implementing the speech recognition method according to the present application.
The electronic device 6 may comprise a processor 60, a memory 61 and a bus, and may further comprise a computer program, such as a speech recognition program 62, stored in the memory 61 and operable on the processor 60.
The memory 61 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 61 may in some embodiments be an internal storage unit of the electronic device 6, for example a hard disk of the electronic device 6. The memory 61 may also be an external storage device of the electronic device 6 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device 6. Further, the memory 61 may also include both an internal storage unit of the electronic device 6 and an external storage device. The memory 61 may be used not only to store application software installed in the electronic device 6 and various types of data, such as the code of the voice recognition program 62, but also to temporarily store data that has been output or is to be output.
The processor 60 may be formed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be formed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 60 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 6 by running or executing programs or modules (e.g., executing the voice recognition program 62, etc.) stored in the memory 61 and calling data stored in the memory 61.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 61 and at least one processor 60 or the like.
Fig. 6 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 6, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 6 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 60 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 6 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 6 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device 6 and other electronic devices.
Optionally, the electronic device 6 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 6 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech recognition program 62 stored in the memory 61 of the electronic device 6 is a combination of computer programs which, when run on the processor 60, can implement:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
Specifically, the processor 60 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated modules/units of the electronic device 6, if implemented in the form of software functional units and sold or used as separate products, may be stored in a non-volatile computer-readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present application also provides a computer-readable storage medium, storing a computer program that, when executed by a processor of an electronic device, may implement:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring audio data, and performing spectrum analysis on the audio data to generate a Mel cepstrum of the audio data;
performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and performing text extraction on the phoneme sequence, and taking the text extraction result as the recognition result of the audio data.
2. The speech recognition method of claim 1, wherein the performing spectral analysis on the audio data to generate a mel-frequency cepstrum of the audio data comprises:
preprocessing the audio data, and performing short-time Fourier transform on the preprocessed audio data to obtain a spectrogram of the audio data;
performing Mel spectrum filtering on the spectrogram, and performing cepstrum analysis on the Mel spectrum filtered spectrogram to obtain an initial Mel cepstrum of the audio data;
and performing discrete transformation on the initial Mel cepstrum to obtain a Mel cepstrum of the audio data.
3. The speech recognition method of claim 1, wherein before the feature extraction of the mel-frequency cepstrum by using the pre-trained audio recognition model, the method further comprises:
acquiring a training cepstrum and a corresponding first characteristic audio signal, and extracting a phoneme sequence from the first characteristic audio signal to obtain a first phoneme sequence;
performing spectrum enhancement on the training cepstrum, and taking both the spectrum-enhanced training cepstrum and the original training cepstrum as model training data;
inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data, and recognizing a second phoneme sequence of the second characteristic audio signal by using a phoneme recognition module of the audio recognition model;
calculating the training loss of the audio recognition model according to the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence and the second phoneme sequence;
if the training loss does not meet the preset condition, adjusting the parameters of the audio recognition model, and returning to the step of inputting the model training data into a convolution module of the audio recognition model;
and if the training loss meets a preset condition, obtaining a trained audio recognition model.
4. The speech recognition method of claim 3 wherein inputting the model training data into a convolution module of the audio recognition model to output a second characteristic audio signal of the model training data comprises:
carrying out convolution operation on the model training data by utilizing a convolution layer in the convolution module to obtain an initial characteristic audio signal;
utilizing a linear rectifying layer in the convolution module to carry out linear adjustment on the initial characteristic audio signal;
reducing the dimension of the initial characteristic audio signal after linear adjustment by using a pooling layer in the convolution module;
and outputting the initial characteristic audio signal after the dimension reduction by using a full connection layer in the convolution module to obtain the second characteristic audio signal.
5. The speech recognition method of claim 3 wherein said recognizing a second sequence of phonemes for the second characteristic audio signal using the phoneme recognition module of the audio recognition model comprises:
receiving the second characteristic audio signal by using an input layer in the phoneme recognition module, and setting delay data of the second characteristic audio signal;
extracting a phoneme sequence of the second characteristic audio signal by utilizing a hidden layer in the phoneme recognition module according to the delay data;
and outputting the extracted phoneme sequence by utilizing an output layer in the phoneme recognition module to obtain a second phoneme sequence.
6. The speech recognition method of claim 3 wherein calculating a training loss for the audio recognition model based on the first characteristic audio signal, the second characteristic audio signal, the first phoneme sequence, and the second phoneme sequence comprises:
calculating a first training loss of the audio recognition model according to the first characteristic audio signal and the second characteristic audio signal;
calculating a second training loss of the audio recognition model according to the first phoneme sequence and the second phoneme sequence;
and calculating the training loss of the audio recognition model according to the first training loss and the second training loss.
7. The speech recognition method according to any one of claims 1 to 6, wherein the performing text extraction on the phoneme sequence comprises:
calculating a text generation probability from the phoneme sequence;
and identifying a text information relation between the phoneme sequences according to the text generation probability, and generating corresponding text according to the text information relation.
8. A speech recognition apparatus, comprising:
the spectrum analysis module is used for acquiring audio data, performing spectrum analysis on the audio data, and generating a Mel cepstrum of the audio data;
the phoneme sequence recognition module is used for performing feature extraction on the Mel cepstrum by using a pre-trained audio recognition model to obtain a characteristic audio signal, and recognizing a phoneme sequence of the characteristic audio signal;
and the text extraction module is used for performing text extraction on the phoneme sequence and taking the text extraction result as the recognition result of the audio data.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a speech recognition method according to any one of claims 1 to 7.
CN202110610069.6A 2021-06-01 2021-06-01 Voice recognition method, device, electronic equipment and storage medium Active CN113327586B (en)

Priority Applications (1)

Application Number: CN202110610069.6A (granted as CN113327586B)
Priority Date: 2021-06-01
Filing Date: 2021-06-01
Title: Voice recognition method, device, electronic equipment and storage medium

Publications (2)

CN113327586A (published 2021-08-31)
CN113327586B (published 2023-11-28)

Family

Family ID: 77423260

Family Applications (1)

Application Number: CN202110610069.6A (Active; granted as CN113327586B)
Priority Date: 2021-06-01
Filing Date: 2021-06-01
Title: Voice recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113327586B (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050014183A (en) * 2003-07-30 2005-02-07 주식회사 팬택 Method for modificating state
CN101447185A (en) * 2008-12-08 2009-06-03 深圳市北科瑞声科技有限公司 Audio frequency rapid classification method based on content
US20110224979A1 (en) * 2010-03-09 2011-09-15 Honda Motor Co., Ltd. Enhancing Speech Recognition Using Visual Information
CN106486119A (en) * 2016-10-20 2017-03-08 海信集团有限公司 A kind of method and apparatus of identification voice messaging
JP2019020598A (en) * 2017-07-18 2019-02-07 国立研究開発法人情報通信研究機構 Learning method of neural network
US20190130896A1 (en) * 2017-10-26 2019-05-02 Salesforce.Com, Inc. Regularization Techniques for End-To-End Speech Recognition
KR20190110939A (en) * 2018-03-21 2019-10-01 한국과학기술원 Environment sound recognition method based on convolutional neural networks, and system thereof
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
US20200175961A1 (en) * 2018-12-04 2020-06-04 Sorenson Ip Holdings, Llc Training of speech recognition systems
CN111489745A (en) * 2019-01-28 2020-08-04 上海菲碧文化传媒有限公司 Chinese speech recognition system applied to artificial intelligence
CN110827801A (en) * 2020-01-09 2020-02-21 成都无糖信息技术有限公司 Automatic voice recognition method and system based on artificial intelligence
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111681637A (en) * 2020-04-28 2020-09-18 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium
CN111862962A (en) * 2020-07-20 2020-10-30 汪秀英 Voice recognition method and system
CN112116903A (en) * 2020-08-17 2020-12-22 北京大米科技有限公司 Method and device for generating speech synthesis model, storage medium and electronic equipment
CN112133289A (en) * 2020-11-24 2020-12-25 北京远鉴信息技术有限公司 Voiceprint identification model training method, voiceprint identification device, voiceprint identification equipment and voiceprint identification medium
CN112735371A (en) * 2020-12-28 2021-04-30 出门问问(苏州)信息科技有限公司 Method and device for generating speaker video based on text information
CN112466279A (en) * 2021-02-02 2021-03-09 深圳市阿卡索资讯股份有限公司 Automatic correction method and device for spoken English pronunciation
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763952A (en) * 2021-09-03 2021-12-07 深圳市北科瑞声科技股份有限公司 Dynamic voice recognition method and device, electronic equipment and storage medium
CN113763952B (en) * 2021-09-03 2022-07-26 深圳市北科瑞声科技股份有限公司 Dynamic voice recognition method and device, electronic equipment and storage medium
CN113808577A (en) * 2021-09-18 2021-12-17 平安银行股份有限公司 Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN114743554A (en) * 2022-06-09 2022-07-12 武汉工商学院 Intelligent household interaction method and device based on Internet of things

Also Published As

Publication number Publication date
CN113327586B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN110085251B (en) Human voice extraction method, human voice extraction device and related products
CN105976812B (en) A kind of audio recognition method and its equipment
CN113327586B (en) Voice recognition method, device, electronic equipment and storage medium
CN110265040A (en) Training method, device, storage medium and the electronic equipment of sound-groove model
CN109785820A (en) A kind of processing method, device and equipment
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
KR20130133858A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN113420556B (en) Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN110738998A (en) Voice-based personal credit evaluation method, device, terminal and storage medium
CN111243569A (en) Emotional voice automatic generation method and device based on generation type confrontation network
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN116665669A (en) Voice interaction method and system based on artificial intelligence
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN112735371A (en) Method and device for generating speaker video based on text information
CN110264993A (en) Phoneme synthesizing method, device, equipment and computer readable storage medium
CN111429914B (en) Microphone control method, electronic device and computer readable storage medium
CN113807103A (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN114999533A (en) Intelligent question-answering method, device, equipment and storage medium based on emotion recognition
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning
CN117033556A (en) Memory preservation and memory extraction method based on artificial intelligence and related equipment
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium
CN113053409B (en) Audio evaluation method and device
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant