CN112037776A - Voice recognition method, voice recognition device and terminal equipment


Info

Publication number
CN112037776A
Authority
CN
China
Prior art keywords
layer
model
coding
signal
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910407591.7A
Other languages
Chinese (zh)
Inventor
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd filed Critical Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN201910407591.7A
Publication of CN112037776A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method, a voice recognition device and a terminal device. The method includes: acquiring a voice signal to be recognized; extracting features of the voice signal to obtain a feature sequence of the voice signal; and inputting the feature sequence into a trained first neural network model so that the first neural network model recognizes the voice signal and outputs a first signal representing text information of the voice signal. The first neural network model is an attention-based coding and decoding model comprising a coding model and a decoding model, both of which include multi-head attention layers; each feed-forward layer in the coding model is connected to a multi-head attention layer, and each feed-forward layer in the decoding model is likewise connected to a multi-head attention layer. The method and the device can improve the accuracy of voice recognition to a certain extent.

Description

Voice recognition method, voice recognition device and terminal equipment
Technical Field
The present application belongs to the field of speech recognition technology, and in particular, to a speech recognition method, a speech recognition apparatus, a terminal device, and a computer-readable storage medium.
Background
In practical use, the text recognized from a speech signal may not carry the meaning the speaker intends to express; for example, speech recognition of the utterance "I want to watch a movie" may produce homophone errors such as "I want to watch a shop" or "I medicine watch a movie". Therefore, a speech recognition method with high recognition accuracy is urgently needed.
Disclosure of Invention
In view of the above, the present application provides a speech recognition method, a speech recognition apparatus, a terminal device and a computer readable storage medium, which can improve the recognition accuracy of a speech signal to a certain extent.
A first aspect of the present application provides a speech recognition method, including:
acquiring a voice signal to be recognized;
extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
inputting the characteristic sequence into a trained first neural network model, so that the trained first neural network model recognizes the voice signal to obtain a first signal output by the first neural network model, wherein the first signal is used for representing text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
A second aspect of the present application provides a speech recognition apparatus, comprising:
the voice acquisition module is used for acquiring a voice signal to be recognized;
the feature extraction module is used for extracting the features of the voice signals to obtain a feature sequence of the voice signals;
a voice recognition module, configured to input the feature sequence to a trained first neural network model, so that the trained first neural network model recognizes the voice signal, and a first signal output by the first neural network model is obtained, where the first signal is used to represent text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
A third aspect of the present application provides a terminal device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech recognition method according to the first aspect when executing the computer program.
A fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method of the first aspect as described above.
A fifth aspect of the application provides a computer program product comprising a computer program which, when executed by one or more processors, implements the speech recognition method according to the first aspect as described above.
Therefore, in the speech recognition method provided by the application, a speech signal to be recognized is recognized by a trained first neural network model to obtain a first signal representing text information of the speech signal. The first neural network model is an attention-based coding and decoding model in which each feed-forward layer of the coding model is connected to a multi-head attention layer and each feed-forward layer of the decoding model is likewise connected to a multi-head attention layer, so that the attention mechanism is embedded into the internal structure of both the coding model and the decoding model. Using an attention mechanism in a neural network model for speech recognition already improves recognition accuracy to a certain extent; applying the attention mechanism additionally inside the internal structure of the model further improves its recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a second neural network model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a first neural network model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training method for the first neural network model shown in FIG. 3 according to an embodiment of the present application;
FIG. 5 is a table of performance test results provided in the first embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to the second embodiment of the present application;
fig. 7 is a schematic diagram of a terminal device provided in the third embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The speech recognition method provided by the embodiment of the application is applicable to the terminal device, and the terminal device includes, but is not limited to: smart phones, digital cameras, palm top computers, notebooks, desktop computers, intelligent wearable devices, and the like.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Example one
The following describes a speech recognition method provided in the first embodiment of the present application, where the speech recognition method is applied to a terminal device. Referring to fig. 1, a speech recognition method according to a first embodiment of the present application includes:
in step S101, a speech signal to be recognized is acquired;
in this embodiment of the present application, the voice signal to be recognized may be a voice signal input to the terminal device by a user through a microphone; it may also be a voice signal downloaded from the Internet by the user, or a voice signal in an audio/video file stored locally on the terminal device. The source of the speech signal is not limited in this application.
In step S102, extracting the features of the speech signal to obtain a feature sequence of the speech signal;
generally, before a speech signal is processed by a neural network model, the speech signal needs to be preprocessed, that is, a feature sequence of the speech signal is extracted. The feature sequence may consist of Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, Mel filter-bank coefficients (FBANK), or the like.
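As an illustration only, the feature extraction described above may be sketched with the librosa toolkit as follows; the choice of toolkit, the parameter values and the placeholder file name are assumptions of this sketch, not part of the present application:

```python
# Minimal feature-extraction sketch: produces an MFCC sequence and a log-Mel
# filter-bank (FBANK) sequence, each of shape (frames, coefficients).
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T         # (frames, n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    fbank = librosa.power_to_db(mel).T                               # (frames, n_mels)
    return mfcc, fbank

# mfcc, fbank = extract_features("utterance.wav")  # "utterance.wav" is a placeholder path
```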
In step S103, inputting the feature sequence into a trained first neural network model, so that the trained first neural network model recognizes the speech signal, and obtaining a first signal output by the first neural network model, where the first signal is used to represent text information of the speech signal;
in an embodiment of the present application, the first neural network model is an attention-based coding/decoding model, the coding/decoding model includes a coding model and a decoding model, the coding model and the decoding model both include a multi-head attention layer, each feed-forward layer in the coding model is connected to a multi-head attention layer, and each feed-forward layer in the decoding model is also connected to a multi-head attention layer.
The "text information" in step S103 may be pinyin information, chinese text information, english text information, japanese text information, or the like, and the specific representation form of the text information is not limited in the present application.
In general, when a user inputs a Chinese speech signal through a microphone, the user usually expects to obtain the corresponding Chinese character information; for example, after speaking the Chinese sentence "I am at 6 o'clock at night" into the microphone, the user expects the terminal device to return the Chinese characters of that sentence. To implement this function, either of the following two methods may be employed:
the first method is to directly train the first neural network model into a neural network model for converting a chinese speech signal into chinese text information (at this time, the "speech signal" in the step S101 is specifically a "chinese speech signal", and the "text information" in the step S102 is specifically "chinese text information"). In this way, after the Chinese speech signal X to be recognized is acquired, the Chinese speech signal X is directly converted into a signal for representing Chinese character information by using the first neural network model.
The second method addresses a technical problem of the end-to-end approach described in the first method. Because the Chinese vocabulary is updated very quickly, many new words continually appear (newly coined slang such as "social animal", for example). To keep the speech recognition accuracy of the first neural network model in the first method up to date, the training sample library would have to be updated frequently and the first neural network model retrained on the updated library. Owing to the complexity of speech features, a neural network model for speech recognition usually requires a long training time, so the first method suffers from the technical problem that, in order to keep pace with the update speed of the Chinese vocabulary, the first neural network model must be retrained frequently, and each retraining consumes a large amount of training time. To solve this problem, the following second method may be adopted:
firstly, the first neural network model is trained as a neural network model that converts a Chinese speech signal into pinyin information (in this case, the "speech signal" in step S101 is specifically a "Chinese speech signal", and the "text information" in step S103 is specifically "pinyin information"), and a second neural network model is trained to convert a signal carrying pinyin information into Chinese text information. The second neural network model may be an RNN model (such as a deep bidirectional long short-term memory network, Deep-BiLSTM) or a CNN model (fig. 2 shows a structural schematic diagram of a second neural network model provided by the present application for converting a signal carrying pinyin information into Chinese text information);
secondly, after the first neural network model and the second neural network model have been trained, when a Chinese speech signal Y to be recognized is obtained, the first neural network model is used to obtain a first signal representing the pinyin information of the Chinese speech signal Y; the first signal is then input to the second neural network model to obtain a second signal representing the Chinese text information of the speech signal Y. (As those skilled in the art will readily understand, if the second neural network model is instead trained to convert a signal carrying pinyin information into foreign-language text information, the combination of the first and second neural network models can convert Chinese speech into foreign-language text.)
In the second method, in order to keep pace with the update speed of the Chinese vocabulary while preserving recognition accuracy, the first neural network model does not need to be retrained frequently; only the second neural network model does. For a character-to-character neural network model the training time is relatively short, so in the second method only the second neural network model needs frequent retraining, and its training time is short. The second method therefore solves the technical problem of the first method to a certain extent.
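The cascade of the second method can be sketched as follows; `acoustic_model`, `pinyin_to_text_model` and their call signatures are hypothetical placeholders used only to illustrate the data flow, not interfaces defined by the present application:

```python
# Two-stage recognition sketch: the first neural network model produces the
# first signal (pinyin information), and the second neural network model maps
# it to the second signal (Chinese character or foreign-language text).
def recognize(feature_sequence, acoustic_model, pinyin_to_text_model):
    first_signal = acoustic_model(feature_sequence)        # speech features -> pinyin
    second_signal = pinyin_to_text_model(first_signal)     # pinyin -> text
    return second_signal
```

Under this split, only `pinyin_to_text_model` needs frequent retraining as new vocabulary appears, which is the point of the second method.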
In the first embodiment of the present application, the structure of the first neural network model used for speech recognition is also improved. Introducing an attention mechanism into the first neural network model can improve its speech recognition accuracy to a certain extent. However, when a coding and decoding model is currently used for speech recognition, the attention network is usually inserted between the coding model and the decoding model, and the attention mechanism is not introduced into the internal structure of the coding model or the decoding model themselves. In the present application, the attention mechanism is introduced into the feed-forward layers of both the coding model and the decoding model, which can further improve the speech recognition accuracy to a certain extent.
The following describes a structure of the first neural network model described in the present application in detail with reference to fig. 3, and those skilled in the art should understand that the example shown in fig. 3 is only an example of the first neural network model in the present application and does not constitute a limitation to the structure of the first neural network model.
The first neural network model of the present application includes a coding model and a decoding model. In the coding model, the feed-forward layers and the multi-head attention layers may have the same number of layers, assumed to be N1 (N1 is an integer greater than 0; N1 = 3 in the example shown in fig. 3). In the decoding model, the feed-forward layers and the multi-head attention layers may likewise have the same number of layers, assumed to be N2 (N2 is an integer greater than 0; N2 = 3 in the example shown in fig. 3).
Each feed-forward layer in the coding model is connected to a multi-head attention layer, and the specific connection may be as follows: the input of the i1-th feed-forward layer in the coding model is connected to the output of the i1-th multi-head attention layer in the coding model, with i1 = 1 … N1; when N1 > 1, the output of the j1-th feed-forward layer in the coding model is further connected to the input of the (j1+1)-th multi-head attention layer in the coding model, with j1 = 1 … N1-1. As shown in fig. 3, the coding model connects, in sequence, the first multi-head attention layer (Multi-head attention layer 1), the first feed-forward layer (Feed forward layer 1), the second multi-head attention layer (Multi-head attention layer 2), the second feed-forward layer (Feed forward layer 2), the third multi-head attention layer (Multi-head attention layer 3), and the third feed-forward layer (Feed forward layer 3).
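For illustration, the coding-model wiring described above can be sketched in PyTorch as follows; the framework choice, layer width and head count are assumptions of this sketch, not limitations of the present application:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Coding model of Fig. 3: N1 pairs of (multi-head attention layer, feed-forward layer)."""
    def __init__(self, d_model=256, num_heads=4, num_layers=3):   # N1 = 3 as in Fig. 3
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.ff_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_layers)])

    def forward(self, x):                  # x: (batch, frames, d_model)
        for attn, ff in zip(self.attn_layers, self.ff_layers):
            x, _ = attn(x, x, x)           # output of multi-head attention layer i1 ...
            x = ff(x)                      # ... feeds the input of feed-forward layer i1
        return x
```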
Each feed-forward layer in the decoding model is likewise connected to a multi-head attention layer, and the specific connection may be as follows: the input of the i2-th feed-forward layer in the decoding model is connected to the output of the i2-th multi-head attention layer in the decoding model, with i2 = 1 … N2; when N2 > 1, the output of the j2-th feed-forward layer in the decoding model is further connected to the input of the (j2+1)-th multi-head attention layer in the decoding model, with j2 = 1 … N2-1. As shown in fig. 3, the decoding model connects, in sequence, the first multi-head attention layer, the first feed-forward layer, the second multi-head attention layer, the second feed-forward layer, the third multi-head attention layer, and the third feed-forward layer. In addition, in this application, if N2 > 1, the i3-th feed-forward layer in the decoding model may also correspond to an i3-th masking multi-head attention layer, with i3 = 2 … N2; specifically, the output of the j2-th feed-forward layer in the decoding model may be connected to the input of the (j2+1)-th multi-head attention layer through the (j2+1)-th masking multi-head attention layer in the decoding model. That is, in the example shown in fig. 3, Feed forward layer 1 of the decoding model may be connected to Multi-head attention layer 2 through the second masking multi-head attention layer (Mask multi-head attention layer 2), and Feed forward layer 2 of the decoding model is connected to Multi-head attention layer 3 through the third masking multi-head attention layer.
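A corresponding decoding-model sketch, continuing the PyTorch sketch above, is given below. Feeding the encoder output to the multi-head attention layers as key and value, and using a standard causal mask in the masking multi-head attention layers, are plausible readings of fig. 3 rather than explicit statements of the present application:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoding model of Fig. 3: a masking multi-head attention layer bridges
    feed-forward layer j2 and multi-head attention layer j2+1 (layers 2 ... N2)."""
    def __init__(self, d_model=256, num_heads=4, num_layers=3):   # N2 = 3 as in Fig. 3
        super().__init__()
        make_attn = lambda: nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.mask_attn_layers = nn.ModuleList([make_attn() for _ in range(num_layers)])
        self.attn_layers = nn.ModuleList([make_attn() for _ in range(num_layers)])
        self.ff_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_layers)])

    def forward(self, x, memory):          # memory: output of the coding model
        for i in range(len(self.ff_layers)):
            if i > 0:                      # masking multi-head attention layers 2 ... N2;
                                           # layer 1 is used during training (see Fig. 4)
                causal = torch.triu(torch.ones(x.size(1), x.size(1),
                                               device=x.device), diagonal=1).bool()
                x, _ = self.mask_attn_layers[i](x, x, x, attn_mask=causal)
            x, _ = self.attn_layers[i](x, memory, memory)   # multi-head attention layer i2
            x = self.ff_layers[i](x)                        # feed-forward layer i2
        return x
```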
In addition, the coding model may further include a fully connected (dense) layer and a position-embedding layer, and the decoding model may further include a fully connected (dense) layer and an argmax layer. The dense layer and the position-embedding layer of the coding model receive the feature sequence of the speech signal to be recognized, and the sum of their outputs is input to the first multi-head attention layer of the coding model. The output of the N2-th feed-forward layer of the decoding model is connected to the input of the dense layer of the decoding model, the output of which is connected to the argmax layer; the argmax layer outputs the first signal described in step S103 (see fig. 3). Furthermore, as shown in fig. 3, the output of the N1-th feed-forward layer of the coding model may be connected to the first multi-head attention layer of the decoding model.
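The input and output wiring of fig. 3 can then be assembled as sketched below, reusing the Encoder and Decoder sketches above; the feature dimension, vocabulary size, maximum length, and the simplification of driving the decoder with the encoder output at inference time are assumptions of this sketch:

```python
class SpeechRecognizer(nn.Module):
    """Fig. 3 assembly: dense + position embedding -> encoder -> decoder -> dense -> argmax."""
    def __init__(self, feat_dim=39, d_model=256, vocab_size=1500, max_len=2000):
        super().__init__()
        self.input_dense = nn.Linear(feat_dim, d_model)       # dense layer of the coding model
        self.pos_embedding = nn.Embedding(max_len, d_model)   # position-embedding layer
        self.encoder = Encoder(d_model)
        self.decoder = Decoder(d_model)
        self.output_dense = nn.Linear(d_model, vocab_size)    # dense layer of the decoding model

    def forward(self, features):                              # features: (batch, frames, feat_dim)
        positions = torch.arange(features.size(1), device=features.device)
        x = self.input_dense(features) + self.pos_embedding(positions)   # sum of the two inputs
        memory = self.encoder(x)
        y = self.decoder(memory, memory)
        logits = self.output_dense(y)
        return logits.argmax(dim=-1)                          # argmax layer -> first signal
```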
A method of training the first neural network model shown in fig. 3 is discussed below with reference to fig. 4. It will be appreciated by those skilled in the art that the first neural network model shown in fig. 3 may employ other training methods commonly used in the art, in addition to the training method shown in fig. 4. However, the training method illustrated in FIG. 4 can further improve the speech recognition accuracy of the model illustrated in FIG. 3 compared to conventional training methods.
When the coding and decoding model shown in fig. 3 is trained by the method of fig. 4, the following structure needs to be added to the decoding model: the decoding model further includes a label-embedding layer and a position-embedding layer, which receive the dimension of the label corresponding to each sample speech signal during training (compared with the common practice of inputting only the label, additionally inputting the label dimension during training can further improve the speech recognition accuracy of the trained model). In addition, the first multi-head attention layer of the decoding model corresponds to a first masking multi-head attention layer, which receives the sum of the outputs of the label-embedding layer and the position-embedding layer of the decoding model; the output of this first masking multi-head attention layer is connected to the first multi-head attention layer of the decoding model. As shown in fig. 4, the trained first neural network model is obtained through steps S401 to S404.
In step S401, each sample voice signal and a label corresponding to each sample voice signal are obtained;
if the coding and decoding model (i.e., the first neural network model) shown in fig. 3 is trained as a neural network that converts a Chinese speech signal into pinyin information, each sample speech signal obtained in step S401 should be a Chinese speech signal, and the label corresponding to each sample speech signal should be a signal containing the pinyin information of that sample speech signal.
In step S402, for each sample speech signal, the feature sequence of the sample speech signal is input to the fully connected layer and the position-embedding layer of the coding model in fig. 3, and the dimension of the label corresponding to the sample speech signal is input to the label-embedding layer and the position-embedding layer of the decoding model, so as to obtain a prediction signal corresponding to each sample speech signal;
the training method provided in fig. 4 requires inputting the dimension of the label corresponding to each sample speech signal (for example, if the pinyin information that the label X represents is: wo3 yao4 kan4 dian4 ying3, the dimension of the label X is 5, that is, if the sample speech signal is chinese, the dimension of the label is the number of words that the label represents, and if the sample speech signal is english, the dimension of the label is the number of words that the label represents) into the label embedding layer of the decoding model and the position embedding layer of the decoding model.
The "prediction signal" in step S402 is used to represent the text information of the sample speech signal input to the first neural network model in fig. 3.
In step S403, determining the speech recognition accuracy of the coding/decoding model in fig. 3 according to the label corresponding to each sample speech signal and each prediction signal;
in step S404, continuously adjusting parameters of each layer in the coding/decoding model until the speech recognition accuracy reaches a preset accuracy;
that is, the speech recognition accuracy of the coding and decoding model is determined by comparing each prediction signal against the label corresponding to each sample speech signal, and the parameters of each layer are adjusted continuously until the speech recognition accuracy reaches the preset accuracy.
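Steps S401 to S404 may be sketched as the following training loop. The cross-entropy objective, the optimizer, and the names `dataset`, `model` and `target_accuracy` are assumptions of this sketch, since the present application only specifies that parameters are adjusted until the recognition accuracy reaches a preset value:

```python
import torch

def train(model, dataset, target_accuracy=0.95, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    accuracy = 0.0
    while accuracy < target_accuracy:                        # step S404: adjust until preset accuracy
        correct, total = 0, 0
        for features, labels, label_dim in dataset:          # step S401: sample signals and labels
            # step S402: the training-time forward pass also consumes the label
            # dimension; the exact signature is an assumption of this sketch.
            logits = model(features, labels, label_dim)
            loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
        accuracy = correct / total                           # step S403: recognition accuracy
    return model
```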
The performance test of the solution provided in the present application is described below with reference to fig. 5. First, the training method shown in fig. 4 is used to obtain a first neural network model, Model1, which converts a Chinese speech signal into pinyin information (the MFCC sequences of the sample speech signals are extracted when training Model1), and the second neural network model structure shown in fig. 2 is trained to obtain Model2, which converts a signal carrying pinyin information into Chinese text information (Model2 may be trained with a conventional training method, which is not described again here). Next, with 4000 hours of test audio data, the error rate of Model1, the error rate of Model2, and the error rate of cascading Model1 and Model2 are measured. The specific data are shown in fig. 5. As can be seen from fig. 5, the error rate of the solution provided by the present application is low.
It should be understood that the sequence numbers of the steps in the foregoing method embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Example two
A second embodiment of the present application provides a speech recognition apparatus, which, for convenience of description, only shows a part related to the present application, and as shown in fig. 6, the speech recognition apparatus 600 includes:
a voice acquiring module 601, configured to acquire a voice signal to be recognized;
a feature extraction module 602, configured to extract features of the voice signal to obtain a feature sequence of the voice signal;
a speech recognition module 603, configured to input the feature sequence to a trained first neural network model, so that the trained first neural network model recognizes the speech signal, and obtains a first signal output by the first neural network model, where the first signal is used to represent text information of the speech signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
Optionally, the voice signal is a chinese voice signal, and the text information is pinyin information of the chinese voice signal;
accordingly, the speech recognition apparatus 600 further includes:
and the second voice recognition module is used for inputting the first signal to a trained second neural network model to obtain a second signal output by the second neural network model, wherein the second signal is used for representing Chinese character information or foreign character information of the voice signal, and the second neural network model is a Recurrent Neural Network (RNN) model or a Convolutional Neural Network (CNN) model.
Optionally, the number of layers of the feedforward layer and the multi-head attention layer in the coding model is N1, the number of layers of the feedforward layer and the multi-head attention layer in the decoding model is N2, and N1 and N2 are integers greater than 0;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the coding model is specifically as follows:
the input end of the feed-forward layer of the i1-th layer in the coding model is connected with the output end of the multi-head attention layer of the i1-th layer in the coding model, and i1 = 1 … N1;
if N1 > 1, the output end of the feed-forward layer of the j1-th layer in the coding model is also connected with the input end of the multi-head attention layer of the (j1+1)-th layer in the coding model, and j1 = 1 … N1-1;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the decoding model specifically comprises the following steps:
the input end of the feed-forward layer of the i2-th layer in the decoding model is connected with the output end of the multi-head attention layer of the i2-th layer in the decoding model, and i2 = 1 … N2;
if N2 > 1, the output end of the feed-forward layer of the j2-th layer in the decoding model is also connected with the input end of the multi-head attention layer of the (j2+1)-th layer in the decoding model, and j2 = 1 … N2-1.
Optionally, if N2 > 1, the i3-th feed-forward layer of the decoding model corresponds to an i3-th masking multi-head attention layer, with i3 = 2 … N2;
correspondingly, the output end of the feed-forward layer at the j2 th layer in the decoding model is further connected to the input end of the multi-head attention layer at the j2+1 th layer in the decoding model, specifically:
the output end of the feed-forward layer of the j2 th layer in the decoding model is connected with the input end of the multi-head attention layer of the j2+1 th layer in the decoding model through the masking multi-head attention layer of the j2+1 th layer in the decoding model.
Optionally, the coding model and the decoding model both include a fully connected layer, the coding model further includes a position embedding layer, and the decoding model further includes an argmax layer;
the fully connected layer of the coding model and the position embedding layer of the coding model are used for receiving the feature sequence of the speech signal, the sum of the output signals of the fully connected layer of the coding model and the position embedding layer of the coding model is input into the first multi-head attention layer in the coding model, and the output end of the N1-th feed-forward layer of the coding model is connected to the first multi-head attention layer of the decoding model;
the output end of the N2-th feed-forward layer of the decoding model is connected to the input end of the fully connected layer of the decoding model, the output end of the fully connected layer of the decoding model is connected to the argmax layer, and the output of the argmax layer is the first signal.
Optionally, the decoding model further includes a label embedding layer and a position embedding layer, wherein the label embedding layer and the position embedding layer in the decoding model are configured to receive the dimension of the label corresponding to a sample speech signal when the coding and decoding model is trained, the first multi-head attention layer of the decoding model corresponds to a first masking multi-head attention layer, the first masking multi-head attention layer is configured to receive the sum of the outputs of the label embedding layer and the position embedding layer in the decoding model, and the output end of the first masking multi-head attention layer is connected to the first multi-head attention layer of the decoding model;
accordingly, the coding and decoding model is trained through the following modules:
the sample acquisition module is used for acquiring each sample voice signal and a label corresponding to each sample voice signal;
a prediction signal obtaining module, configured to, for each sample speech signal, input the feature sequence of the sample speech signal to the fully connected layer in the coding model and the position embedding layer in the coding model, and input the dimension of the label corresponding to the sample speech signal to the label embedding layer and the position embedding layer in the decoding model, so as to obtain a prediction signal corresponding to each sample speech signal;
the accuracy determining module is used for determining the speech recognition accuracy of the coding and decoding model according to the labels corresponding to the sample speech signals and the prediction signals;
and the parameter adjusting module is used for continuously adjusting the parameters in each layer in the coding and decoding model until the speech recognition accuracy reaches the preset accuracy.
It should be noted that, because the contents of information interaction, execution process, and the like between the above-mentioned apparatuses/units are based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof can be referred to specifically in the method embodiment section, and are not described herein again.
Example three
Fig. 7 is a schematic diagram of a terminal device according to the third embodiment of the present application. As shown in fig. 7, the terminal device of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70. The processor 70, when executing the computer program 72, implements the steps of the above speech recognition method embodiment, such as steps S101 to S103 shown in fig. 1. Alternatively, the processor 70, when executing the computer program 72, implements the functions of the modules/units in the above device embodiments, for example, the functions of the modules 601 to 603 shown in fig. 6.
Illustratively, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into a speech acquisition module, a feature extraction module, and a speech recognition module, and each module specifically functions as follows:
acquiring a voice signal to be recognized;
extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
inputting the characteristic sequence into a trained first neural network model, so that the trained first neural network model recognizes the voice signal to obtain a first signal output by the first neural network model, wherein the first signal is used for representing text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
The terminal device may include, but is not limited to, a processor 70 and a memory 71. It will be appreciated by those skilled in the art that fig. 7 is merely an example of a terminal device 7, and does not constitute a limitation of the terminal device 7, and may include more or less components than those shown, or combine some components, or different components, for example, the terminal device may also include input and output devices, network access devices, buses, etc.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store the computer program and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the above modules or units is only one logical function division, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units described above, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and used by a processor to implement the steps of the embodiments of the methods described above. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the above-mentioned computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signal, telecommunication signal, software distribution medium, etc. It should be noted that the computer readable medium described above may include content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media that does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring a voice signal to be recognized;
extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
inputting the characteristic sequence into a trained first neural network model, so that the trained first neural network model recognizes the voice signal to obtain a first signal output by the first neural network model, wherein the first signal is used for representing text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
and each feed-forward layer in the coding model is connected with a multi-head attention layer, and each feed-forward layer in the decoding model is also connected with a multi-head attention layer.
2. The speech recognition method of claim 1, wherein the speech signal is a Chinese speech signal, and the text information is pinyin information of the speech signal;
correspondingly, after the step of inputting the feature sequence to the trained first neural network model so that the trained first neural network model recognizes the speech signal, obtaining the first signal output by the first neural network model, the speech recognition method further includes:
and inputting the first signal into a trained second neural network model to obtain a second signal output by the second neural network model, wherein the second signal is used for representing Chinese character information or foreign character information of the voice signal, and the second neural network model is a Recurrent Neural Network (RNN) model or a Convolutional Neural Network (CNN) model.
3. The speech recognition method according to claim 1 or 2, wherein the number of layers of the feedforward layer and the multi-head attention layer in the coding model is N1, the number of layers of the feedforward layer and the multi-head attention layer in the decoding model is N2, and each of the N1 and the N2 is an integer greater than 0;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the coding model is specifically as follows:
the input end of the feed-forward layer of the i1-th layer in the coding model is connected with the output end of the multi-head attention layer of the i1-th layer in the coding model, and i1 = 1 … N1;
if N1 > 1, the output end of the feed-forward layer of the j1-th layer in the coding model is also connected with the input end of the multi-head attention layer of the (j1+1)-th layer in the coding model, and j1 = 1 … N1-1;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the decoding model specifically comprises the following steps:
the input end of the feed-forward layer of the i2-th layer in the decoding model is connected with the output end of the multi-head attention layer of the i2-th layer in the decoding model, and i2 = 1 … N2;
if N2 > 1, the output end of the feed-forward layer of the j2-th layer in the decoding model is also connected with the input end of the multi-head attention layer of the (j2+1)-th layer in the decoding model, and j2 = 1 … N2-1.
4. The speech recognition method of claim 3, wherein if N2 > 1, the i3-th feed-forward layer of the decoding model corresponds to an i3-th masking multi-head attention layer, with i3 = 2 … N2;
correspondingly, the output end of the feed-forward layer at the j2 th layer in the decoding model is further connected to the input end of the multi-head attention layer at the j2+1 th layer in the decoding model, specifically:
the output end of the feed-forward layer of the j2 th layer in the decoding model is connected with the input end of the multi-head attention layer of the j2+1 th layer in the decoding model through the masking multi-head attention layer of the j2+1 th layer in the decoding model.
5. The speech recognition method of claim 4, wherein the coding model and the decoding model each comprise a fully connected (dense) layer, the coding model further comprises a position embedding layer, and the decoding model further comprises an argmax layer;
the fully connected layer of the coding model and the position embedding layer of the coding model are used for receiving the feature sequence of the speech signal, the sum of the output signals of the fully connected layer of the coding model and the position embedding layer of the coding model is input into the first multi-head attention layer in the coding model, and the output end of the N1-th feed-forward layer of the coding model is connected to the first multi-head attention layer of the decoding model;
the output end of the N2-th feed-forward layer of the decoding model is connected to the input end of the fully connected layer of the decoding model, the output end of the fully connected layer of the decoding model is connected to the argmax layer, and the output of the argmax layer is the first signal.
6. The speech recognition method according to claim 5, wherein the decoding model further includes a label embedding layer and a position embedding layer, wherein the label embedding layer and the position embedding layer in the decoding model are used for receiving the dimension of the label corresponding to a sample speech signal when the coding and decoding model is trained, the first multi-head attention layer of the decoding model corresponds to a first masking multi-head attention layer, the first masking multi-head attention layer is used for receiving the sum of the outputs of the label embedding layer and the position embedding layer in the decoding model, and the output end of the first masking multi-head attention layer is connected to the first multi-head attention layer of the decoding model;
accordingly, the training process of the coding and decoding model is as follows:
obtaining each sample voice signal and a label corresponding to each sample voice signal;
for each sample voice signal, inputting the feature sequence of the sample voice signal into the fully connected layer in the coding model and the position embedding layer in the coding model, and inputting the dimension of the label corresponding to the sample voice signal into the label embedding layer and the position embedding layer in the decoding model, to obtain a prediction signal corresponding to each sample voice signal;
determining the speech recognition accuracy of the coding and decoding model according to the label corresponding to each sample speech signal and each prediction signal;
and continuously adjusting parameters in each layer in the coding and decoding model until the speech recognition accuracy reaches preset accuracy.
7. A speech recognition apparatus, comprising:
the voice acquisition module is used for acquiring a voice signal to be recognized;
the characteristic extraction module is used for extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
a voice recognition module, configured to input the feature sequence to a trained first neural network model, so that the trained first neural network model recognizes the voice signal, and a first signal output by the first neural network model is obtained, where the first signal is used to represent text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
and each feed-forward layer in the coding model is connected with a multi-head attention layer, and each feed-forward layer in the decoding model is also connected with a multi-head attention layer.
8. The speech recognition apparatus of claim 7, wherein the speech signal is a Chinese speech signal, and the text information is pinyin information of the speech signal;
accordingly, the speech recognition apparatus further includes:
and the second voice recognition module is used for inputting the first signal to a trained second neural network model to obtain a second signal output by the second neural network model, wherein the second signal is used for representing Chinese character information or foreign character information of the voice signal, and the second neural network model is a Recurrent Neural Network (RNN) model or a Convolutional Neural Network (CNN) model.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 6.
CN201910407591.7A 2019-05-16 2019-05-16 Voice recognition method, voice recognition device and terminal equipment Pending CN112037776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407591.7A CN112037776A (en) 2019-05-16 2019-05-16 Voice recognition method, voice recognition device and terminal equipment

Publications (1)

Publication Number Publication Date
CN112037776A (en) 2020-12-04

Family

ID=73575741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407591.7A Pending CN112037776A (en) 2019-05-16 2019-05-16 Voice recognition method, voice recognition device and terminal equipment

Country Status (1)

Country Link
CN (1) CN112037776A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110218796A1 (en) * 2010-03-05 2011-09-08 Microsoft Corporation Transliteration using indicator and hybrid generative features
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
US20170169830A1 (en) * 2015-12-11 2017-06-15 Electronics And Telecommunications Research Institute Method and apparatus for inserting data to audio signal or extracting data from audio signal based on time domain
WO2018217948A1 (en) * 2017-05-23 2018-11-29 Google Llc Attention-based sequence transduction neural networks
CN109697974A (en) * 2017-10-19 2019-04-30 百度(美国)有限责任公司 Use the system and method for the neural text-to-speech that convolution sequence learns
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108491514A (en) * 2018-03-26 2018-09-04 清华大学 The method and device putd question in conversational system, electronic equipment, computer-readable medium
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109344413A (en) * 2018-10-16 2019-02-15 北京百度网讯科技有限公司 Translation processing method and device
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109558576A (en) * 2018-11-05 2019-04-02 中山大学 A kind of punctuation mark prediction technique based on from attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴邦誉 (WU Bangyu): "A Chinese Dialogue Model Using Pinyin-Based Dimensionality Reduction" (采用拼音降维的中文对话模型), 《中文信息学报》 (Journal of Chinese Information Processing), vol. 33, no. 5, pages 113-121 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951218A (en) * 2021-03-22 2021-06-11 百果园技术(新加坡)有限公司 Voice processing method and device based on neural network model and electronic equipment
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113129869B (en) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN112951218B (en) * 2021-03-22 2024-03-29 百果园技术(新加坡)有限公司 Voice processing method and device based on neural network model and electronic equipment
CN113096621A (en) * 2021-03-26 2021-07-09 平安科技(深圳)有限公司 Music generation method, device and equipment based on specific style and storage medium
CN113096621B (en) * 2021-03-26 2024-05-28 平安科技(深圳)有限公司 Music generation method, device, equipment and storage medium based on specific style
CN113611289A (en) * 2021-08-06 2021-11-05 上海汽车集团股份有限公司 Voice recognition method and device
CN113611289B (en) * 2021-08-06 2024-06-18 上海汽车集团股份有限公司 Voice recognition method and device
CN114993677A (en) * 2022-05-11 2022-09-02 山东大学 Rolling bearing fault diagnosis method and system based on unbalanced small sample data

Similar Documents

Publication Publication Date Title
CN112037776A (en) Voice recognition method, voice recognition device and terminal equipment
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
WO2020253060A1 (en) Speech recognition method, model training method, apparatus and device, and storage medium
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN112183120A (en) Speech translation method, device, equipment and storage medium
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
CN112633947B (en) Text generation model generation method, text generation method, device and equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112259089A (en) Voice recognition method and device
CN112397051A (en) Voice recognition method and device and terminal equipment
WO2022156434A1 (en) Method and apparatus for generating text
US20210004603A1 (en) Method and apparatus for determining (raw) video materials for news
CN111667810A (en) Method and device for acquiring polyphone corpus, readable medium and electronic equipment
CN111862961A (en) Method and device for recognizing voice
CN111241853A (en) Session translation method, device, storage medium and terminal equipment
CN109949814A (en) Audio recognition method, system, computer system and computer readable storage medium
CN113468344B (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN111538817A (en) Man-machine interaction method and device
CN113053362A (en) Method, device, equipment and computer readable medium for speech recognition
US11238865B2 (en) Function performance based on input intonation
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN116825084A (en) Cross-language speech synthesis method and device, electronic equipment and storage medium
CN115101075A (en) Voice recognition method and related device
CN110728137B (en) Method and device for word segmentation
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination