CN112037776A - Voice recognition method, voice recognition device and terminal equipment


Info

Publication number
CN112037776A
Authority
CN
China
Prior art keywords
layer
model
coding
signal
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910407591.7A
Other languages
Chinese (zh)
Inventor
陈明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd filed Critical Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN201910407591.7A
Publication of CN112037776A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice recognition method, a voice recognition device and a terminal device. The method includes: acquiring a voice signal to be recognized; extracting features of the voice signal to obtain a feature sequence of the voice signal; and inputting the feature sequence into a trained first neural network model so that the first neural network model recognizes the voice signal and outputs a first signal representing text information of the voice signal. The first neural network model is an attention-based coding and decoding model comprising a coding model and a decoding model, both of which include multi-head attention layers; each feed-forward layer in the coding model is connected to a multi-head attention layer, and each feed-forward layer in the decoding model is likewise connected to a multi-head attention layer. The method and the device can improve the accuracy of voice recognition to a certain extent.

Description

Voice recognition method, voice recognition device and terminal equipment
Technical Field
The present application belongs to the field of speech recognition technology, and in particular, to a speech recognition method, a speech recognition apparatus, a terminal device, and a computer-readable storage medium.
Background
In practical use, the text recognized from a speech signal may not carry the meaning the speaker intends to express; for example, speech recognition of the utterance "I want to watch a movie" may produce homophone errors such as "I want to watch a shop" or "I medicine watch a movie". Therefore, a speech recognition method with high recognition accuracy is urgently needed.
Disclosure of Invention
In view of the above, the present application provides a speech recognition method, a speech recognition apparatus, a terminal device and a computer readable storage medium, which can improve the recognition accuracy of a speech signal to a certain extent.
A first aspect of the present application provides a speech recognition method, including:
acquiring a voice signal to be recognized;
extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
inputting the characteristic sequence into a trained first neural network model, so that the trained first neural network model recognizes the voice signal to obtain a first signal output by the first neural network model, wherein the first signal is used for representing text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
A second aspect of the present application provides a speech recognition apparatus, comprising:
the voice acquisition module is used for acquiring a voice signal to be recognized;
the feature extraction module is used for extracting the features of the voice signals to obtain a feature sequence of the voice signals;
a voice recognition module, configured to input the feature sequence to a trained first neural network model, so that the trained first neural network model recognizes the voice signal, and a first signal output by the first neural network model is obtained, where the first signal is used to represent text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
A third aspect of the present application provides a terminal device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech recognition method according to the first aspect when executing the computer program.
A fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition method of the first aspect as described above.
A fifth aspect of the application provides a computer program product comprising a computer program which, when executed by one or more processors, implements the speech recognition method according to the first aspect as described above.
Therefore, in the speech recognition method provided by the application, a speech signal to be recognized is recognized by a trained first neural network model to obtain a first signal representing text information of the speech signal. The first neural network model is an attention-based coding and decoding model in which each feed-forward layer of the coding model is connected to a multi-head attention layer and each feed-forward layer of the decoding model is likewise connected to a multi-head attention layer, so that the attention mechanism is embedded into the internal structure of both the coding model and the decoding model. Using an attention mechanism in a neural network model for speech recognition already improves recognition accuracy to a certain extent; applying the attention mechanism additionally inside the internal structure of the model further improves its recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a second neural network model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a first neural network model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a training method for the first neural network model shown in FIG. 3 according to an embodiment of the present application;
FIG. 5 is a table of performance test results provided in the first embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to the second embodiment of the present application;
fig. 7 is a schematic diagram of a terminal device provided in the third embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The speech recognition method provided by the embodiment of the application is applicable to the terminal device, and the terminal device includes, but is not limited to: smart phones, digital cameras, palm top computers, notebooks, desktop computers, intelligent wearable devices, and the like.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Example one
The following describes a speech recognition method provided in the first embodiment of the present application, where the speech recognition method is applied to a terminal device. Referring to fig. 1, a speech recognition method according to a first embodiment of the present application includes:
in step S101, a speech signal to be recognized is acquired;
in this embodiment of the present application, the voice signal to be recognized may be a voice signal input to the terminal device by a user through a microphone; it may also be a voice signal downloaded from the Internet by the user, or a voice signal in an audio/video file stored locally on the terminal device. The source of the speech signal is not limited in this application.
In step S102, extracting the features of the speech signal to obtain a feature sequence of the speech signal;
generally, before a speech signal is processed by a neural network model, the speech signal needs to be preprocessed, that is, a feature sequence of the speech signal is extracted. The feature sequence may consist of Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) coefficients, Mel filter-bank coefficients (FBANK), or the like.
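As an illustration only, the feature extraction described above may be sketched with the librosa toolkit as follows; the choice of toolkit, the parameter values and the placeholder file name are assumptions of this sketch, not part of the present application:

```python
# Minimal feature-extraction sketch: produces an MFCC sequence and a log-Mel
# filter-bank (FBANK) sequence, each of shape (frames, coefficients).
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13, n_mels=40):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T         # (frames, n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    fbank = librosa.power_to_db(mel).T                               # (frames, n_mels)
    return mfcc, fbank

# mfcc, fbank = extract_features("utterance.wav")  # "utterance.wav" is a placeholder path
```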
In step S103, inputting the feature sequence into a trained first neural network model, so that the trained first neural network model recognizes the speech signal, and obtaining a first signal output by the first neural network model, where the first signal is used to represent text information of the speech signal;
in an embodiment of the present application, the first neural network model is an attention-based coding/decoding model, the coding/decoding model includes a coding model and a decoding model, the coding model and the decoding model both include a multi-head attention layer, each feed-forward layer in the coding model is connected to a multi-head attention layer, and each feed-forward layer in the decoding model is also connected to a multi-head attention layer.
The "text information" in step S103 may be pinyin information, chinese text information, english text information, japanese text information, or the like, and the specific representation form of the text information is not limited in the present application.
In general, when a user inputs a Chinese speech signal through a microphone, the user usually expects to obtain the corresponding Chinese character information; for example, after speaking the Chinese sentence "I am at 6 o'clock at night" into the microphone, the user expects the terminal device to return the Chinese characters of that sentence. To implement this function, either of the following two methods may be employed:
the first method is to directly train the first neural network model into a neural network model for converting a chinese speech signal into chinese text information (at this time, the "speech signal" in the step S101 is specifically a "chinese speech signal", and the "text information" in the step S102 is specifically "chinese text information"). In this way, after the Chinese speech signal X to be recognized is acquired, the Chinese speech signal X is directly converted into a signal for representing Chinese character information by using the first neural network model.
The second method addresses a technical problem of the end-to-end approach described in the first method. Because the Chinese vocabulary is updated very quickly, many new words continually appear (newly coined slang such as "social animal", for example). To keep the speech recognition accuracy of the first neural network model in the first method up to date, the training sample library would have to be updated frequently and the first neural network model retrained on the updated library. Owing to the complexity of speech features, a neural network model for speech recognition usually requires a long training time, so the first method suffers from the technical problem that, in order to keep pace with the update speed of the Chinese vocabulary, the first neural network model must be retrained frequently, and each retraining consumes a large amount of training time. To solve this problem, the following second method may be adopted:
firstly, the first neural network model is trained as a neural network model that converts a Chinese speech signal into pinyin information (in this case, the "speech signal" in step S101 is specifically a "Chinese speech signal", and the "text information" in step S103 is specifically "pinyin information"), and a second neural network model is trained to convert a signal carrying pinyin information into Chinese text information. The second neural network model may be an RNN model (such as a deep bidirectional long short-term memory network, Deep-BiLSTM) or a CNN model (fig. 2 shows a structural schematic diagram of a second neural network model provided by the present application for converting a signal carrying pinyin information into Chinese text information);
secondly, after the first neural network model and the second neural network model have been trained, when a Chinese speech signal Y to be recognized is obtained, the first neural network model is used to obtain a first signal representing the pinyin information of the Chinese speech signal Y; the first signal is then input to the second neural network model to obtain a second signal representing the Chinese text information of the speech signal Y. (As those skilled in the art will readily understand, if the second neural network model is instead trained to convert a signal carrying pinyin information into foreign-language text information, the combination of the first and second neural network models can convert Chinese speech into foreign-language text.)
In the second method, in order to keep pace with the update speed of the Chinese vocabulary while preserving recognition accuracy, the first neural network model does not need to be retrained frequently; only the second neural network model does. For a character-to-character neural network model the training time is relatively short, so in the second method only the second neural network model needs frequent retraining, and its training time is short. The second method therefore solves the technical problem of the first method to a certain extent.
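The cascade of the second method can be sketched as follows; `acoustic_model`, `pinyin_to_text_model` and their call signatures are hypothetical placeholders used only to illustrate the data flow, not interfaces defined by the present application:

```python
# Two-stage recognition sketch: the first neural network model produces the
# first signal (pinyin information), and the second neural network model maps
# it to the second signal (Chinese character or foreign-language text).
def recognize(feature_sequence, acoustic_model, pinyin_to_text_model):
    first_signal = acoustic_model(feature_sequence)        # speech features -> pinyin
    second_signal = pinyin_to_text_model(first_signal)     # pinyin -> text
    return second_signal
```

Under this split, only `pinyin_to_text_model` needs frequent retraining as new vocabulary appears, which is the point of the second method.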
In the first embodiment of the present application, the structure of the first neural network model used for speech recognition is also improved. Introducing an attention mechanism into the first neural network model can improve its speech recognition accuracy to a certain extent. However, when a coding and decoding model is currently used for speech recognition, the attention network is usually inserted between the coding model and the decoding model, and the attention mechanism is not introduced into the internal structure of the coding model or the decoding model themselves. In the present application, the attention mechanism is introduced into the feed-forward layers of both the coding model and the decoding model, which can further improve the speech recognition accuracy to a certain extent.
The following describes a structure of the first neural network model described in the present application in detail with reference to fig. 3, and those skilled in the art should understand that the example shown in fig. 3 is only an example of the first neural network model in the present application and does not constitute a limitation to the structure of the first neural network model.
The first neural network model of the present application includes a coding model and a decoding model. In the coding model, the feed-forward layers and the multi-head attention layers may have the same number of layers, assumed to be N1 (N1 is an integer greater than 0; N1 = 3 in the example shown in fig. 3). In the decoding model, the feed-forward layers and the multi-head attention layers may likewise have the same number of layers, assumed to be N2 (N2 is an integer greater than 0; N2 = 3 in the example shown in fig. 3).
Each feed-forward layer in the coding model is connected to a multi-head attention layer, and the specific connection may be as follows: the input of the i1-th feed-forward layer in the coding model is connected to the output of the i1-th multi-head attention layer in the coding model, with i1 = 1 … N1; when N1 > 1, the output of the j1-th feed-forward layer in the coding model is further connected to the input of the (j1+1)-th multi-head attention layer in the coding model, with j1 = 1 … N1-1. As shown in fig. 3, the coding model connects, in sequence, the first multi-head attention layer (Multi-head attention layer 1), the first feed-forward layer (Feed forward layer 1), the second multi-head attention layer (Multi-head attention layer 2), the second feed-forward layer (Feed forward layer 2), the third multi-head attention layer (Multi-head attention layer 3), and the third feed-forward layer (Feed forward layer 3).
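For illustration, the coding-model wiring described above can be sketched in PyTorch as follows; the framework choice, layer width and head count are assumptions of this sketch, not limitations of the present application:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Coding model of Fig. 3: N1 pairs of (multi-head attention layer, feed-forward layer)."""
    def __init__(self, d_model=256, num_heads=4, num_layers=3):   # N1 = 3 as in Fig. 3
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.ff_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_layers)])

    def forward(self, x):                  # x: (batch, frames, d_model)
        for attn, ff in zip(self.attn_layers, self.ff_layers):
            x, _ = attn(x, x, x)           # output of multi-head attention layer i1 ...
            x = ff(x)                      # ... feeds the input of feed-forward layer i1
        return x
```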
Each feed-forward layer in the decoding model is likewise connected to a multi-head attention layer, and the specific connection may be as follows: the input of the i2-th feed-forward layer in the decoding model is connected to the output of the i2-th multi-head attention layer in the decoding model, with i2 = 1 … N2; when N2 > 1, the output of the j2-th feed-forward layer in the decoding model is further connected to the input of the (j2+1)-th multi-head attention layer in the decoding model, with j2 = 1 … N2-1. As shown in fig. 3, the decoding model connects, in sequence, the first multi-head attention layer, the first feed-forward layer, the second multi-head attention layer, the second feed-forward layer, the third multi-head attention layer, and the third feed-forward layer. In addition, in this application, if N2 > 1, the i3-th feed-forward layer in the decoding model may also correspond to an i3-th masking multi-head attention layer, with i3 = 2 … N2; specifically, the output of the j2-th feed-forward layer in the decoding model may be connected to the input of the (j2+1)-th multi-head attention layer through the (j2+1)-th masking multi-head attention layer in the decoding model. That is, in the example shown in fig. 3, Feed forward layer 1 of the decoding model may be connected to Multi-head attention layer 2 through the second masking multi-head attention layer (Mask multi-head attention layer 2), and Feed forward layer 2 of the decoding model is connected to Multi-head attention layer 3 through the third masking multi-head attention layer.
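A corresponding decoding-model sketch, continuing the PyTorch sketch above, is given below. Feeding the encoder output to the multi-head attention layers as key and value, and using a standard causal mask in the masking multi-head attention layers, are plausible readings of fig. 3 rather than explicit statements of the present application:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoding model of Fig. 3: a masking multi-head attention layer bridges
    feed-forward layer j2 and multi-head attention layer j2+1 (layers 2 ... N2)."""
    def __init__(self, d_model=256, num_heads=4, num_layers=3):   # N2 = 3 as in Fig. 3
        super().__init__()
        make_attn = lambda: nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.mask_attn_layers = nn.ModuleList([make_attn() for _ in range(num_layers)])
        self.attn_layers = nn.ModuleList([make_attn() for _ in range(num_layers)])
        self.ff_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_layers)])

    def forward(self, x, memory):          # memory: output of the coding model
        for i in range(len(self.ff_layers)):
            if i > 0:                      # masking multi-head attention layers 2 ... N2;
                                           # layer 1 is used during training (see Fig. 4)
                causal = torch.triu(torch.ones(x.size(1), x.size(1),
                                               device=x.device), diagonal=1).bool()
                x, _ = self.mask_attn_layers[i](x, x, x, attn_mask=causal)
            x, _ = self.attn_layers[i](x, memory, memory)   # multi-head attention layer i2
            x = self.ff_layers[i](x)                        # feed-forward layer i2
        return x
```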
In addition, the coding model may further include a fully connected (dense) layer and a position-embedding layer, and the decoding model may further include a fully connected (dense) layer and an argmax layer. The dense layer and the position-embedding layer of the coding model receive the feature sequence of the speech signal to be recognized, and the sum of their outputs is input to the first multi-head attention layer of the coding model. The output of the N2-th feed-forward layer of the decoding model is connected to the input of the dense layer of the decoding model, the output of which is connected to the argmax layer; the argmax layer outputs the first signal described in step S103 (see fig. 3). Furthermore, as shown in fig. 3, the output of the N1-th feed-forward layer of the coding model may be connected to the first multi-head attention layer of the decoding model.
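The input and output wiring of fig. 3 can then be assembled as sketched below, reusing the Encoder and Decoder sketches above; the feature dimension, vocabulary size, maximum length, and the simplification of driving the decoder with the encoder output at inference time are assumptions of this sketch:

```python
class SpeechRecognizer(nn.Module):
    """Fig. 3 assembly: dense + position embedding -> encoder -> decoder -> dense -> argmax."""
    def __init__(self, feat_dim=39, d_model=256, vocab_size=1500, max_len=2000):
        super().__init__()
        self.input_dense = nn.Linear(feat_dim, d_model)       # dense layer of the coding model
        self.pos_embedding = nn.Embedding(max_len, d_model)   # position-embedding layer
        self.encoder = Encoder(d_model)
        self.decoder = Decoder(d_model)
        self.output_dense = nn.Linear(d_model, vocab_size)    # dense layer of the decoding model

    def forward(self, features):                              # features: (batch, frames, feat_dim)
        positions = torch.arange(features.size(1), device=features.device)
        x = self.input_dense(features) + self.pos_embedding(positions)   # sum of the two inputs
        memory = self.encoder(x)
        y = self.decoder(memory, memory)
        logits = self.output_dense(y)
        return logits.argmax(dim=-1)                          # argmax layer -> first signal
```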
A method of training the first neural network model shown in fig. 3 is discussed below with reference to fig. 4. It will be appreciated by those skilled in the art that the first neural network model shown in fig. 3 may employ other training methods commonly used in the art, in addition to the training method shown in fig. 4. However, the training method illustrated in FIG. 4 can further improve the speech recognition accuracy of the model illustrated in FIG. 3 compared to conventional training methods.
When the coding and decoding model shown in fig. 3 is trained by the method of fig. 4, the following structure needs to be added to the decoding model: the decoding model further includes a label-embedding layer and a position-embedding layer, which receive the dimension of the label corresponding to each sample speech signal during training (compared with the common practice of inputting only the label, additionally inputting the label dimension during training can further improve the speech recognition accuracy of the trained model). In addition, the first multi-head attention layer of the decoding model corresponds to a first masking multi-head attention layer, which receives the sum of the outputs of the label-embedding layer and the position-embedding layer of the decoding model; the output of this first masking multi-head attention layer is connected to the first multi-head attention layer of the decoding model. As shown in fig. 4, the trained first neural network model is obtained through steps S401 to S404.
In step S401, each sample voice signal and a label corresponding to each sample voice signal are obtained;
if the coding and decoding model (i.e., the first neural network model) shown in fig. 3 is trained as a neural network that converts a Chinese speech signal into pinyin information, each sample speech signal obtained in step S401 should be a Chinese speech signal, and the label corresponding to each sample speech signal should be a signal containing the pinyin information of that sample speech signal.
In step S402, for each sample speech signal, the feature sequence of the sample speech signal is input to the fully connected layer and the position-embedding layer of the coding model in fig. 3, and the dimension of the label corresponding to the sample speech signal is input to the label-embedding layer and the position-embedding layer of the decoding model, so as to obtain a prediction signal corresponding to each sample speech signal;
the training method provided in fig. 4 requires inputting the dimension of the label corresponding to each sample speech signal (for example, if the pinyin information that the label X represents is: wo3 yao4 kan4 dian4 ying3, the dimension of the label X is 5, that is, if the sample speech signal is chinese, the dimension of the label is the number of words that the label represents, and if the sample speech signal is english, the dimension of the label is the number of words that the label represents) into the label embedding layer of the decoding model and the position embedding layer of the decoding model.
The "prediction signal" in step S402 is used to represent the text information of the sample speech signal input to the first neural network model in fig. 3.
In step S403, determining the speech recognition accuracy of the coding/decoding model in fig. 3 according to the label corresponding to each sample speech signal and each prediction signal;
in step S404, continuously adjusting parameters of each layer in the coding/decoding model until the speech recognition accuracy reaches a preset accuracy;
that is, the speech recognition accuracy of the coding and decoding model is determined by comparing each prediction signal against the label corresponding to each sample speech signal, and the parameters of each layer are adjusted continuously until the speech recognition accuracy reaches the preset accuracy.
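Steps S401 to S404 may be sketched as the following training loop. The cross-entropy objective, the optimizer, and the names `dataset`, `model` and `target_accuracy` are assumptions of this sketch, since the present application only specifies that parameters are adjusted until the recognition accuracy reaches a preset value:

```python
import torch

def train(model, dataset, target_accuracy=0.95, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    accuracy = 0.0
    while accuracy < target_accuracy:                        # step S404: adjust until preset accuracy
        correct, total = 0, 0
        for features, labels, label_dim in dataset:          # step S401: sample signals and labels
            # step S402: the training-time forward pass also consumes the label
            # dimension; the exact signature is an assumption of this sketch.
            logits = model(features, labels, label_dim)
            loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            correct += (logits.argmax(dim=-1) == labels).sum().item()
            total += labels.numel()
        accuracy = correct / total                           # step S403: recognition accuracy
    return model
```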
The performance test of the solution provided in the present application is described below with reference to fig. 5. First, the training method shown in fig. 4 is used to obtain a first neural network model, Model1, which converts a Chinese speech signal into pinyin information (the MFCC sequences of the sample speech signals are extracted when training Model1), and the second neural network model structure shown in fig. 2 is trained to obtain Model2, which converts a signal carrying pinyin information into Chinese text information (Model2 may be trained with a conventional training method, which is not described again here). Next, with 4000 hours of test audio data, the error rate of Model1, the error rate of Model2, and the error rate of cascading Model1 and Model2 are measured. The specific data are shown in fig. 5. As can be seen from fig. 5, the error rate of the solution provided by the present application is low.
It should be understood that the sequence numbers of the steps in the foregoing method embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Example two
A second embodiment of the present application provides a speech recognition apparatus, which, for convenience of description, only shows a part related to the present application, and as shown in fig. 6, the speech recognition apparatus 600 includes:
a voice acquiring module 601, configured to acquire a voice signal to be recognized;
a feature extraction module 602, configured to extract features of the voice signal to obtain a feature sequence of the voice signal;
a speech recognition module 603, configured to input the feature sequence to a trained first neural network model, so that the trained first neural network model recognizes the speech signal, and obtains a first signal output by the first neural network model, where the first signal is used to represent text information of the speech signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
Optionally, the voice signal is a chinese voice signal, and the text information is pinyin information of the chinese voice signal;
accordingly, the speech recognition apparatus 600 further includes:
and the second voice recognition module is used for inputting the first signal to a trained second neural network model to obtain a second signal output by the second neural network model, wherein the second signal is used for representing Chinese character information or foreign character information of the voice signal, and the second neural network model is a Recurrent Neural Network (RNN) model or a Convolutional Neural Network (CNN) model.
Optionally, the number of layers of the feedforward layer and the multi-head attention layer in the coding model is N1, the number of layers of the feedforward layer and the multi-head attention layer in the decoding model is N2, and N1 and N2 are integers greater than 0;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the coding model is specifically as follows:
the input end of the feed-forward layer of the i1-th layer in the coding model is connected with the output end of the multi-head attention layer of the i1-th layer in the coding model, and i1 = 1 … N1;
if N1 > 1, the output end of the feed-forward layer of the j1-th layer in the coding model is also connected with the input end of the multi-head attention layer of the (j1+1)-th layer in the coding model, and j1 = 1 … N1-1;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the decoding model specifically comprises the following steps:
the input end of the feed-forward layer of the i2-th layer in the decoding model is connected with the output end of the multi-head attention layer of the i2-th layer in the decoding model, and i2 = 1 … N2;
if N2 > 1, the output end of the feed-forward layer of the j2-th layer in the decoding model is also connected with the input end of the multi-head attention layer of the (j2+1)-th layer in the decoding model, and j2 = 1 … N2-1.
Optionally, if N2 > 1, the i3-th feed-forward layer of the decoding model corresponds to an i3-th masking multi-head attention layer, with i3 = 2 … N2;
correspondingly, the output end of the feed-forward layer at the j2 th layer in the decoding model is further connected to the input end of the multi-head attention layer at the j2+1 th layer in the decoding model, specifically:
the output end of the feed-forward layer of the j2 th layer in the decoding model is connected with the input end of the multi-head attention layer of the j2+1 th layer in the decoding model through the masking multi-head attention layer of the j2+1 th layer in the decoding model.
Optionally, the coding model and the decoding model both include a fully connected layer, the coding model further includes a position embedding layer, and the decoding model further includes an argmax layer;
the fully connected layer of the coding model and the position embedding layer of the coding model are used for receiving the feature sequence of the speech signal, the sum of the output signals of the fully connected layer of the coding model and the position embedding layer of the coding model is input into the first multi-head attention layer in the coding model, and the output end of the N1-th feed-forward layer of the coding model is connected to the first multi-head attention layer of the decoding model;
the output end of the N2-th feed-forward layer of the decoding model is connected to the input end of the fully connected layer of the decoding model, the output end of the fully connected layer of the decoding model is connected to the argmax layer, and the output of the argmax layer is the first signal.
Optionally, the decoding model further includes a label embedding layer and a position embedding layer, wherein the label embedding layer and the position embedding layer in the decoding model are configured to receive the dimension of the label corresponding to a sample speech signal when the coding and decoding model is trained, the first multi-head attention layer of the decoding model corresponds to a first masking multi-head attention layer, the first masking multi-head attention layer is configured to receive the sum of the outputs of the label embedding layer and the position embedding layer in the decoding model, and the output end of the first masking multi-head attention layer is connected to the first multi-head attention layer of the decoding model;
accordingly, the coding and decoding model is trained through the following modules:
the sample acquisition module is used for acquiring each sample voice signal and a label corresponding to each sample voice signal;
a prediction signal obtaining module, configured to, for each sample speech signal, input the feature sequence of the sample speech signal to the fully connected layer in the coding model and the position embedding layer in the coding model, and input the dimension of the label corresponding to the sample speech signal to the label embedding layer and the position embedding layer in the decoding model, so as to obtain a prediction signal corresponding to each sample speech signal;
the accuracy determining module is used for determining the speech recognition accuracy of the coding and decoding model according to the labels corresponding to the sample speech signals and the prediction signals;
and the parameter adjusting module is used for continuously adjusting the parameters in each layer in the coding and decoding model until the speech recognition accuracy reaches the preset accuracy.
It should be noted that, because the contents of information interaction, execution process, and the like between the above-mentioned apparatuses/units are based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof can be referred to specifically in the method embodiment section, and are not described herein again.
Example three
Fig. 7 is a schematic diagram of a terminal device according to the third embodiment of the present application. As shown in fig. 7, the terminal device of this embodiment includes: a processor 70, a memory 71, and a computer program 72 stored in the memory 71 and executable on the processor 70. The processor 70, when executing the computer program 72, implements the steps of the above speech recognition method embodiment, such as steps S101 to S103 shown in fig. 1. Alternatively, the processor 70, when executing the computer program 72, implements the functions of the modules/units in the above device embodiments, for example, the functions of the modules 601 to 603 shown in fig. 6.
Illustratively, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into a speech acquisition module, a feature extraction module, and a speech recognition module, and each module specifically functions as follows:
acquiring a voice signal to be recognized;
extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
inputting the characteristic sequence into a trained first neural network model, so that the trained first neural network model recognizes the voice signal to obtain a first signal output by the first neural network model, wherein the first signal is used for representing text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
a multi-head attention layer is connected to each feed-forward layer in the coding model, and a multi-head attention layer is connected to each feed-forward layer in the decoding model.
The terminal device may include, but is not limited to, a processor 70 and a memory 71. It will be appreciated by those skilled in the art that fig. 7 is merely an example of a terminal device 7, and does not constitute a limitation of the terminal device 7, and may include more or less components than those shown, or combine some components, or different components, for example, the terminal device may also include input and output devices, network access devices, buses, etc.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store the computer program and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned functions may be distributed as different functional units and modules according to needs, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the above modules or units is only one logical function division, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units described above, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and used by a processor to implement the steps of the embodiments of the methods described above. The computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the above-mentioned computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signal, telecommunication signal, software distribution medium, etc. It should be noted that the computer readable medium described above may include content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media that does not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring a voice signal to be recognized;
extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
inputting the characteristic sequence into a trained first neural network model, so that the trained first neural network model recognizes the voice signal to obtain a first signal output by the first neural network model, wherein the first signal is used for representing text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
and each feed-forward layer in the coding model is connected with a multi-head attention layer, and each feed-forward layer in the decoding model is also connected with a multi-head attention layer.
2. The speech recognition method of claim 1, wherein the speech signal is a Chinese speech signal, and the text information is pinyin information of the speech signal;
correspondingly, after the step of inputting the feature sequence to the trained first neural network model so that the trained first neural network model recognizes the speech signal, obtaining the first signal output by the first neural network model, the speech recognition method further includes:
and inputting the first signal into a trained second neural network model to obtain a second signal output by the second neural network model, wherein the second signal is used for representing Chinese character information or foreign character information of the voice signal, and the second neural network model is a Recurrent Neural Network (RNN) model or a Convolutional Neural Network (CNN) model.
3. The speech recognition method according to claim 1 or 2, wherein the number of layers of the feedforward layer and the multi-head attention layer in the coding model is N1, the number of layers of the feedforward layer and the multi-head attention layer in the decoding model is N2, and each of the N1 and the N2 is an integer greater than 0;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the coding model is specifically as follows:
the input end of the feed-forward layer of the i1-th layer in the coding model is connected with the output end of the multi-head attention layer of the i1-th layer in the coding model, and i1 = 1 … N1;
if N1 > 1, the output end of the feed-forward layer of the j1-th layer in the coding model is also connected with the input end of the multi-head attention layer of the (j1+1)-th layer in the coding model, and j1 = 1 … N1-1;
correspondingly, the connection mode of the feedforward layer and the multi-head attention layer in the decoding model specifically comprises the following steps:
the input end of the feed-forward layer of the i2-th layer in the decoding model is connected with the output end of the multi-head attention layer of the i2-th layer in the decoding model, and i2 = 1 … N2;
if N2 > 1, the output end of the feed-forward layer of the j2-th layer in the decoding model is also connected with the input end of the multi-head attention layer of the (j2+1)-th layer in the decoding model, and j2 = 1 … N2-1.
4. The speech recognition method of claim 3, wherein if N2 > 1, the i3-th feed-forward layer of the decoding model corresponds to an i3-th masking multi-head attention layer, with i3 = 2 … N2;
correspondingly, the output end of the feed-forward layer at the j2 th layer in the decoding model is further connected to the input end of the multi-head attention layer at the j2+1 th layer in the decoding model, specifically:
the output end of the feed-forward layer of the j2 th layer in the decoding model is connected with the input end of the multi-head attention layer of the j2+1 th layer in the decoding model through the masking multi-head attention layer of the j2+1 th layer in the decoding model.
5. The speech recognition method of claim 4, wherein the coding model and the decoding model each comprise a fully connected (dense) layer, the coding model further comprises a position embedding layer, and the decoding model further comprises an argmax layer;
the fully connected layer of the coding model and the position embedding layer of the coding model are used for receiving the feature sequence of the speech signal, the sum of the output signals of the fully connected layer of the coding model and the position embedding layer of the coding model is input into the first multi-head attention layer in the coding model, and the output end of the N1-th feed-forward layer of the coding model is connected to the first multi-head attention layer of the decoding model;
the output end of the N2-th feed-forward layer of the decoding model is connected to the input end of the fully connected layer of the decoding model, the output end of the fully connected layer of the decoding model is connected to the argmax layer, and the output of the argmax layer is the first signal.
6. The speech recognition method according to claim 5, wherein the decoding model further includes a label embedding layer and a position embedding layer, wherein the label embedding layer and the position embedding layer in the decoding model are used for receiving the dimension of the label corresponding to a sample speech signal when the coding and decoding model is trained, the first multi-head attention layer of the decoding model corresponds to a first masking multi-head attention layer, the first masking multi-head attention layer is used for receiving the sum of the outputs of the label embedding layer and the position embedding layer in the decoding model, and the output end of the first masking multi-head attention layer is connected to the first multi-head attention layer of the decoding model;
accordingly, the training process of the coding and decoding model is as follows:
obtaining each sample voice signal and a label corresponding to each sample voice signal;
for each sample voice signal, inputting the feature sequence of the sample voice signal into the fully connected layer in the coding model and the position embedding layer in the coding model, and inputting the dimension of the label corresponding to the sample voice signal into the label embedding layer and the position embedding layer in the decoding model, to obtain a prediction signal corresponding to each sample voice signal;
determining the speech recognition accuracy of the coding and decoding model according to the label corresponding to each sample speech signal and each prediction signal;
and continuously adjusting parameters in each layer in the coding and decoding model until the speech recognition accuracy reaches preset accuracy.
7. A speech recognition apparatus, comprising:
the voice acquisition module is used for acquiring a voice signal to be recognized;
the characteristic extraction module is used for extracting the characteristics of the voice signal to obtain a characteristic sequence of the voice signal;
a voice recognition module, configured to input the feature sequence to a trained first neural network model, so that the trained first neural network model recognizes the voice signal, and a first signal output by the first neural network model is obtained, where the first signal is used to represent text information of the voice signal;
the first neural network model is a coding and decoding model based on an attention mechanism, the coding and decoding model comprises a coding model and a decoding model, and the coding model and the decoding model both comprise a multi-head attention layer;
and each feed-forward layer in the coding model is connected with a multi-head attention layer, and each feed-forward layer in the decoding model is also connected with a multi-head attention layer.
8. The speech recognition apparatus of claim 7, wherein the speech signal is a Chinese speech signal, and the text information is pinyin information of the speech signal;
accordingly, the speech recognition apparatus further includes:
and the second voice recognition module is used for inputting the first signal to a trained second neural network model to obtain a second signal output by the second neural network model, wherein the second signal is used for representing Chinese character information or foreign character information of the voice signal, and the second neural network model is a Recurrent Neural Network (RNN) model or a Convolutional Neural Network (CNN) model.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 6.
CN201910407591.7A 2019-05-16 2019-05-16 Voice recognition method, voice recognition device and terminal equipment Pending CN112037776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910407591.7A CN112037776A (en) 2019-05-16 2019-05-16 Voice recognition method, voice recognition device and terminal equipment

Publications (1)

Publication Number Publication Date
CN112037776A (en) 2020-12-04

Family

ID=73575741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910407591.7A Pending CN112037776A (en) 2019-05-16 2019-05-16 Voice recognition method, voice recognition device and terminal equipment

Country Status (1)

Country Link
CN (1) CN112037776A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110218796A1 (en) * 2010-03-05 2011-09-08 Microsoft Corporation Transliteration using indicator and hybrid generative features
CN103578465A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Speech recognition method and electronic device
US20170169830A1 (en) * 2015-12-11 2017-06-15 Electronics And Telecommunications Research Institute Method and apparatus for inserting data to audio signal or extracting data from audio signal based on time domain
WO2018217948A1 (en) * 2017-05-23 2018-11-29 Google Llc Attention-based sequence transduction neural networks
CN109697974A (en) * 2017-10-19 2019-04-30 百度(美国)有限责任公司 Use the system and method for the neural text-to-speech that convolution sequence learns
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108491514A (en) * 2018-03-26 2018-09-04 清华大学 The method and device putd question in conversational system, electronic equipment, computer-readable medium
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109344413A (en) * 2018-10-16 2019-02-15 北京百度网讯科技有限公司 Translation processing method and device
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN109492232A (en) * 2018-10-22 2019-03-19 内蒙古工业大学 A kind of illiteracy Chinese machine translation method of the enhancing semantic feature information based on Transformer
CN109558576A (en) * 2018-11-05 2019-04-02 中山大学 A kind of punctuation mark prediction technique based on from attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴邦誉 (WU Bangyu): "A Chinese Dialogue Model Using Pinyin-Based Dimensionality Reduction" (采用拼音降维的中文对话模型), 《中文信息学报》 (Journal of Chinese Information Processing), vol. 33, no. 5, pages 113-121 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951218A (en) * 2021-03-22 2021-06-11 百果园技术(新加坡)有限公司 Voice processing method and device based on neural network model and electronic equipment
CN113129869A (en) * 2021-03-22 2021-07-16 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN113129869B (en) * 2021-03-22 2022-01-28 北京百度网讯科技有限公司 Method and device for training and recognizing voice recognition model
CN112951218B (en) * 2021-03-22 2024-03-29 百果园技术(新加坡)有限公司 Voice processing method and device based on neural network model and electronic equipment
CN113096621A (en) * 2021-03-26 2021-07-09 平安科技(深圳)有限公司 Music generation method, device and equipment based on specific style and storage medium
CN113096621B (en) * 2021-03-26 2024-05-28 平安科技(深圳)有限公司 Music generation method, device, equipment and storage medium based on specific style
CN113611289A (en) * 2021-08-06 2021-11-05 上海汽车集团股份有限公司 Voice recognition method and device
CN113611289B (en) * 2021-08-06 2024-06-18 上海汽车集团股份有限公司 Voice recognition method and device
CN114993677A (en) * 2022-05-11 2022-09-02 山东大学 Rolling bearing fault diagnosis method and system based on unbalanced small sample data

Similar Documents

Publication Publication Date Title
CN112037776A (en) Voice recognition method, voice recognition device and terminal equipment
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
WO2020253060A1 (en) Speech recognition method, model training method, apparatus and device, and storage medium
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN112183120A (en) Speech translation method, device, equipment and storage medium
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
CN112633947B (en) Text generation model generation method, text generation method, device and equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112259089A (en) Voice recognition method and device
CN112397051A (en) Voice recognition method and device and terminal equipment
WO2022156434A1 (en) Method and apparatus for generating text
US20210004603A1 (en) Method and apparatus for determining (raw) video materials for news
CN111667810A (en) Method and device for acquiring polyphone corpus, readable medium and electronic equipment
CN111862961A (en) Method and device for recognizing voice
CN111241853A (en) Session translation method, device, storage medium and terminal equipment
CN109949814A (en) Audio recognition method, system, computer system and computer readable storage medium
CN113468344B (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN111538817A (en) Man-machine interaction method and device
CN113053362A (en) Method, device, equipment and computer readable medium for speech recognition
US11238865B2 (en) Function performance based on input intonation
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN116825084A (en) Cross-language speech synthesis method and device, electronic equipment and storage medium
CN115101075A (en) Voice recognition method and related device
CN110728137B (en) Method and device for word segmentation
CN114429629A (en) Image processing method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination