WO2020057624A1 - Voice recognition method and apparatus - Google Patents

Voice recognition method and apparatus

Info

Publication number
WO2020057624A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
data
model
text
feature code
Application number
PCT/CN2019/106909
Other languages
French (fr)
Chinese (zh)
Inventor
郝婧
陈凯
谢迪
浦世亮
Original Assignee
杭州海康威视数字技术股份有限公司
Application filed by 杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Publication of WO2020057624A1


Classifications

    • G PHYSICS → G10 MUSICAL INSTRUMENTS; ACOUSTICS → G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition → G10L15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice → G10L15/063 Training
    • G10L15/00 Speech recognition → G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/00 Speech recognition → G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification → G10L17/04 Training, enrolment or model building
    • G10L17/00 Speaker identification or verification → G10L17/18 Artificial neural networks; connectionist approaches

Definitions

  • the present invention relates to the technical field of speech recognition, and in particular, to a method and a device for speech recognition.
  • the related technology of speech recognition processing mainly uses a speech recognition model to directly convert speech data into text data, which can be obtained through training and learning.
  • embodiments of the present application provide a method and a device for speech recognition.
  • an embodiment of the present application provides a method for speech recognition, where the method includes:
  • acquiring voice data to be recognized;
  • determining phoneme data corresponding to the voice data;
  • inputting the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.
  • the method further includes:
  • obtaining sample phoneme data and corresponding sample text data; using the sample phoneme data as sample input data and the sample text data as sample output data, the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining phoneme data corresponding to the voice data includes:
  • based on a pre-trained speech acoustic model, phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the text corresponding to each phoneme unit obtained through the spatial search model is sorted according to the order of the corresponding phoneme units in the phoneme data and combined to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network (CNN), and the decoder model is also a CNN.
  • an embodiment of the present application provides a device for voice recognition, where the device includes:
  • an acquisition module configured to acquire voice data to be recognized;
  • a determining module configured to determine phoneme data corresponding to the voice data;
  • a conversion module is configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the speech data.
  • the device further includes a training module for:
  • obtaining sample phoneme data and corresponding sample text data; using the sample phoneme data as sample input data and the sample text data as sample output data, the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining module is configured to:
  • based on a pre-trained speech acoustic model, phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the conversion module is configured to:
  • the text corresponding to each phoneme unit obtained through the spatial search model is sorted according to the order of the corresponding phoneme units in the phoneme data and combined to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network (CNN), and the decoder model is also a CNN.
  • an embodiment of the present application provides a terminal.
  • the terminal includes a processor and a memory.
  • the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for speech recognition described in the first aspect above.
  • an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the method described in the first aspect above.
  • the speech data recognition process is divided into two parts. First, the speech data is converted into phoneme data, and then the phoneme text conversion model is used to convert the phoneme data into text data. Compared with directly converting speech data to text data, this conversion method reduces the span of data conversion. The conversion of speech data to phoneme data and the conversion of phoneme data to text data have higher accuracy. Therefore, this scheme has a higher accuracy for speech data recognition.
  • FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of a voice recognition method according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
  • An embodiment of the present application provides a method for speech recognition.
  • the method may be implemented by a computer device, and the computer device may be a terminal or a server.
  • the method may be implemented by a terminal, and the terminal may be a device with an audio collection function, such as a mobile phone, a computer, an air conditioner, or a television.
  • the terminal may directly perform voice recognition based on the method.
  • the method can also be implemented by a server, that is, the collected voice data is sent to the server at a terminal with an audio collection function, and the server performs voice recognition based on the method.
  • the following exemplifies a scenario to which the method can be applied.
  • the device implementing the method in this scenario can be a mobile phone with an audio collection function, or a server.
  • An instant messaging application can be installed in a mobile phone with audio capture function, and a voice recognition option can be provided in the dialogue interface of the instant messaging application.
  • after the user selects the voice recognition option, the user can speak the sentence to be sent, and the mobile phone can use the speech recognition method provided in the embodiments of the present application to convert the sentence into text and display it in the text input box of the dialogue interface.
  • the mobile phone can also send the sentence spoken by the user to the server; the server converts the sentence into text through the speech recognition method provided in the embodiments of the present application and returns the text to the mobile phone, which displays it in the text input box of the dialogue interface.
  • the processing flow of the method may include the following steps:
  • in step 101, voice data to be recognized is acquired.
  • an audio collection device for collecting voice data, such as a microphone, may be installed on the terminal.
  • when the user wants to send text to others through the terminal, the user can first turn on the terminal's voice recognition function and then speak the voice corresponding to the text to be sent toward the terminal's audio collection device; the terminal can then obtain the corresponding voice data through the audio collection device.
  • for example, if a user wants to send the text "What are you doing" to a friend in the terminal's instant messaging application, the user can turn on the voice recognition function and say "What are you doing" to the audio collection device; the terminal then obtains the voice data corresponding to "What are you doing".
  • alternatively, when the user wants to control the terminal by voice, the user can first enable the terminal's voice recognition function and then speak the corresponding control phrase to the terminal. For example, to have the instant messaging application on a mobile phone send a text message to Zhang San, the user can say "Open the instant messaging application and send to Zhang San: it is raining today"; or, to control the temperature of an air conditioner, the user can say "Adjust the temperature to 25 degrees" to the air conditioner.
  • the terminal collects the corresponding voice data through the audio collection device from the moment the user starts speaking until the detected volume falls below a preset threshold; the voice data obtained in this way is the voice data to be recognized. When the user speaks for longer, the terminal can instead obtain multiple segments of voice data of a preset duration from the moment the user starts speaking, and each segment can be treated as voice data to be recognized.
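A minimal sketch of this endpointing logic, assuming 16 kHz mono PCM input and an illustrative RMS threshold; the patent only states that recording ends when the detected volume falls below a preset threshold, so all concrete values here are assumptions:

```python
import numpy as np

def capture_utterance(samples, rate=16000, frame_ms=30, threshold=0.01):
    """Keep frames from speech onset until volume drops below the threshold.

    `threshold` is an illustrative RMS level, not a value from the patent.
    """
    frame = int(rate * frame_ms / 1000)
    voiced, started = [], False
    for i in range(0, len(samples) - frame + 1, frame):
        rms = float(np.sqrt(np.mean(samples[i:i + frame] ** 2)))
        if rms >= threshold:
            started = True
            voiced.append(samples[i:i + frame])
        elif started:
            break  # volume fell below the preset threshold: utterance ended
    return np.concatenate(voiced) if voiced else np.zeros(0)
```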
  • in step 102, phoneme data corresponding to the speech data to be recognized is determined.
  • the phoneme data is data composed of identifiers used to represent pronunciation.
  • for example, for Chinese, the phoneme data is the pinyin corresponding to the Chinese characters.
  • the phoneme data can include one or more phoneme units, each phoneme unit corresponds to a word, and each phoneme unit can be composed of one or more pronunciation identifiers.
  • for Chinese, the pronunciation identifiers are the initials and finals of pinyin; for example, the phoneme unit corresponding to the character 我 ("I") is wǒ. As another example, the phoneme unit corresponding to か in Japanese is "ka".
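For illustration, the phoneme data for a short Mandarin utterance might be represented as an ordered list of phoneme units, one per character, each composed of its pronunciation identifiers. This is a hypothetical representation; the patent does not fix a concrete data format:

```python
# Hypothetical phoneme data for 你在干什么 ("What are you doing"):
# one phoneme unit per character; each unit is made of pronunciation
# identifiers (a pinyin initial plus a final, tone attached to the final).
phoneme_data = [
    {"unit": "ni3",   "identifiers": ["n", "i3"]},
    {"unit": "zai4",  "identifiers": ["z", "ai4"]},
    {"unit": "gan4",  "identifiers": ["g", "an4"]},
    {"unit": "shen2", "identifiers": ["sh", "en2"]},
    {"unit": "me5",   "identifiers": ["m", "e5"]},
]
```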
  • optionally, in order to convert the speech data into phoneme data with higher accuracy and efficiency, a machine-trained model may be used for the conversion. Accordingly, the processing in step 102 may be as follows: based on a pre-trained speech acoustic model, the phoneme data corresponding to the speech data is determined.
  • the speech acoustic model is a model constructed based on CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network).
  • the voice acoustic model should be trained in advance.
  • a technician can obtain the voice data and its corresponding phoneme data from an existing database, or obtain the voice data from the Internet, and then obtain the phoneme data manually based on the voice data.
  • the acquired voice data serves as sample voice data and is used as sample input data; the phoneme data corresponding to the voice data serves as sample phoneme data and is used as sample output data.
  • one piece of sample voice data and its corresponding sample phoneme data form one training sample, and the initial speech acoustic model is trained on a large number of such training samples to obtain the required speech acoustic model. Because the amount of sample data is large and training places high demands on a device's computing and storage performance, the training can be performed in a server.
  • the process of determining the phoneme data corresponding to the voice data may be performed in the terminal.
  • the CNN performs feature extraction on the voice data to obtain the feature vector corresponding to the voice data.
  • the feature vector is then processed by the RNN to obtain the corresponding phoneme data.
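A minimal sketch of such a CNN-plus-RNN acoustic model, assuming PyTorch; the layer sizes, the choice of a GRU, and the per-frame phoneme logits are illustrative assumptions, not details from the patent:

```python
import torch.nn as nn

class SpeechAcousticModel(nn.Module):
    """CNN front end extracts features; RNN maps them to phoneme scores."""

    def __init__(self, n_mels=80, hidden=256, n_phonemes=410):
        super().__init__()
        # CNN: feature extraction over the input spectrogram frames.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # RNN: turns the feature-vector sequence into phoneme predictions.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, mels):           # mels: (batch, n_mels, frames)
        feats = self.cnn(mels)         # (batch, hidden, frames)
        seq, _ = self.rnn(feats.transpose(1, 2))
        return self.out(seq)           # per-frame phoneme logits
```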
  • in step 103, the phoneme data is input into a pre-trained phoneme text conversion model to obtain text data corresponding to the speech data.
  • the phoneme text conversion model is a machine training model.
  • FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model in the embodiment of the present application.
  • the phoneme data that the terminal obtains by converting the speech data through the speech acoustic model is input into the pre-trained phoneme text conversion model to obtain the text data corresponding to the phoneme data.
  • the terminal can display the obtained text data.
  • for example, in an instant messaging application, the user can input text through voice; after the terminal displays the text data, the user can also edit the displayed text data.
  • the terminal can perform corresponding operations based on the text data.
  • for example, in a smartphone's voice assistant, the user can speak the operation instruction "Call Li Si"; the voice assistant then displays "Call Li Si" on the interface and performs the operation.
  • before the phoneme text conversion model is used for conversion, it may be trained in advance. Accordingly, the processing may be as follows: obtain sample phoneme data and corresponding sample text data; use the sample phoneme data as sample input data and the sample text data as sample output data to train an initial phoneme text conversion model, obtaining the phoneme text conversion model.
  • because nearly six million training samples are used to train the initial phoneme text conversion model, this training process can be performed in the server.
  • a technician can first obtain the text data from the Internet or an existing database as the sample text data.
  • for each piece of sample text data obtained, the corresponding phoneme data can be obtained by querying a pronunciation dictionary and used as the sample phoneme data.
  • the sample phoneme data is used as sample input data, and the sample text data corresponding to the sample phoneme data is used as sample output data to form a training sample.
  • This solution can use the back-propagation algorithm as a preset training algorithm to train the initial phonetic text conversion model.
  • the sample input data is input into the initial phoneme text conversion model to obtain output data; the server then determines the adjustment value of each parameter to be adjusted in the model based on the output data, the sample output data, and the preset training algorithm, and adjusts the corresponding parameters. Each training sample is processed according to this procedure to obtain the final phoneme text conversion model.
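A condensed sketch of this training loop, assuming PyTorch, a hypothetical `PhonemeTextModel` that takes phoneme IDs and teacher-forced character IDs, and cross-entropy loss; the patent specifies only back-propagation as the preset training algorithm, so the Adam optimizer is an illustrative choice:

```python
import torch
import torch.nn as nn

def train_phoneme_text_model(model, samples, epochs=10, lr=1e-3):
    """samples: iterable of (phoneme_ids, char_ids) LongTensor pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for phoneme_ids, char_ids in samples:
            logits = model(phoneme_ids, char_ids)   # sample input data in
            loss = loss_fn(logits.flatten(0, 1),    # output vs. sample
                           char_ids.flatten())      # output data
            opt.zero_grad()
            loss.backward()                         # back-propagation
            opt.step()                              # adjust parameters
    return model
```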
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model.
  • the processing in step 103 may be as follows:
  • Step 1031: input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data.
  • Step 1032: input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
  • Step 1033: input the second feature code into the decoder model to obtain the feature code of the text corresponding to the first phoneme unit in the phoneme data, and set the text sequence number i corresponding to the phoneme data to 1.
  • Step 1034: input the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fusion feature code of the text corresponding to the i-th phoneme unit.
  • Step 1035: input the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit.
  • Step 1036: if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increase the value of i by 1, and return to step 1034.
  • Step 1037: if the i-th phoneme unit is the last phoneme unit in the phoneme data, sort the text corresponding to each phoneme unit obtained through the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combine them to obtain the text data corresponding to the voice data.
  • the above-mentioned encoder model and decoder model can both adopt CNN.
  • the phoneme data obtained from the above-mentioned speech acoustic model is coded in the form of One-Hot coding, and the corresponding input sequence is obtained.
  • the input sequence is then input into the encoder model.
  • in the encoder model, the embedding operation maps the input sequence to a unified dimension, so that the relationships between the elements of the input sequence are represented more effectively.
  • in the encoder model, residual connections are used between the convolutional layers of the CNN, so a linear mapping must be performed before the encoder model's output to change the vector dimension.
  • the encoder model then outputs a first feature code corresponding to the phoneme data, which may be in the form of a feature vector.
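A sketch of such an encoder block, again assuming PyTorch; the number of layers and the dimensions are illustrative, and `vocab` stands for an assumed phoneme-vocabulary size:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Embedding -> residual 1-D convolutions -> linear mapping."""

    def __init__(self, vocab=1000, embed=256, out_dim=256):
        super().__init__()
        # Embedding maps the one-hot input sequence to a unified dimension.
        self.embed = nn.Embedding(vocab, embed)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed, embed, kernel_size=3, padding=1) for _ in range(4)]
        )
        # Linear mapping changes the vector dimension before the output.
        self.proj = nn.Linear(embed, out_dim)

    def forward(self, phoneme_ids):           # (batch, seq_len) integer ids
        x = self.embed(phoneme_ids)           # (batch, seq_len, embed)
        h = x.transpose(1, 2)                 # (batch, embed, seq_len)
        for conv in self.convs:
            h = h + torch.relu(conv(h))       # residual connection per layer
        return self.proj(h.transpose(1, 2))   # the first feature code
```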
  • the first feature code is input into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
  • the second feature code may also be in the form of a feature vector.
  • the second feature code is input into the decoder model to obtain the feature code of the text corresponding to the first phoneme unit in the phoneme data; this feature code may also be in the form of a feature vector.
  • the first feature code obtained from the encoder model and the feature code of the text corresponding to the first phoneme unit of the phoneme data are input into the attention mechanism model to obtain the fusion feature code of the text corresponding to the first phoneme unit.
  • the fusion feature code of the text corresponding to the first phoneme unit is then input into the spatial search model to obtain the text corresponding to the first phoneme unit.
  • next, the second feature code and the text corresponding to the first phoneme unit are input into the decoder model to obtain the feature code of the text corresponding to the second phoneme unit.
  • like the encoder, the decoder also performs the embedding operation, uses residual connections between its convolutional layers, and performs a linear mapping operation before its output.
  • the first feature code obtained from the encoder model and the feature code of the text corresponding to the second phoneme unit obtained from the decoder model are input into the attention mechanism model to obtain the fusion feature code of the text corresponding to the second phoneme unit.
  • the fusion feature code of the text corresponding to the second phoneme unit is input into the spatial search model, and the text corresponding to the second phoneme unit can be obtained.
  • the same operations are performed for the subsequent third phoneme unit, fourth phoneme unit, and so on; details are not repeated here.
  • This loop operation process is performed until the text corresponding to the last phoneme unit of the phoneme data is output.
  • the obtained characters can be sorted according to the order of their respective phoneme units in the phoneme data, and then combined together to obtain the text data corresponding to the phoneme data.
  • it should be noted that, during training, the text predicted by the phoneme text conversion model for the previous phoneme unit is not input to the decoder model; instead, the correct text corresponding to the previous phoneme unit is input to the decoder model.
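Putting steps 1031 through 1037 together, the inference loop might be sketched as follows; `encoder`, `decoder`, `attention`, and `search` are hypothetical callables standing in for the four sub-models (with greedy selection standing in for the spatial search), not interfaces defined by the patent:

```python
import torch

def phonemes_to_text(phoneme_ids, encoder, decoder, attention, search, bos_id=0):
    """Steps 1031-1037: encode, attend, decode unit by unit, search."""
    first_code = encoder(phoneme_ids)                # step 1031
    second_code = attention(first_code, first_code)  # step 1032
    prev_text = torch.tensor([[bos_id]])
    text_code = decoder(prev_text, second_code)      # step 1033
    chars = []
    for i in range(phoneme_ids.size(1)):             # one character per unit
        fused = attention(first_code, text_code)     # step 1034
        char = search(fused)                         # step 1035
        chars.append(char)
        if i + 1 < phoneme_ids.size(1):              # step 1036
            prev_text = torch.cat([prev_text, char.view(1, 1)], dim=1)
            text_code = decoder(prev_text, second_code)
    return torch.stack(chars)                        # step 1037: combine
```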
  • the speech data recognition process is divided into two parts.
  • the speech data is first converted into phoneme data, and then the phoneme data is converted into text data by using a phoneme text conversion model.
  • this conversion method reduces the span of data conversion.
  • the conversion of speech data to phoneme data and the conversion of phoneme data to text data have higher accuracy. Therefore, this scheme has a higher accuracy for speech data recognition.
  • the speech recognition method described in this embodiment is implemented by a convolution-based sequence learning model, and the conversion speed of speech data to text data is also improved.
  • an embodiment of the present application further provides a device for speech recognition.
  • the device may be a terminal in the foregoing embodiment.
  • the device includes an acquisition module 401, a determining module 402, and a conversion module 403.
  • An acquisition module 401 configured to acquire voice data to be identified
  • a determining module 402 configured to determine phoneme data corresponding to the voice data
  • a conversion module 403 is configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.
  • the device further includes a training module 404, configured to:
  • obtaining sample phoneme data and corresponding sample text data; using the sample phoneme data as sample input data and the sample text data as sample output data, the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining module 402 is configured to:
  • based on a pre-trained speech acoustic model, phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the conversion module 403 is configured to:
  • the obtained text corresponding to each phoneme unit is combined according to the order of the corresponding phoneme units in the phoneme data to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network (CNN), and the decoder model is also a CNN.
  • it should be noted that, when the speech recognition device provided by the above embodiment performs speech recognition, the division into the above functional modules is only used as an example. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the speech recognition apparatus and the speech recognition method embodiments provided by the foregoing embodiments belong to the same concept. For specific implementation processes, refer to the method embodiments, and details are not described herein again.
  • FIG. 5 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • the terminal 500 may be a portable mobile terminal, such as a smart phone or a tablet computer.
  • the terminal 500 may also be called other names such as user equipment, portable terminal, and the like.
  • the terminal 500 includes a processor 501 and a memory 502.
  • the processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 501 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 501 may also include a main processor and a co-processor.
  • the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the co-processor is a low-power processor for processing data in the standby state.
  • the processor 501 may be integrated with a GPU (Graphics Processing Unit), and the GPU is responsible for rendering and drawing content required to be displayed on the display screen.
  • the processor 501 may further include an AI (Artificial Intelligence) processor, and the AI processor is configured to process computing operations related to machine learning.
  • the memory 502 may include one or more computer-readable storage media, which may be tangible and non-transitory.
  • the memory 502 may also include high-speed random access memory, and non-volatile memory, such as one or more disk storage devices, flash storage devices.
  • the non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction, which is executed by the processor 501 to implement the speech recognition method provided in this application.
  • the terminal 500 may further include a peripheral device interface 503 and at least one peripheral device.
  • the peripheral device includes at least one of a radio frequency circuit 504, a touch display screen 505, an audio circuit 506, and a power source 507.
  • the peripheral device interface 503 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 501 and the memory 502.
  • in some embodiments, the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral device interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 504 is used for receiving and transmitting an RF (Radio Frequency) signal, also called an electromagnetic signal.
  • the radio frequency circuit 504 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 504 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.
  • the radio frequency circuit 504 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocols include, but are not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 504 may further include circuits related to Near Field Communication (NFC), which is not limited in this application.
  • the touch display screen 505 is used to display a UI (User Interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the touch display screen 505 also has the ability to collect touch signals on or above the surface of the touch display screen 505.
  • the touch signal can be input to the processor 501 as a control signal for processing.
  • the touch display screen 505 is used to provide virtual buttons and / or virtual keyboards, which are also called soft buttons and / or soft keyboards.
  • in some embodiments, there may be one touch display screen 505, provided on the front panel of the terminal 500; in other embodiments, there may be at least two touch display screens 505, respectively provided on different surfaces of the terminal 500 or in a folded design.
  • the touch display screen 505 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 500. Furthermore, the touch display screen 505 can also be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the touch display screen 505 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the audio circuit 506 is used to provide an audio interface between the user and the terminal 500.
  • the audio circuit 506 may include a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 501 for processing, or input them to the radio frequency circuit 504 to implement voice communication.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves.
  • the speaker can be a traditional film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but can also convert electrical signals into sound waves inaudible to humans for purposes such as ranging.
  • the audio circuit 506 may further include a headphone jack.
  • the power supply 507 is used to supply power to various components in the terminal 500.
  • the power source 507 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
  • the wired rechargeable battery is a battery charged through a wired line
  • the wireless rechargeable battery is a battery charged through a wireless coil.
  • the rechargeable battery can also be used to support fast charging technology.
  • those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the terminal 500, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • in an exemplary embodiment, a computer-readable storage medium is further provided; the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the speech recognition method in the foregoing embodiments.
  • the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
  • FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 600 may vary greatly due to differences in configuration or performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where at least one instruction is stored in the memory 602, and the at least one instruction is loaded and executed by the processor 601 to implement the speech recognition method described above.
  • a person of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, or an optical disk.

Abstract

Provided are a voice recognition method and apparatus, which belong to the field of voice recognition. The method comprises: acquiring voice data to be recognized (101); determining phoneme data corresponding to the voice data (102); and inputting the phoneme data into a pre-trained phoneme text conversion model, so as to obtain text data corresponding to the voice data (103). The method can improve the accuracy of recognizing voice data.

Description

Method and device for speech recognition

This application claims priority to Chinese Patent Application No. 201811099967.4, filed on September 20, 2018 and entitled "Method and Device for Speech Recognition", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the technical field of speech recognition, and in particular, to a method and a device for speech recognition.

Background

With the continuous development of science and technology, voice-based intelligent control technology has made great progress, and voice-controlled household appliances are already used in daily life.

Related speech recognition processing mainly uses a speech recognition model to convert speech data directly into text data; the model can be obtained through training and learning.

The inventors found that this processing method of directly converting speech data into text data has relatively low accuracy; therefore, a method that can provide higher speech recognition accuracy is urgently needed.
Summary of the invention

In view of the above technical problems, embodiments of the present application provide a method and a device for speech recognition.

In a first aspect, an embodiment of the present application provides a method for speech recognition, where the method includes:

acquiring voice data to be recognized;

determining phoneme data corresponding to the voice data;

inputting the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.

Optionally, the method further includes:

obtaining sample phoneme data and corresponding sample text data;

using the sample phoneme data as sample input data and the sample text data as sample output data, training an initial phoneme text conversion model to obtain the phoneme text conversion model.

Optionally, the determining phoneme data corresponding to the voice data includes:

determining, based on a pre-trained speech acoustic model, the phoneme data corresponding to the voice data.
Optionally, the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;

the inputting the phoneme data into a pre-trained phoneme-to-Chinese character conversion model to obtain the Chinese character text corresponding to the voice data includes:

inputting the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;

inputting the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;

inputting the second feature code into the decoder model to obtain a feature code of the text corresponding to the first phoneme unit in the phoneme data;

setting the text sequence number i corresponding to the phoneme data to 1;

inputting the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain a fusion feature code of the text corresponding to the i-th phoneme unit;

inputting the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit;

if the i-th phoneme unit is not the last phoneme unit in the phoneme data, inputting the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increasing the value of i by 1, and returning to the processing step of inputting the first feature code and the feature code of the text corresponding to the i-th phoneme unit into the attention mechanism model;

if the i-th phoneme unit is the last phoneme unit in the phoneme data, sorting the text corresponding to each phoneme unit obtained through the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combining them to obtain the text data corresponding to the voice data.

Optionally, the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
In a second aspect, an embodiment of the present application provides a device for speech recognition, where the device includes:

an acquisition module, configured to acquire voice data to be recognized;

a determining module, configured to determine phoneme data corresponding to the voice data;

a conversion module, configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.

Optionally, the device further includes a training module, configured to:

obtain sample phoneme data and corresponding sample text data;

use the sample phoneme data as sample input data and the sample text data as sample output data to train an initial phoneme text conversion model, obtaining the phoneme text conversion model.

Optionally, the determining module is configured to:

determine, based on a pre-trained speech acoustic model, the phoneme data corresponding to the voice data.

Optionally, the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;

the conversion module is configured to:

input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;

input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;

input the second feature code into the decoder model to obtain a feature code of the text corresponding to the first phoneme unit in the phoneme data;

set the text sequence number i corresponding to the phoneme data to 1;

input the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain a fusion feature code of the text corresponding to the i-th phoneme unit;

input the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit;

if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increase the value of i by 1, and return to the processing step of inputting the first feature code and the feature code of the text corresponding to the i-th phoneme unit into the attention mechanism model;

if the i-th phoneme unit is the last phoneme unit in the phoneme data, sort the text corresponding to each phoneme unit obtained through the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combine them to obtain the text data corresponding to the voice data.

Optionally, the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).

In a third aspect, an embodiment of the present application provides a terminal. The terminal includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for speech recognition described in the first aspect above.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the method for speech recognition described in the first aspect above.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application include at least the following:

In the embodiments of the present application, the speech data recognition process is divided into two parts: the speech data is first converted into phoneme data, and a phoneme text conversion model is then used to convert the phoneme data into text data. Compared with converting speech data directly into text data, this conversion method reduces the span of data conversion, and the conversion of speech data into phoneme data and of phoneme data into text data has higher accuracy. Therefore, this scheme recognizes speech data with higher accuracy.
Brief description of the drawings

In order to describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are merely some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from these drawings without creative efforts.

FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model provided by an embodiment of the present application;

FIG. 3 is a flowchart of a speech recognition method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a terminal provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application.
Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

An embodiment of the present application provides a method for speech recognition. The method may be implemented by a computer device, and the computer device may be a terminal or a server. For example, the method may be implemented by a terminal, where the terminal may be a device with an audio collection function, such as a mobile phone, a computer, an air conditioner, or a television; after collecting voice data, the terminal may perform speech recognition directly based on the method. As another example, the method may also be implemented by a server: a terminal with an audio collection function sends the collected voice data to the server, and the server performs speech recognition based on the method. The following exemplifies a scenario to which the method can be applied; in this scenario, the device implementing the method may be a mobile phone with an audio collection function, or a server.

An instant messaging application may be installed on a mobile phone with an audio collection function, and a voice recognition option may be provided in the dialogue interface of the instant messaging application. After the user selects the voice recognition option, the user can speak the sentence to be sent, and the mobile phone can use the speech recognition method provided in the embodiments of the present application to convert the spoken sentence into text and display it in the text input box of the dialogue interface. In addition, the mobile phone can also send the spoken sentence to the server; the server converts the sentence into text through the speech recognition method provided in the embodiments of the present application and returns the text to the mobile phone, which displays it in the text input box of the dialogue interface.

In this embodiment, only the case where the terminal is the execution body is used as an example for description; other cases are similar and are not described in detail here.
As shown in FIG. 1, the processing flow of the method may include the following steps:

In step 101, voice data to be recognized is acquired.

In implementation, an audio collection device for collecting voice data, such as a microphone, may be installed on the terminal. When the user wants to send text to others through the terminal, the user can first turn on the terminal's voice recognition function and then speak the voice corresponding to the text to be sent toward the terminal's audio collection device; the terminal can then obtain the corresponding voice data through the audio collection device. For example, if a user wants to send the text "What are you doing" to a friend in the terminal's instant messaging application, the user can turn on the voice recognition function and say "What are you doing" to the audio collection device; the terminal then obtains the voice data corresponding to "What are you doing".

Alternatively, when the user wants to control the terminal by voice, the user can first enable the terminal's voice recognition function and then speak the corresponding control phrase to the terminal. For example, to have the instant messaging application on a mobile phone send a text message to Zhang San, the user can say "Open the instant messaging application and send to Zhang San: it is raining today"; or, to control the temperature of an air conditioner, the user can say "Adjust the temperature to 25 degrees" to the air conditioner.

The terminal collects the corresponding voice data through the audio collection device from the moment the user starts speaking until the detected volume falls below a preset threshold; the voice data obtained in this way is the voice data to be recognized. When the user speaks for longer, the terminal can instead obtain multiple segments of voice data of a preset duration from the moment the user starts speaking, and each segment can be treated as voice data to be recognized.
In step 102, the phoneme data corresponding to the voice data to be recognized is determined.

Here, the phoneme data is data composed of identifiers used to represent pronunciation; for example, for Chinese, the phoneme data is the pinyin corresponding to the Chinese characters. The phoneme data may include one or more phoneme units, each phoneme unit corresponds to one character, and each phoneme unit may be composed of one or more pronunciation identifiers. For Chinese, the pronunciation identifiers are the initials and finals of pinyin; for example, the phoneme unit corresponding to the character 我 ("I") is wǒ. As another example, the phoneme unit corresponding to か in Japanese is "ka".

Optionally, in order to convert the voice data into phoneme data with higher accuracy and efficiency, a machine-trained model may be used for the conversion. Accordingly, the processing in step 102 may be as follows: based on a pre-trained speech acoustic model, the phoneme data corresponding to the voice data is determined.

Here, the speech acoustic model is a model constructed based on a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network).

In implementation, the speech acoustic model is trained in advance. A technician can obtain voice data and its corresponding phoneme data from an existing database, or obtain voice data from the Internet and then derive the phoneme data manually from the voice data. The acquired voice data serves as sample voice data and is used as sample input data; the phoneme data corresponding to the voice data serves as sample phoneme data and is used as sample output data. One piece of sample voice data and its corresponding sample phoneme data form one training sample, and the initial speech acoustic model is trained on a large number of such training samples to obtain the required speech acoustic model. Because the amount of sample data is large and training places high demands on a device's computing and storage performance, the training can be performed in a server.

The process of determining the phoneme data corresponding to the voice data may be performed in the terminal. Taking the voice data generated by the user's real-time speech as the input of the speech acoustic model as an example: when the user speaks a sentence to the terminal, the terminal acquires the voice data and inputs it into the speech acoustic model; the CNN in the speech acoustic model performs feature extraction on the voice data to obtain the feature vector corresponding to the voice data, and the feature vector is then processed by the RNN to obtain the corresponding phoneme data.
In step 103, the phoneme data is input into a pre-trained phoneme text conversion model to obtain the text data corresponding to the voice data.

Here, the phoneme text conversion model is a machine-trained model.

In this embodiment, FIG. 2 is a schematic diagram of the speech acoustic model and the phoneme text conversion model. As shown, the phoneme data that the terminal obtains by converting the voice data through the speech acoustic model is input into the pre-trained phoneme text conversion model to obtain the text data corresponding to the phoneme data. The terminal can display the obtained text data; for example, in an instant messaging application, the user can input text through voice, and after the terminal displays the text data, the user can also edit the displayed text data. Alternatively, the terminal can perform a corresponding operation based on the text data; for example, in a smartphone's voice assistant, the user can speak the operation instruction "Call Li Si", and the voice assistant displays "Call Li Si" on the interface and performs the operation.

Optionally, before the phoneme text conversion model is used to convert phoneme data, it may be trained in advance. Accordingly, the processing may be as follows: obtain sample phoneme data and corresponding sample text data; use the sample phoneme data as sample input data and the sample text data as sample output data to train an initial phoneme text conversion model, obtaining the phoneme text conversion model.

In implementation, because nearly six million training samples are used to train the initial phoneme text conversion model, the training process can be performed in the server. To obtain the sample phoneme data and the corresponding sample text data, a technician can first obtain text data from the Internet or an existing database as the sample text data. For each piece of sample text data obtained, the corresponding phoneme data can be obtained by querying a pronunciation dictionary and used as the sample phoneme data. The sample phoneme data is used as sample input data and the corresponding sample text data as sample output data, forming a training sample. This solution can use the back-propagation algorithm as the preset training algorithm to train the initial phoneme text conversion model: the sample input data is input into the initial phoneme text conversion model to obtain output data, and the server then determines the adjustment value of each parameter to be adjusted in the model based on the output data, the sample output data, and the preset training algorithm, and adjusts the corresponding parameters. Each training sample is processed according to this procedure to obtain the final phoneme text conversion model.
Optionally, the phoneme-to-Chinese-character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model. Correspondingly, as shown in FIG. 3, the processing in step 103 may be as follows:
Step 1031: input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data.
Step 1032: input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
Step 1033: input the second feature code into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data; set the character sequence number i corresponding to the phoneme data equal to 1.
Step 1034: input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit.
Step 1035: input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit.
Step 1036: if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to step 1034.
Step 1037: if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
The encoder model and the decoder model above may both be CNNs.
In implementation, the phoneme data obtained from the speech acoustic model is encoded in one-hot form to obtain a corresponding input sequence, which is then input into the encoder model. In the encoder model, an embedding operation maps the input sequence to a unified dimension, so that the relationships among the elements of the input sequence are represented more effectively. Residual connections are used between the convolutional layers of the CNN in the encoder model, so a linear mapping is applied before the encoder model's output to change the vector dimension. The encoder model then outputs the first feature code corresponding to the phoneme data, which may take the form of a feature vector. The first feature code is input into the attention mechanism model to obtain the second feature code corresponding to the phoneme data, which may likewise be a feature vector. Then, the second feature code is input into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data; this feature code may also take the form of a feature vector.
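As an illustrative sketch only, an encoder of the kind just described (embedding of the one-hot input to a unified dimension, residually connected convolutional layers, and a final linear mapping that changes the vector dimension) might look like the following; all layer sizes and the ReLU activation are assumptions, not details taken from the present application.

```python
# A minimal sketch of the described encoder; dimensions are hypothetical.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, num_phonemes, emb_dim=256, layers=4, out_dim=512):
        super().__init__()
        # Embedding maps the one-hot input sequence to a unified dimension.
        self.embed = nn.Embedding(num_phonemes, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
             for _ in range(layers)]
        )
        # Linear mapping changes the vector dimension before the output.
        self.out = nn.Linear(emb_dim, out_dim)

    def forward(self, phoneme_ids):         # (batch, time) integer ids
        h = self.embed(phoneme_ids)         # (batch, time, emb_dim)
        h = h.transpose(1, 2)               # (batch, emb_dim, time) for Conv1d
        for conv in self.convs:
            h = h + torch.relu(conv(h))     # residual connection per layer
        h = h.transpose(1, 2)               # back to (batch, time, emb_dim)
        return self.out(h)                  # the first feature code
```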
Then, the first feature code obtained from the encoder model and the feature code of the character corresponding to the first phoneme unit of the phoneme data are input into the attention mechanism model to obtain the fused feature code of the character corresponding to the first phoneme unit. This fused feature code is then input into the spatial search model, yielding the character corresponding to the first phoneme unit.
Next, the second feature code and the character corresponding to the first phoneme unit are input into the decoder model, yielding the feature code of the character corresponding to the second phoneme unit. The decoder likewise performs an embedding operation, uses residual connections, and applies a linear mapping before its output. The first feature code obtained from the encoder model and the feature code of the character corresponding to the second phoneme unit obtained from the decoder model are then input into the attention mechanism model to obtain the fused feature code of the character corresponding to the second phoneme unit; this fused feature code is input into the spatial search model to obtain the character corresponding to the second phoneme unit. The same operations are performed for the third phoneme unit, the fourth phoneme unit, and so on, and are not repeated here. The loop runs until the character corresponding to the last phoneme unit of the phoneme data has been output. Finally, the obtained characters are sorted according to the order of their corresponding phoneme units in the phoneme data and combined, yielding the text data corresponding to the phoneme data.
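Putting steps 1031 through 1037 together, the inference loop might be sketched as follows. The four sub-models are assumed to be simple callables, and one output character per phoneme unit is assumed; these interfaces, and the search actually performed by the spatial search model (a greedy argmax stands in for it here), are hypothetical.

```python
# A minimal sketch of the decoding loop of steps 1031-1037; the callables
# encoder, attention, decoder and search are hypothetical interfaces.
def phonemes_to_text(phoneme_units, encoder, attention, decoder, search):
    first_code = encoder(phoneme_units)                 # step 1031
    second_code = attention(first_code)                 # step 1032
    char_code = decoder(second_code)                    # step 1033 (1st unit)
    chars = []
    for i, _ in enumerate(phoneme_units):               # i-th phoneme unit
        fused_code = attention(first_code, char_code)   # step 1034
        chars.append(search(fused_code))                # step 1035
        if i + 1 < len(phoneme_units):                  # step 1036
            char_code = decoder(second_code, prev_char=chars[-1])
    return "".join(chars)                               # step 1037
```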
Note that, unlike when the phoneme-to-text conversion model is used for inference, during training it is not the character predicted by the model for the previous phoneme unit that is input into the decoder model, but the correct character corresponding to the previous phoneme unit.
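This training-time substitution is commonly known as teacher forcing; a hedged sketch of the difference follows, reusing the hypothetical decoder interface from the previous sketch.

```python
# Sketch: choosing the decoder's previous-character input.
def previous_char(training, correct_chars, predicted_chars, i):
    if training:
        # Training: feed the correct character for the previous unit.
        return correct_chars[i - 1]
    # Inference: feed the character the model itself just predicted.
    return predicted_chars[i - 1]
```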
As described above, in the embodiments of the present application the speech recognition process is divided into two parts: the speech data is first converted into phoneme data, and the phoneme data is then converted into text data using the phoneme-to-text conversion model. Compared with converting speech data directly into text data, this approach reduces the span of each conversion; the conversion from speech data to phoneme data and the conversion from phoneme data to text data can each be performed with higher accuracy. The solution therefore recognizes speech data more accurately.
The speech recognition method described in this embodiment is implemented with a convolution-based sequence-learning model, which also improves the speed of converting speech data into text data.
Based on the same technical concept, an embodiment of the present application further provides a speech recognition apparatus, which may be the terminal in the foregoing embodiments. As shown in FIG. 4, the apparatus includes an obtaining module 401, a determining module 402, a conversion module 403, and a training module 404.
The obtaining module 401 is configured to obtain speech data to be recognized.
The determining module 402 is configured to determine phoneme data corresponding to the speech data.
The conversion module 403 is configured to input the phoneme data into a pre-trained phoneme-to-text conversion model to obtain text data corresponding to the speech data.
Optionally, the apparatus further includes the training module 404, configured to:
obtain sample phoneme data and corresponding sample text data; and
use the sample phoneme data as sample input data and the sample text data as sample output data, and train an initial phoneme-to-text conversion model to obtain the phoneme-to-text conversion model.
Optionally, the determining module 402 is configured to:
determine the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
Optionally, the phoneme-to-Chinese-character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
the conversion module 403 is configured to:
input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;
input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;
input the second feature code into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data;
set the character sequence number i corresponding to the phoneme data equal to 1;
input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit;
input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit;
if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the obtained characters corresponding to each phoneme unit, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
Optionally, the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
With regard to the apparatus in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
It should be noted that when the speech recognition apparatus provided in the foregoing embodiments performs speech recognition, the division into the above functional modules is merely illustrative. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition apparatus and the speech recognition method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
FIG. 5 is a structural block diagram of a terminal provided by an embodiment of the present application. The terminal 500 may be a portable mobile terminal, such as a smartphone or a tablet computer. The terminal 500 may also be referred to by other names, such as user equipment or portable terminal.
Generally, the terminal 500 includes a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor processes data in the awake state and is also called a CPU (Central Processing Unit), while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 501 may further include an AI (Artificial Intelligence) processor configured to handle computing operations related to machine learning.
The memory 502 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 502 stores at least one instruction, which is executed by the processor 501 to implement the speech recognition method provided in the present application.
In some embodiments, the terminal 500 may optionally further include a peripheral device interface 503 and at least one peripheral device. Specifically, the peripheral device includes at least one of a radio frequency circuit 504, a touch display screen 505, an audio circuit 506, and a power supply 507.
The peripheral device interface 503 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 501 and the memory 502. In some embodiments, the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral device interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 504 is configured to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 504 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication)-related circuits, which is not limited in the present application.
The touch display screen 505 is configured to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. The touch display screen 505 can also collect touch signals on or above its surface; such a touch signal may be input to the processor 501 as a control signal for processing. The touch display screen 505 provides virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one touch display screen 505, disposed on the front panel of the terminal 500; in other embodiments, there may be at least two touch display screens 505, disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the touch display screen 505 may be a flexible display screen disposed on a curved or folded surface of the terminal 500. The touch display screen 505 may even be given a non-rectangular, irregular shape, that is, a special-shaped screen. The touch display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The audio circuit 506 is configured to provide an audio interface between the user and the terminal 500. The audio circuit 506 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 501 for processing or to the radio frequency circuit 504 for voice communication. For stereo collection or noise reduction, there may be multiple microphones, disposed at different parts of the terminal 500. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker converts electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 506 may further include a headphone jack.
The power supply 507 is configured to supply power to the components of the terminal 500. The power supply 507 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 507 includes a rechargeable battery, the battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
Those skilled in the art will understand that the structure shown in FIG. 5 does not limit the terminal 500, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
In an exemplary embodiment, a computer-readable storage medium is further provided. The storage medium stores at least one instruction, which is loaded and executed by a processor to implement the speech recognition method in the foregoing embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 600 may vary greatly in configuration or performance and may include one or more processors (central processing units, CPUs) 601 and one or more memories 602, where the memory 602 stores at least one instruction, which is loaded and executed by the processor 601 to implement the speech recognition method described above.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (13)

  1. A speech recognition method, characterized in that the method comprises:
    obtaining speech data to be recognized;
    determining phoneme data corresponding to the speech data;
    inputting the phoneme data into a pre-trained phoneme-to-text conversion model to obtain text data corresponding to the speech data.
  2. The method according to claim 1, characterized in that the method further comprises:
    obtaining sample phoneme data and corresponding sample text data;
    using the sample phoneme data as sample input data and the sample text data as sample output data, and training an initial phoneme-to-text conversion model to obtain the phoneme-to-text conversion model.
  3. The method according to claim 1, characterized in that determining the phoneme data corresponding to the speech data comprises:
    determining the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
  4. The method according to claim 1, characterized in that the phoneme-to-Chinese-character conversion model comprises an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
    inputting the phoneme data into the pre-trained phoneme-to-Chinese-character conversion model to obtain the Chinese-character text corresponding to the speech data comprises:
    inputting the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;
    inputting the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;
    inputting the second feature code into the decoder model to obtain a feature code of the character corresponding to the first phoneme unit in the phoneme data;
    setting the character sequence number i corresponding to the phoneme data equal to 1;
    inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain a fused feature code of the character corresponding to the i-th phoneme unit;
    inputting the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit;
    if the i-th phoneme unit is not the last phoneme unit in the phoneme data, inputting the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, incrementing i by 1, and returning to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
    if the i-th phoneme unit is the last phoneme unit in the phoneme data, combining the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
  5. The method according to claim 4, characterized in that the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
  6. A speech recognition apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain speech data to be recognized;
    a determining module, configured to determine phoneme data corresponding to the speech data;
    a conversion module, configured to input the phoneme data into a pre-trained phoneme-to-text conversion model to obtain text data corresponding to the speech data.
  7. The apparatus according to claim 6, characterized in that the apparatus further comprises a training module, configured to:
    obtain sample phoneme data and corresponding sample text data;
    use the sample phoneme data as sample input data and the sample text data as sample output data, and train an initial phoneme-to-text conversion model to obtain the phoneme-to-text conversion model.
  8. The apparatus according to claim 6, characterized in that the determining module is configured to:
    determine the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
  9. The apparatus according to claim 6, characterized in that the phoneme-to-Chinese-character conversion model comprises an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
    the conversion module is configured to:
    input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;
    input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;
    input the second feature code into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data;
    set the character sequence number i corresponding to the phoneme data equal to 1;
    input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit;
    input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit;
    if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
    if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
  10. The apparatus according to claim 9, characterized in that the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
  11. A terminal, characterized in that the terminal comprises a processor, a memory, an audio acquisition device, and a display, wherein:
    the audio acquisition device is configured to obtain speech data to be recognized;
    the processor is configured to determine phoneme data corresponding to the speech data, and to input the phoneme data into a pre-trained phoneme-to-text conversion model stored in the memory to obtain text data corresponding to the speech data;
    the display is configured to display the text data.
  12. The terminal according to claim 11, characterized in that the processor is configured to:
    determine the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
  13. The terminal according to claim 11, characterized in that the phoneme-to-Chinese-character conversion model comprises an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
    the processor is configured to:
    input the phoneme data into the encoder model stored in the memory to obtain a first feature code corresponding to the phoneme data;
    input the first feature code into the attention mechanism model stored in the memory to obtain a second feature code corresponding to the phoneme data;
    input the second feature code into the decoder model stored in the memory to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data;
    set the character sequence number i corresponding to the phoneme data equal to 1;
    input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit;
    input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model stored in the memory to obtain the character corresponding to the i-th phoneme unit;
    if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
    if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
PCT/CN2019/106909 2018-09-20 2019-09-20 Voice recognition method and apparatus WO2020057624A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811099967.4A CN110931000B (en) 2018-09-20 2018-09-20 Method and device for speech recognition
CN201811099967.4 2018-09-20

Publications (1)

Publication Number Publication Date
WO2020057624A1 true WO2020057624A1 (en) 2020-03-26

Family

ID=69856142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106909 WO2020057624A1 (en) 2018-09-20 2019-09-20 Voice recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN110931000B (en)
WO (1) WO2020057624A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986653A (en) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment
CN113160820B (en) * 2021-04-28 2024-02-27 百度在线网络技术(北京)有限公司 Speech recognition method, training method, device and equipment of speech recognition model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292538A1 (en) * 2008-05-20 2009-11-26 Calabrio, Inc. Systems and methods of improving automated speech recognition accuracy using statistical analysis of search terms
CN104637482A (en) * 2015-01-19 2015-05-20 孔繁泽 Voice recognition method, device, system and language switching system
CN106021249A (en) * 2015-09-16 2016-10-12 展视网(北京)科技有限公司 Method and system for voice file retrieval based on content
CN107731228A (en) * 2017-09-20 2018-02-23 百度在线网络技术(北京)有限公司 The text conversion method and device of English voice messaging
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9159317B2 (en) * 2013-06-14 2015-10-13 Mitsubishi Electric Research Laboratories, Inc. System and method for recognizing speech
KR20180080446A (en) * 2017-01-04 2018-07-12 삼성전자주식회사 Voice recognizing method and voice recognizing appratus
CN108170686B (en) * 2017-12-29 2020-02-14 科大讯飞股份有限公司 Text translation method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507958A (en) * 2020-12-22 2021-03-16 成都东方天呈智能科技有限公司 System and method for converting feature codes of different face recognition models and readable storage medium
CN112507958B (en) * 2020-12-22 2024-04-02 成都东方天呈智能科技有限公司 Conversion system of different face recognition model feature codes and readable storage medium
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN113838456A (en) * 2021-09-28 2021-12-24 科大讯飞股份有限公司 Phoneme extraction method, voice recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110931000B (en) 2022-08-02
CN110931000A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
WO2020057624A1 (en) Voice recognition method and apparatus
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
WO2021022992A1 (en) Dialog generation model training method and device, and dialog generation method and device, and medium
CN110634507A (en) Speech classification of audio for voice wakeup
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110890093A (en) Intelligent device awakening method and device based on artificial intelligence
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
CN110263131B (en) Reply information generation method, device and storage medium
CN110364156A (en) Voice interactive method, system, terminal and readable storage medium storing program for executing
CN109308900B (en) Earphone device, voice processing system and voice processing method
CN110225386A (en) A kind of display control method, display equipment
CN112912955B (en) Electronic device and system for providing speech recognition based services
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN114360510A (en) Voice recognition method and related device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
JP6448950B2 (en) Spoken dialogue apparatus and electronic device
CN110111795B (en) Voice processing method and terminal equipment
WO2021147417A1 (en) Voice recognition method and apparatus, computer device, and computer-readable storage medium
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN115841814A (en) Voice interaction method and electronic equipment
CN117012202B (en) Voice channel recognition method and device, storage medium and electronic equipment
CN113823278B (en) Speech recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1
