WO2020057624A1 - Voice recognition method and apparatus - Google Patents

Voice recognition method and apparatus

Info

Publication number
WO2020057624A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
data
model
text
feature code
Application number
PCT/CN2019/106909
Other languages
French (fr)
Chinese (zh)
Inventor
郝婧
陈凯
谢迪
浦世亮
Original Assignee
杭州海康威视数字技术股份有限公司
Application filed by 杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.)
Publication of WO2020057624A1


Classifications

    • G PHYSICS → G10 MUSICAL INSTRUMENTS; ACOUSTICS → G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition → G10L15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice → G10L15/063 Training
    • G10L15/00 Speech recognition → G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/00 Speech recognition → G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification → G10L17/04 Training, enrolment or model building
    • G10L17/00 Speaker identification or verification → G10L17/18 Artificial neural networks; connectionist approaches

Definitions

  • the present invention relates to the technical field of speech recognition, and in particular, to a method and a device for speech recognition.
  • the related technology of speech recognition processing mainly uses a speech recognition model to directly convert speech data into text data, which can be obtained through training and learning.
  • embodiments of the present application provide a method and a device for speech recognition.
  • an embodiment of the present application provides a method for speech recognition, where the method includes:
  • acquiring voice data to be recognized;
  • determining phoneme data corresponding to the voice data;
  • inputting the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.
  • the method further includes:
  • obtaining sample phoneme data and corresponding sample text data; using the sample phoneme data as sample input data and the sample text data as sample output data, the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining phoneme data corresponding to the voice data includes:
  • based on a pre-trained speech acoustic model, phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the text corresponding to each phoneme unit obtained through the spatial search model is sorted according to the order of the corresponding phoneme units in the phoneme data and combined to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network (CNN), and the decoder model is also a CNN.
  • an embodiment of the present application provides a device for voice recognition, where the device includes:
  • an acquisition module configured to acquire voice data to be recognized;
  • a determining module configured to determine phoneme data corresponding to the voice data;
  • a conversion module is configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the speech data.
  • the device further includes a training module for:
  • obtaining sample phoneme data and corresponding sample text data; using the sample phoneme data as sample input data and the sample text data as sample output data, the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining module is configured to:
  • based on a pre-trained speech acoustic model, phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the conversion module is configured to:
  • the text corresponding to each phoneme unit obtained through the spatial search model is sorted according to the order of the corresponding phoneme units in the phoneme data and combined to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network (CNN), and the decoder model is also a CNN.
  • an embodiment of the present application provides a terminal.
  • the terminal includes a processor and a memory.
  • the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for speech recognition described in the first aspect above.
  • an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the method described in the first aspect above.
  • the speech data recognition process is divided into two parts. First, the speech data is converted into phoneme data, and then the phoneme text conversion model is used to convert the phoneme data into text data. Compared with directly converting speech data to text data, this conversion method reduces the span of data conversion. The conversion of speech data to phoneme data and the conversion of phoneme data to text data have higher accuracy. Therefore, this scheme has a higher accuracy for speech data recognition.
  • FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of a voice recognition method according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
  • An embodiment of the present application provides a method for speech recognition.
  • the method may be implemented by a computer device, and the computer device may be a terminal or a server.
  • the method may be implemented by a terminal, and the terminal may be a device with an audio collection function, such as a mobile phone, a computer, an air conditioner, or a television.
  • the terminal may directly perform voice recognition based on the method.
  • the method can also be implemented by a server, that is, the collected voice data is sent to the server at a terminal with an audio collection function, and the server performs voice recognition based on the method.
  • the following exemplifies a scenario to which the method can be applied.
  • the device implementing the method in this scenario can be a mobile phone with an audio collection function, or a server.
  • An instant messaging application can be installed in a mobile phone with audio capture function, and a voice recognition option can be provided in the dialogue interface of the instant messaging application.
  • after the user selects the voice recognition option, the user can speak the sentence to be sent, and the mobile phone can use the speech recognition method provided in the embodiments of the present application to convert the sentence into text and display it in the text input box of the dialogue interface.
  • the mobile phone can also send the sentence spoken by the user to the server; the server converts the sentence into text through the speech recognition method provided in the embodiments of the present application and returns the text to the mobile phone, which displays it in the text input box of the dialogue interface.
  • the processing flow of the method may include the following steps:
  • in step 101, voice data to be recognized is acquired.
  • an audio collection device for collecting voice data, such as a microphone, may be installed on the terminal.
  • when the user wants to send text to others through the terminal, the user can first turn on the terminal's voice recognition function and then speak the voice corresponding to the text to be sent toward the terminal's audio collection device; the terminal can then obtain the corresponding voice data through the audio collection device.
  • for example, if a user wants to send the text "What are you doing" to a friend in the terminal's instant messaging application, the user can turn on the voice recognition function and say "What are you doing" to the audio collection device; the terminal then obtains the voice data corresponding to "What are you doing".
  • alternatively, when the user wants to control the terminal by voice, the user can first enable the terminal's voice recognition function and then speak the corresponding control phrase to the terminal. For example, to have the instant messaging application on a mobile phone send a text message to Zhang San, the user can say "Open the instant messaging application and send to Zhang San: it is raining today"; or, to control the temperature of an air conditioner, the user can say "Adjust the temperature to 25 degrees" to the air conditioner.
  • the terminal collects the corresponding voice data through the audio collection device from the moment the user starts speaking until the detected volume falls below a preset threshold; the voice data obtained in this way is the voice data to be recognized. When the user speaks for longer, the terminal can instead obtain multiple segments of voice data of a preset duration from the moment the user starts speaking, and each segment can be treated as voice data to be recognized.
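A minimal sketch of this endpointing logic, assuming 16 kHz mono PCM input and an illustrative RMS threshold; the patent only states that recording ends when the detected volume falls below a preset threshold, so all concrete values here are assumptions:

```python
import numpy as np

def capture_utterance(samples, rate=16000, frame_ms=30, threshold=0.01):
    """Keep frames from speech onset until volume drops below the threshold.

    `threshold` is an illustrative RMS level, not a value from the patent.
    """
    frame = int(rate * frame_ms / 1000)
    voiced, started = [], False
    for i in range(0, len(samples) - frame + 1, frame):
        rms = float(np.sqrt(np.mean(samples[i:i + frame] ** 2)))
        if rms >= threshold:
            started = True
            voiced.append(samples[i:i + frame])
        elif started:
            break  # volume fell below the preset threshold: utterance ended
    return np.concatenate(voiced) if voiced else np.zeros(0)
```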
  • in step 102, phoneme data corresponding to the speech data to be recognized is determined.
  • the phoneme data is data composed of identifiers used to represent pronunciation.
  • for example, for Chinese, the phoneme data is the pinyin corresponding to the Chinese characters.
  • the phoneme data can include one or more phoneme units, each phoneme unit corresponds to a word, and each phoneme unit can be composed of one or more pronunciation identifiers.
  • for Chinese, the pronunciation identifiers are the initials and finals of pinyin; for example, the phoneme unit corresponding to the character 我 ("I") is wǒ. As another example, the phoneme unit corresponding to か in Japanese is "ka".
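For illustration, the phoneme data for a short Mandarin utterance might be represented as an ordered list of phoneme units, one per character, each composed of its pronunciation identifiers. This is a hypothetical representation; the patent does not fix a concrete data format:

```python
# Hypothetical phoneme data for 你在干什么 ("What are you doing"):
# one phoneme unit per character; each unit is made of pronunciation
# identifiers (a pinyin initial plus a final, tone attached to the final).
phoneme_data = [
    {"unit": "ni3",   "identifiers": ["n", "i3"]},
    {"unit": "zai4",  "identifiers": ["z", "ai4"]},
    {"unit": "gan4",  "identifiers": ["g", "an4"]},
    {"unit": "shen2", "identifiers": ["sh", "en2"]},
    {"unit": "me5",   "identifiers": ["m", "e5"]},
]
```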
  • optionally, in order to convert the speech data into phoneme data with higher accuracy and efficiency, a machine-trained model may be used for the conversion. Accordingly, the processing in step 102 may be as follows: based on a pre-trained speech acoustic model, the phoneme data corresponding to the speech data is determined.
  • the speech acoustic model is a model constructed based on CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network).
  • the voice acoustic model should be trained in advance.
  • a technician can obtain the voice data and its corresponding phoneme data from an existing database, or obtain the voice data from the Internet, and then obtain the phoneme data manually based on the voice data.
  • the acquired voice data serves as sample voice data and is used as sample input data; the phoneme data corresponding to the voice data serves as sample phoneme data and is used as sample output data.
  • one piece of sample voice data and its corresponding sample phoneme data form one training sample, and the initial speech acoustic model is trained on a large number of such training samples to obtain the required speech acoustic model. Because the amount of sample data is large and training places high demands on a device's computing and storage performance, the training can be performed in a server.
  • the process of determining the phoneme data corresponding to the voice data may be performed in the terminal.
  • the CNN performs feature extraction on the voice data to obtain the feature vector corresponding to the voice data.
  • the feature vector is then processed by the RNN to obtain the corresponding phoneme data.
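A minimal sketch of such a CNN-plus-RNN acoustic model, assuming PyTorch; the layer sizes, the choice of a GRU, and the per-frame phoneme logits are illustrative assumptions, not details from the patent:

```python
import torch.nn as nn

class SpeechAcousticModel(nn.Module):
    """CNN front end extracts features; RNN maps them to phoneme scores."""

    def __init__(self, n_mels=80, hidden=256, n_phonemes=410):
        super().__init__()
        # CNN: feature extraction over the input spectrogram frames.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # RNN: turns the feature-vector sequence into phoneme predictions.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, mels):           # mels: (batch, n_mels, frames)
        feats = self.cnn(mels)         # (batch, hidden, frames)
        seq, _ = self.rnn(feats.transpose(1, 2))
        return self.out(seq)           # per-frame phoneme logits
```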
  • in step 103, the phoneme data is input into a pre-trained phoneme text conversion model to obtain text data corresponding to the speech data.
  • the phoneme text conversion model is a machine training model.
  • FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model in the embodiment of the present application.
  • the phoneme data that the terminal obtains by converting the speech data through the speech acoustic model is input into the pre-trained phoneme text conversion model to obtain the text data corresponding to the phoneme data.
  • the terminal can display the obtained text data.
  • for example, in an instant messaging application, the user can input text through voice; after the terminal displays the text data, the user can also edit the displayed text data.
  • the terminal can perform corresponding operations based on the text data.
  • for example, in a smartphone's voice assistant, the user can speak the operation instruction "Call Li Si"; the voice assistant then displays "Call Li Si" on the interface and performs the operation.
  • before the phoneme text conversion model is used for conversion, it may be trained in advance. Accordingly, the processing may be as follows: obtain sample phoneme data and corresponding sample text data; use the sample phoneme data as sample input data and the sample text data as sample output data to train an initial phoneme text conversion model, obtaining the phoneme text conversion model.
  • because nearly six million training samples are used to train the initial phoneme text conversion model, this training process can be performed in the server.
  • a technician can first obtain the text data from the Internet or an existing database as the sample text data.
  • for each piece of sample text data obtained, the corresponding phoneme data can be obtained by querying a pronunciation dictionary and used as the sample phoneme data.
  • the sample phoneme data is used as sample input data, and the sample text data corresponding to the sample phoneme data is used as sample output data to form a training sample.
  • This solution can use the back-propagation algorithm as a preset training algorithm to train the initial phonetic text conversion model.
  • the sample input data is input into the initial phoneme text conversion model to obtain output data; the server then determines the adjustment value of each parameter to be adjusted in the model based on the output data, the sample output data, and the preset training algorithm, and adjusts the corresponding parameters. Each training sample is processed according to this procedure to obtain the final phoneme text conversion model.
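A condensed sketch of this training loop, assuming PyTorch, a hypothetical `PhonemeTextModel` that takes phoneme IDs and teacher-forced character IDs, and cross-entropy loss; the patent specifies only back-propagation as the preset training algorithm, so the Adam optimizer is an illustrative choice:

```python
import torch
import torch.nn as nn

def train_phoneme_text_model(model, samples, epochs=10, lr=1e-3):
    """samples: iterable of (phoneme_ids, char_ids) LongTensor pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for phoneme_ids, char_ids in samples:
            logits = model(phoneme_ids, char_ids)   # sample input data in
            loss = loss_fn(logits.flatten(0, 1),    # output vs. sample
                           char_ids.flatten())      # output data
            opt.zero_grad()
            loss.backward()                         # back-propagation
            opt.step()                              # adjust parameters
    return model
```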
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model.
  • the processing in step 103 may be as follows:
  • Step 1031: input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data.
  • Step 1032: input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
  • Step 1033: input the second feature code into the decoder model to obtain the feature code of the text corresponding to the first phoneme unit in the phoneme data, and set the text sequence number i corresponding to the phoneme data to 1.
  • Step 1034: input the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fusion feature code of the text corresponding to the i-th phoneme unit.
  • Step 1035: input the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit.
  • Step 1036: if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increase the value of i by 1, and return to step 1034.
  • Step 1037: if the i-th phoneme unit is the last phoneme unit in the phoneme data, sort the text corresponding to each phoneme unit obtained through the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combine them to obtain the text data corresponding to the voice data.
  • the above-mentioned encoder model and decoder model can both adopt CNN.
  • the phoneme data obtained from the above-mentioned speech acoustic model is coded in the form of One-Hot coding, and the corresponding input sequence is obtained.
  • the input sequence is then input into the encoder model.
  • in the encoder model, the embedding operation maps the input sequence to a unified dimension, so that the relationships between the elements of the input sequence are represented more effectively.
  • in the encoder model, residual connections are used between the convolutional layers of the CNN, so a linear mapping must be performed before the encoder model's output to change the vector dimension.
  • the encoder model then outputs a first feature code corresponding to the phoneme data, which may be in the form of a feature vector.
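A sketch of such an encoder block, again assuming PyTorch; the number of layers and the dimensions are illustrative, and `vocab` stands for an assumed phoneme-vocabulary size:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Embedding -> residual 1-D convolutions -> linear mapping."""

    def __init__(self, vocab=1000, embed=256, out_dim=256):
        super().__init__()
        # Embedding maps the one-hot input sequence to a unified dimension.
        self.embed = nn.Embedding(vocab, embed)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed, embed, kernel_size=3, padding=1) for _ in range(4)]
        )
        # Linear mapping changes the vector dimension before the output.
        self.proj = nn.Linear(embed, out_dim)

    def forward(self, phoneme_ids):           # (batch, seq_len) integer ids
        x = self.embed(phoneme_ids)           # (batch, seq_len, embed)
        h = x.transpose(1, 2)                 # (batch, embed, seq_len)
        for conv in self.convs:
            h = h + torch.relu(conv(h))       # residual connection per layer
        return self.proj(h.transpose(1, 2))   # the first feature code
```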
  • the first feature code is input into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
  • the second feature code may also be in the form of a feature vector.
  • the second feature code is input into the decoder model to obtain the feature code of the text corresponding to the first phoneme unit in the phoneme data; this feature code may also be in the form of a feature vector.
  • the first feature code obtained from the encoder model and the feature code of the text corresponding to the first phoneme unit of the phoneme data are input into the attention mechanism model to obtain the fusion feature code of the text corresponding to the first phoneme unit.
  • the fusion feature code of the text corresponding to the first phoneme unit is then input into the spatial search model to obtain the text corresponding to the first phoneme unit.
  • next, the second feature code and the text corresponding to the first phoneme unit are input into the decoder model to obtain the feature code of the text corresponding to the second phoneme unit.
  • like the encoder, the decoder also performs the embedding operation, uses residual connections between its convolutional layers, and performs a linear mapping operation before its output.
  • the first feature code obtained from the encoder model and the feature code of the text corresponding to the second phoneme unit obtained from the decoder model are input into the attention mechanism model to obtain the fusion feature code of the text corresponding to the second phoneme unit.
  • the fusion feature code of the text corresponding to the second phoneme unit is input into the spatial search model, and the text corresponding to the second phoneme unit can be obtained.
  • the same operations are performed for the subsequent third phoneme unit, fourth phoneme unit, and so on; details are not repeated here.
  • This loop operation process is performed until the text corresponding to the last phoneme unit of the phoneme data is output.
  • the obtained characters can be sorted according to the order of their respective phoneme units in the phoneme data, and then combined together to obtain the text data corresponding to the phoneme data.
  • it should be noted that, during training, the text predicted by the phoneme text conversion model for the previous phoneme unit is not input to the decoder model; instead, the correct text corresponding to the previous phoneme unit is input to the decoder model.
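Putting steps 1031 through 1037 together, the inference loop might be sketched as follows; `encoder`, `decoder`, `attention`, and `search` are hypothetical callables standing in for the four sub-models (with greedy selection standing in for the spatial search), not interfaces defined by the patent:

```python
import torch

def phonemes_to_text(phoneme_ids, encoder, decoder, attention, search, bos_id=0):
    """Steps 1031-1037: encode, attend, decode unit by unit, search."""
    first_code = encoder(phoneme_ids)                # step 1031
    second_code = attention(first_code, first_code)  # step 1032
    prev_text = torch.tensor([[bos_id]])
    text_code = decoder(prev_text, second_code)      # step 1033
    chars = []
    for i in range(phoneme_ids.size(1)):             # one character per unit
        fused = attention(first_code, text_code)     # step 1034
        char = search(fused)                         # step 1035
        chars.append(char)
        if i + 1 < phoneme_ids.size(1):              # step 1036
            prev_text = torch.cat([prev_text, char.view(1, 1)], dim=1)
            text_code = decoder(prev_text, second_code)
    return torch.stack(chars)                        # step 1037: combine
```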
  • the speech data recognition process is divided into two parts.
  • the speech data is first converted into phoneme data, and then the phoneme data is converted into text data by using a phoneme text conversion model.
  • this conversion method reduces the span of data conversion.
  • the conversion of speech data to phoneme data and the conversion of phoneme data to text data have higher accuracy. Therefore, this scheme has a higher accuracy for speech data recognition.
  • the speech recognition method described in this embodiment is implemented by a convolution-based sequence learning model, and the conversion speed of speech data to text data is also improved.
  • an embodiment of the present application further provides a device for speech recognition.
  • the device may be a terminal in the foregoing embodiment.
  • the device includes an acquisition module 401, a determining module 402, and a conversion module 403.
  • An acquisition module 401 configured to acquire voice data to be identified
  • a determining module 402 configured to determine phoneme data corresponding to the voice data
  • a conversion module 403 is configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.
  • the device further includes a training module 404, configured to:
  • obtaining sample phoneme data and corresponding sample text data; using the sample phoneme data as sample input data and the sample text data as sample output data, the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining module 402 is configured to:
  • based on a pre-trained speech acoustic model, phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the conversion module 403 is configured to:
  • the obtained text corresponding to each phoneme unit is combined according to the order of the corresponding phoneme units in the phoneme data to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network (CNN), and the decoder model is also a CNN.
  • it should be noted that, when the speech recognition device provided by the above embodiment performs speech recognition, the division into the above functional modules is only used as an example. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the speech recognition apparatus and the speech recognition method embodiments provided by the foregoing embodiments belong to the same concept. For specific implementation processes, refer to the method embodiments, and details are not described herein again.
  • FIG. 5 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • the terminal 500 may be a portable mobile terminal, such as a smart phone or a tablet computer.
  • the terminal 500 may also be called other names such as user equipment, portable terminal, and the like.
  • the terminal 500 includes a processor 501 and a memory 502.
  • the processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 501 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 501 may also include a main processor and a co-processor.
  • the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the co-processor is a low-power processor for processing data in the standby state.
  • the processor 501 may be integrated with a GPU (Graphics Processing Unit), and the GPU is responsible for rendering and drawing content required to be displayed on the display screen.
  • the processor 501 may further include an AI (Artificial Intelligence) processor, and the AI processor is configured to process computing operations related to machine learning.
  • the memory 502 may include one or more computer-readable storage media, which may be tangible and non-transitory.
  • the memory 502 may also include high-speed random access memory, and non-volatile memory, such as one or more disk storage devices, flash storage devices.
  • the non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction, which is executed by the processor 501 to implement the speech recognition method provided in this application.
  • the terminal 500 may further include a peripheral device interface 503 and at least one peripheral device.
  • the peripheral device includes at least one of a radio frequency circuit 504, a touch display screen 505, an audio circuit 506, and a power source 507.
  • the peripheral device interface 503 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 501 and the memory 502.
  • in some embodiments, the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral device interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 504 is used for receiving and transmitting an RF (Radio Frequency) signal, also called an electromagnetic signal.
  • the radio frequency circuit 504 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 504 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.
  • the radio frequency circuit 504 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocols include, but are not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 504 may further include circuits related to Near Field Communication (NFC), which is not limited in this application.
  • the touch display screen 505 is used to display a UI (User Interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the touch display screen 505 also has the ability to collect touch signals on or above the surface of the touch display screen 505.
  • the touch signal can be input to the processor 501 as a control signal for processing.
  • the touch display screen 505 is used to provide virtual buttons and / or virtual keyboards, which are also called soft buttons and / or soft keyboards.
  • in some embodiments, there may be one touch display screen 505, provided on the front panel of the terminal 500; in other embodiments, there may be at least two touch display screens 505, respectively provided on different surfaces of the terminal 500 or in a folded design.
  • the touch display screen 505 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 500. Furthermore, the touch display screen 505 can also be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the touch display screen 505 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the audio circuit 506 is used to provide an audio interface between the user and the terminal 500.
  • the audio circuit 506 may include a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 501 for processing, or input them to the radio frequency circuit 504 to implement voice communication.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves.
  • the speaker can be a traditional film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but can also convert electrical signals into sound waves inaudible to humans for purposes such as ranging.
  • the audio circuit 506 may further include a headphone jack.
  • the power supply 507 is used to supply power to various components in the terminal 500.
  • the power source 507 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
  • the wired rechargeable battery is a battery charged through a wired line
  • the wireless rechargeable battery is a battery charged through a wireless coil.
  • the rechargeable battery can also be used to support fast charging technology.
  • those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the terminal 500, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • in an exemplary embodiment, a computer-readable storage medium is further provided; the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the speech recognition method in the foregoing embodiments.
  • the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
  • FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 600 may vary greatly due to differences in configuration or performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where at least one instruction is stored in the memory 602, and the at least one instruction is loaded and executed by the processor 601 to implement the speech recognition method described above.
  • a person of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, or an optical disk.

Abstract

Provided are a voice recognition method and apparatus, which belong to the field of voice recognition. The method comprises: acquiring voice data to be recognized (101); determining phoneme data corresponding to the voice data (102); and inputting the phoneme data into a pre-trained phoneme text conversion model, so as to obtain text data corresponding to the voice data (103). The method can improve the accuracy of recognizing voice data.

Description

Method and device for speech recognition

This application claims priority to Chinese Patent Application No. 201811099967.4, filed on September 20, 2018 and entitled "Method and Device for Speech Recognition", the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the technical field of speech recognition, and in particular, to a method and a device for speech recognition.

Background

With the continuous development of science and technology, voice-based intelligent control technology has made great progress, and voice-controlled household appliances are already used in daily life.

Related speech recognition processing mainly uses a speech recognition model to convert speech data directly into text data; the model can be obtained through training and learning.

The inventors found that this processing method of directly converting speech data into text data has relatively low accuracy; therefore, a method that can provide higher speech recognition accuracy is urgently needed.
Summary of the invention

In view of the above technical problems, embodiments of the present application provide a method and a device for speech recognition.

In a first aspect, an embodiment of the present application provides a method for speech recognition, where the method includes:

acquiring voice data to be recognized;

determining phoneme data corresponding to the voice data;

inputting the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.

Optionally, the method further includes:

obtaining sample phoneme data and corresponding sample text data;

using the sample phoneme data as sample input data and the sample text data as sample output data, training an initial phoneme text conversion model to obtain the phoneme text conversion model.

Optionally, the determining phoneme data corresponding to the voice data includes:

determining, based on a pre-trained speech acoustic model, the phoneme data corresponding to the voice data.
Optionally, the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;

the inputting the phoneme data into a pre-trained phoneme-to-Chinese character conversion model to obtain the Chinese character text corresponding to the voice data includes:

inputting the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;

inputting the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;

inputting the second feature code into the decoder model to obtain a feature code of the text corresponding to the first phoneme unit in the phoneme data;

setting the text sequence number i corresponding to the phoneme data to 1;

inputting the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain a fusion feature code of the text corresponding to the i-th phoneme unit;

inputting the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit;

if the i-th phoneme unit is not the last phoneme unit in the phoneme data, inputting the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increasing the value of i by 1, and returning to the processing step of inputting the first feature code and the feature code of the text corresponding to the i-th phoneme unit into the attention mechanism model;

if the i-th phoneme unit is the last phoneme unit in the phoneme data, sorting the text corresponding to each phoneme unit obtained through the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combining them to obtain the text data corresponding to the voice data.

Optionally, the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
In a second aspect, an embodiment of the present application provides a device for speech recognition, where the device includes:

an acquisition module, configured to acquire voice data to be recognized;

a determining module, configured to determine phoneme data corresponding to the voice data;

a conversion module, configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.

Optionally, the device further includes a training module, configured to:

obtain sample phoneme data and corresponding sample text data;

use the sample phoneme data as sample input data and the sample text data as sample output data to train an initial phoneme text conversion model, obtaining the phoneme text conversion model.

Optionally, the determining module is configured to:

determine, based on a pre-trained speech acoustic model, the phoneme data corresponding to the voice data.

Optionally, the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;

the conversion module is configured to:

input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;

input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;

input the second feature code into the decoder model to obtain a feature code of the text corresponding to the first phoneme unit in the phoneme data;

set the text sequence number i corresponding to the phoneme data to 1;

input the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain a fusion feature code of the text corresponding to the i-th phoneme unit;

input the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit;

if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increase the value of i by 1, and return to the processing step of inputting the first feature code and the feature code of the text corresponding to the i-th phoneme unit into the attention mechanism model;

if the i-th phoneme unit is the last phoneme unit in the phoneme data, sort the text corresponding to each phoneme unit obtained through the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combine them to obtain the text data corresponding to the voice data.

Optionally, the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).

In a third aspect, an embodiment of the present application provides a terminal. The terminal includes a processor and a memory, the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for speech recognition described in the first aspect above.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the method for speech recognition described in the first aspect above.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application include at least the following:

In the embodiments of the present application, the speech data recognition process is divided into two parts: the speech data is first converted into phoneme data, and a phoneme text conversion model is then used to convert the phoneme data into text data. Compared with converting speech data directly into text data, this conversion method reduces the span of data conversion, and the conversion of speech data into phoneme data and of phoneme data into text data has higher accuracy. Therefore, this scheme recognizes speech data with higher accuracy.
Brief description of the drawings

In order to describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the accompanying drawings in the following description are merely some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from these drawings without creative efforts.

FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model provided by an embodiment of the present application;

FIG. 3 is a flowchart of a speech recognition method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a terminal provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application.
Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

An embodiment of the present application provides a method for speech recognition. The method may be implemented by a computer device, and the computer device may be a terminal or a server. For example, the method may be implemented by a terminal, where the terminal may be a device with an audio collection function, such as a mobile phone, a computer, an air conditioner, or a television; after collecting voice data, the terminal may perform speech recognition directly based on the method. As another example, the method may also be implemented by a server: a terminal with an audio collection function sends the collected voice data to the server, and the server performs speech recognition based on the method. The following exemplifies a scenario to which the method can be applied; in this scenario, the device implementing the method may be a mobile phone with an audio collection function, or a server.

An instant messaging application may be installed on a mobile phone with an audio collection function, and a voice recognition option may be provided in the dialogue interface of the instant messaging application. After the user selects the voice recognition option, the user can speak the sentence to be sent, and the mobile phone can use the speech recognition method provided in the embodiments of the present application to convert the spoken sentence into text and display it in the text input box of the dialogue interface. In addition, the mobile phone can also send the spoken sentence to the server; the server converts the sentence into text through the speech recognition method provided in the embodiments of the present application and returns the text to the mobile phone, which displays it in the text input box of the dialogue interface.

In this embodiment, only the case where the terminal is the execution body is used as an example for description; other cases are similar and are not described in detail here.
As shown in FIG. 1, the processing flow of the method may include the following steps:

In step 101, voice data to be recognized is acquired.

In implementation, an audio collection device for collecting voice data, such as a microphone, may be installed on the terminal. When the user wants to send text to others through the terminal, the user can first turn on the terminal's voice recognition function and then speak the voice corresponding to the text to be sent toward the terminal's audio collection device; the terminal can then obtain the corresponding voice data through the audio collection device. For example, if a user wants to send the text "What are you doing" to a friend in the terminal's instant messaging application, the user can turn on the voice recognition function and say "What are you doing" to the audio collection device; the terminal then obtains the voice data corresponding to "What are you doing".

Alternatively, when the user wants to control the terminal by voice, the user can first enable the terminal's voice recognition function and then speak the corresponding control phrase to the terminal. For example, to have the instant messaging application on a mobile phone send a text message to Zhang San, the user can say "Open the instant messaging application and send to Zhang San: it is raining today"; or, to control the temperature of an air conditioner, the user can say "Adjust the temperature to 25 degrees" to the air conditioner.

The terminal collects the corresponding voice data through the audio collection device from the moment the user starts speaking until the detected volume falls below a preset threshold; the voice data obtained in this way is the voice data to be recognized. When the user speaks for longer, the terminal can instead obtain multiple segments of voice data of a preset duration from the moment the user starts speaking, and each segment can be treated as voice data to be recognized.
In step 102, the phoneme data corresponding to the voice data to be recognized is determined.

Here, the phoneme data is data composed of identifiers used to represent pronunciation; for example, for Chinese, the phoneme data is the pinyin corresponding to the Chinese characters. The phoneme data may include one or more phoneme units, each phoneme unit corresponds to one character, and each phoneme unit may be composed of one or more pronunciation identifiers. For Chinese, the pronunciation identifiers are the initials and finals of pinyin; for example, the phoneme unit corresponding to the character 我 ("I") is wǒ. As another example, the phoneme unit corresponding to か in Japanese is "ka".

Optionally, in order to convert the voice data into phoneme data with higher accuracy and efficiency, a machine-trained model may be used for the conversion. Accordingly, the processing in step 102 may be as follows: based on a pre-trained speech acoustic model, the phoneme data corresponding to the voice data is determined.

Here, the speech acoustic model is a model constructed based on a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network).

In implementation, the speech acoustic model is trained in advance. A technician can obtain voice data and its corresponding phoneme data from an existing database, or obtain voice data from the Internet and then derive the phoneme data manually from the voice data. The acquired voice data serves as sample voice data and is used as sample input data; the phoneme data corresponding to the voice data serves as sample phoneme data and is used as sample output data. One piece of sample voice data and its corresponding sample phoneme data form one training sample, and the initial speech acoustic model is trained on a large number of such training samples to obtain the required speech acoustic model. Because the amount of sample data is large and training places high demands on a device's computing and storage performance, the training can be performed in a server.

The process of determining the phoneme data corresponding to the voice data may be performed in the terminal. Taking the voice data generated by the user's real-time speech as the input of the speech acoustic model as an example: when the user speaks a sentence to the terminal, the terminal acquires the voice data and inputs it into the speech acoustic model; the CNN in the speech acoustic model performs feature extraction on the voice data to obtain the feature vector corresponding to the voice data, and the feature vector is then processed by the RNN to obtain the corresponding phoneme data.
In step 103, the phoneme data is input into a pre-trained phoneme text conversion model to obtain the text data corresponding to the voice data.

Here, the phoneme text conversion model is a machine-trained model.

In this embodiment, FIG. 2 is a schematic diagram of the speech acoustic model and the phoneme text conversion model. As shown, the phoneme data that the terminal obtains by converting the voice data through the speech acoustic model is input into the pre-trained phoneme text conversion model to obtain the text data corresponding to the phoneme data. The terminal can display the obtained text data; for example, in an instant messaging application, the user can input text through voice, and after the terminal displays the text data, the user can also edit the displayed text data. Alternatively, the terminal can perform a corresponding operation based on the text data; for example, in a smartphone's voice assistant, the user can speak the operation instruction "Call Li Si", and the voice assistant displays "Call Li Si" on the interface and performs the operation.

Optionally, before the phoneme text conversion model is used to convert phoneme data, it may be trained in advance. Accordingly, the processing may be as follows: obtain sample phoneme data and corresponding sample text data; use the sample phoneme data as sample input data and the sample text data as sample output data to train an initial phoneme text conversion model, obtaining the phoneme text conversion model.

In implementation, because nearly six million training samples are used to train the initial phoneme text conversion model, the training process can be performed in the server. To obtain the sample phoneme data and the corresponding sample text data, a technician can first obtain text data from the Internet or an existing database as the sample text data. For each piece of sample text data obtained, the corresponding phoneme data can be obtained by querying a pronunciation dictionary and used as the sample phoneme data. The sample phoneme data is used as sample input data and the corresponding sample text data as sample output data, forming a training sample. This solution can use the back-propagation algorithm as the preset training algorithm to train the initial phoneme text conversion model: the sample input data is input into the initial phoneme text conversion model to obtain output data, and the server then determines the adjustment value of each parameter to be adjusted in the model based on the output data, the sample output data, and the preset training algorithm, and adjusts the corresponding parameters. Each training sample is processed according to this procedure to obtain the final phoneme text conversion model.
Optionally, the phoneme-to-Chinese-character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model. Correspondingly, as shown in FIG. 3, the processing in step 103 may be as follows:
Step 1031: input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data.
Step 1032: input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
Step 1033: input the second feature code into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data; set the character sequence number i corresponding to the phoneme data equal to 1.
Step 1034: input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit.
Step 1035: input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit.
Step 1036: if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to step 1034.
Step 1037: if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
The encoder model and the decoder model above may both be CNNs.
In implementation, the phoneme data obtained from the speech acoustic model is encoded in one-hot form to obtain a corresponding input sequence, which is then input into the encoder model. In the encoder model, an embedding operation maps the input sequence to a unified dimension, so that the relationships among the elements of the input sequence are represented more effectively. Residual connections are used between the convolutional layers of the CNN in the encoder model, so a linear mapping is applied before the encoder model's output to change the vector dimension. The encoder model then outputs the first feature code corresponding to the phoneme data, which may take the form of a feature vector. The first feature code is input into the attention mechanism model to obtain the second feature code corresponding to the phoneme data, which may likewise be a feature vector. Then, the second feature code is input into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data; this feature code may also take the form of a feature vector.
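As an illustrative sketch only, an encoder of the kind just described (embedding of the one-hot input to a unified dimension, residually connected convolutional layers, and a final linear mapping that changes the vector dimension) might look like the following; all layer sizes and the ReLU activation are assumptions, not details taken from the present application.

```python
# A minimal sketch of the described encoder; dimensions are hypothetical.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, num_phonemes, emb_dim=256, layers=4, out_dim=512):
        super().__init__()
        # Embedding maps the one-hot input sequence to a unified dimension.
        self.embed = nn.Embedding(num_phonemes, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
             for _ in range(layers)]
        )
        # Linear mapping changes the vector dimension before the output.
        self.out = nn.Linear(emb_dim, out_dim)

    def forward(self, phoneme_ids):         # (batch, time) integer ids
        h = self.embed(phoneme_ids)         # (batch, time, emb_dim)
        h = h.transpose(1, 2)               # (batch, emb_dim, time) for Conv1d
        for conv in self.convs:
            h = h + torch.relu(conv(h))     # residual connection per layer
        h = h.transpose(1, 2)               # back to (batch, time, emb_dim)
        return self.out(h)                  # the first feature code
```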
Then, the first feature code obtained from the encoder model and the feature code of the character corresponding to the first phoneme unit of the phoneme data are input into the attention mechanism model to obtain the fused feature code of the character corresponding to the first phoneme unit. This fused feature code is then input into the spatial search model, yielding the character corresponding to the first phoneme unit.
Next, the second feature code and the character corresponding to the first phoneme unit are input into the decoder model, yielding the feature code of the character corresponding to the second phoneme unit. The decoder likewise performs an embedding operation, uses residual connections, and applies a linear mapping before its output. The first feature code obtained from the encoder model and the feature code of the character corresponding to the second phoneme unit obtained from the decoder model are then input into the attention mechanism model to obtain the fused feature code of the character corresponding to the second phoneme unit; this fused feature code is input into the spatial search model to obtain the character corresponding to the second phoneme unit. The same operations are performed for the third phoneme unit, the fourth phoneme unit, and so on, and are not repeated here. The loop runs until the character corresponding to the last phoneme unit of the phoneme data has been output. Finally, the obtained characters are sorted according to the order of their corresponding phoneme units in the phoneme data and combined, yielding the text data corresponding to the phoneme data.
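Putting steps 1031 through 1037 together, the inference loop might be sketched as follows. The four sub-models are assumed to be simple callables, and one output character per phoneme unit is assumed; these interfaces, and the search actually performed by the spatial search model (a greedy argmax stands in for it here), are hypothetical.

```python
# A minimal sketch of the decoding loop of steps 1031-1037; the callables
# encoder, attention, decoder and search are hypothetical interfaces.
def phonemes_to_text(phoneme_units, encoder, attention, decoder, search):
    first_code = encoder(phoneme_units)                 # step 1031
    second_code = attention(first_code)                 # step 1032
    char_code = decoder(second_code)                    # step 1033 (1st unit)
    chars = []
    for i, _ in enumerate(phoneme_units):               # i-th phoneme unit
        fused_code = attention(first_code, char_code)   # step 1034
        chars.append(search(fused_code))                # step 1035
        if i + 1 < len(phoneme_units):                  # step 1036
            char_code = decoder(second_code, prev_char=chars[-1])
    return "".join(chars)                               # step 1037
```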
Note that, unlike when the phoneme-to-text conversion model is used for inference, during training it is not the character predicted by the model for the previous phoneme unit that is input into the decoder model, but the correct character corresponding to the previous phoneme unit.
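This training-time substitution is commonly known as teacher forcing; a hedged sketch of the difference follows, reusing the hypothetical decoder interface from the previous sketch.

```python
# Sketch: choosing the decoder's previous-character input.
def previous_char(training, correct_chars, predicted_chars, i):
    if training:
        # Training: feed the correct character for the previous unit.
        return correct_chars[i - 1]
    # Inference: feed the character the model itself just predicted.
    return predicted_chars[i - 1]
```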
As described above, in the embodiments of the present application the speech recognition process is divided into two parts: the speech data is first converted into phoneme data, and the phoneme data is then converted into text data using the phoneme-to-text conversion model. Compared with converting speech data directly into text data, this approach reduces the span of each conversion; the conversion from speech data to phoneme data and the conversion from phoneme data to text data can each be performed with higher accuracy. The solution therefore recognizes speech data more accurately.
The speech recognition method described in this embodiment is implemented with a convolution-based sequence-learning model, which also improves the speed of converting speech data into text data.
Based on the same technical concept, an embodiment of the present application further provides a speech recognition apparatus, which may be the terminal in the foregoing embodiments. As shown in FIG. 4, the apparatus includes an obtaining module 401, a determining module 402, a conversion module 403, and a training module 404.
The obtaining module 401 is configured to obtain speech data to be recognized.
The determining module 402 is configured to determine phoneme data corresponding to the speech data.
The conversion module 403 is configured to input the phoneme data into a pre-trained phoneme-to-text conversion model to obtain text data corresponding to the speech data.
Optionally, the apparatus further includes the training module 404, configured to:
obtain sample phoneme data and corresponding sample text data; and
use the sample phoneme data as sample input data and the sample text data as sample output data, and train an initial phoneme-to-text conversion model to obtain the phoneme-to-text conversion model.
Optionally, the determining module 402 is configured to:
determine the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
Optionally, the phoneme-to-Chinese-character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
the conversion module 403 is configured to:
input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;
input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;
input the second feature code into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data;
set the character sequence number i corresponding to the phoneme data equal to 1;
input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit;
input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit;
if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the obtained characters corresponding to each phoneme unit, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
Optionally, the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
With regard to the apparatus in the foregoing embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
It should be noted that when the speech recognition apparatus provided in the foregoing embodiments performs speech recognition, the division into the above functional modules is merely illustrative. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech recognition apparatus and the speech recognition method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.
FIG. 5 is a structural block diagram of a terminal provided by an embodiment of the present application. The terminal 500 may be a portable mobile terminal, such as a smartphone or a tablet computer. The terminal 500 may also be referred to by other names, such as user equipment or portable terminal.
Generally, the terminal 500 includes a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor processes data in the awake state and is also called a CPU (Central Processing Unit), while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 501 may further include an AI (Artificial Intelligence) processor configured to handle computing operations related to machine learning.
The memory 502 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 502 stores at least one instruction, which is executed by the processor 501 to implement the speech recognition method provided in the present application.
In some embodiments, the terminal 500 may optionally further include a peripheral device interface 503 and at least one peripheral device. Specifically, the peripheral device includes at least one of a radio frequency circuit 504, a touch display screen 505, an audio circuit 506, and a power supply 507.
The peripheral device interface 503 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 501 and the memory 502. In some embodiments, the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral device interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 504 is configured to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission, or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 504 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication)-related circuits, which is not limited in the present application.
The touch display screen 505 is configured to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. The touch display screen 505 can also collect touch signals on or above its surface; such a touch signal may be input to the processor 501 as a control signal for processing. The touch display screen 505 provides virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one touch display screen 505, disposed on the front panel of the terminal 500; in other embodiments, there may be at least two touch display screens 505, disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the touch display screen 505 may be a flexible display screen disposed on a curved or folded surface of the terminal 500. The touch display screen 505 may even be given a non-rectangular, irregular shape, that is, a special-shaped screen. The touch display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The audio circuit 506 is configured to provide an audio interface between the user and the terminal 500. The audio circuit 506 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 501 for processing or to the radio frequency circuit 504 for voice communication. For stereo collection or noise reduction, there may be multiple microphones, disposed at different parts of the terminal 500. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker converts electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 506 may further include a headphone jack.
The power supply 507 is configured to supply power to the components of the terminal 500. The power supply 507 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 507 includes a rechargeable battery, the battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
Those skilled in the art will understand that the structure shown in FIG. 5 does not limit the terminal 500, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
In an exemplary embodiment, a computer-readable storage medium is further provided. The storage medium stores at least one instruction, which is loaded and executed by a processor to implement the speech recognition method in the foregoing embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 600 may vary greatly in configuration or performance and may include one or more processors (central processing units, CPUs) 601 and one or more memories 602, where the memory 602 stores at least one instruction, which is loaded and executed by the processor 601 to implement the speech recognition method described above.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (13)

  1. A speech recognition method, characterized in that the method comprises:
    obtaining speech data to be recognized;
    determining phoneme data corresponding to the speech data;
    inputting the phoneme data into a pre-trained phoneme-to-text conversion model to obtain text data corresponding to the speech data.
  2. The method according to claim 1, characterized in that the method further comprises:
    obtaining sample phoneme data and corresponding sample text data;
    using the sample phoneme data as sample input data and the sample text data as sample output data, and training an initial phoneme-to-text conversion model to obtain the phoneme-to-text conversion model.
  3. The method according to claim 1, characterized in that determining the phoneme data corresponding to the speech data comprises:
    determining the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
  4. The method according to claim 1, characterized in that the phoneme-to-Chinese-character conversion model comprises an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
    inputting the phoneme data into the pre-trained phoneme-to-Chinese-character conversion model to obtain the Chinese-character text corresponding to the speech data comprises:
    inputting the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;
    inputting the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;
    inputting the second feature code into the decoder model to obtain a feature code of the character corresponding to the first phoneme unit in the phoneme data;
    setting the character sequence number i corresponding to the phoneme data equal to 1;
    inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain a fused feature code of the character corresponding to the i-th phoneme unit;
    inputting the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit;
    if the i-th phoneme unit is not the last phoneme unit in the phoneme data, inputting the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, incrementing i by 1, and returning to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
    if the i-th phoneme unit is the last phoneme unit in the phoneme data, combining the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
  5. The method according to claim 4, characterized in that the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
  6. A speech recognition apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain speech data to be recognized;
    a determining module, configured to determine phoneme data corresponding to the speech data;
    a conversion module, configured to input the phoneme data into a pre-trained phoneme-to-text conversion model to obtain text data corresponding to the speech data.
  7. The apparatus according to claim 6, characterized in that the apparatus further comprises a training module, configured to:
    obtain sample phoneme data and corresponding sample text data;
    use the sample phoneme data as sample input data and the sample text data as sample output data, and train an initial phoneme-to-text conversion model to obtain the phoneme-to-text conversion model.
  8. The apparatus according to claim 6, characterized in that the determining module is configured to:
    determine the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
  9. The apparatus according to claim 6, characterized in that the phoneme-to-Chinese-character conversion model comprises an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
    the conversion module is configured to:
    input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data;
    input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data;
    input the second feature code into the decoder model to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data;
    set the character sequence number i corresponding to the phoneme data equal to 1;
    input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit;
    input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model to obtain the character corresponding to the i-th phoneme unit;
    if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
    if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
  10. The apparatus according to claim 9, characterized in that the encoder model is a convolutional neural network (CNN), and the decoder model is a convolutional neural network (CNN).
  11. A terminal, characterized in that the terminal comprises a processor, a memory, an audio acquisition device, and a display, wherein:
    the audio acquisition device is configured to obtain speech data to be recognized;
    the processor is configured to determine phoneme data corresponding to the speech data, and to input the phoneme data into a pre-trained phoneme-to-text conversion model stored in the memory to obtain text data corresponding to the speech data;
    the display is configured to display the text data.
  12. The terminal according to claim 11, characterized in that the processor is configured to:
    determine the phoneme data corresponding to the speech data based on a pre-trained speech acoustic model.
  13. The terminal according to claim 11, characterized in that the phoneme-to-Chinese-character conversion model comprises an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
    the processor is configured to:
    input the phoneme data into the encoder model stored in the memory to obtain a first feature code corresponding to the phoneme data;
    input the first feature code into the attention mechanism model stored in the memory to obtain a second feature code corresponding to the phoneme data;
    input the second feature code into the decoder model stored in the memory to obtain the feature code of the character corresponding to the first phoneme unit in the phoneme data;
    set the character sequence number i corresponding to the phoneme data equal to 1;
    input the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fused feature code of the character corresponding to the i-th phoneme unit;
    input the fused feature code of the character corresponding to the i-th phoneme unit into the spatial search model stored in the memory to obtain the character corresponding to the i-th phoneme unit;
    if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the character corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the character corresponding to the (i+1)-th phoneme unit in the phoneme data, increment i by 1, and return to the processing step of inputting the first feature code and the feature code of the character corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model;
    if the i-th phoneme unit is the last phoneme unit in the phoneme data, combine the characters corresponding to each phoneme unit obtained through the spatial search model, in the order of the corresponding phoneme units in the phoneme data, to obtain the text data corresponding to the speech data.
PCT/CN2019/106909 2018-09-20 2019-09-20 Voice recognition method and apparatus WO2020057624A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811099967.4A CN110931000B (en) 2018-09-20 2018-09-20 Method and device for speech recognition
CN201811099967.4 2018-09-20

Publications (1)

Publication Number Publication Date
WO2020057624A1 true WO2020057624A1 (en) 2020-03-26

Family

ID=69856142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106909 WO2020057624A1 (en) 2018-09-20 2019-09-20 Voice recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN110931000B (en)
WO (1) WO2020057624A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986653A (en) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment
CN113160820B (en) * 2021-04-28 2024-02-27 百度在线网络技术(北京)有限公司 Speech recognition method, training method, device and equipment of speech recognition model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292538A1 (en) * 2008-05-20 2009-11-26 Calabrio, Inc. Systems and methods of improving automated speech recognition accuracy using statistical analysis of search terms
CN104637482A (en) * 2015-01-19 2015-05-20 孔繁泽 Voice recognition method, device, system and language switching system
CN106021249A (en) * 2015-09-16 2016-10-12 展视网(北京)科技有限公司 Method and system for voice file retrieval based on content
CN107731228A (en) * 2017-09-20 2018-02-23 百度在线网络技术(北京)有限公司 The text conversion method and device of English voice messaging
CN108417202A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Audio recognition method and system
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9159317B2 (en) * 2013-06-14 2015-10-13 Mitsubishi Electric Research Laboratories, Inc. System and method for recognizing speech
KR20180080446A (en) * 2017-01-04 2018-07-12 삼성전자주식회사 Voice recognizing method and voice recognizing appratus
CN108170686B (en) * 2017-12-29 2020-02-14 科大讯飞股份有限公司 Text translation method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507958A (en) * 2020-12-22 2021-03-16 成都东方天呈智能科技有限公司 System and method for converting feature codes of different face recognition models and readable storage medium
CN112507958B (en) * 2020-12-22 2024-04-02 成都东方天呈智能科技有限公司 Conversion system of different face recognition model feature codes and readable storage medium
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN113838456A (en) * 2021-09-28 2021-12-24 科大讯飞股份有限公司 Phoneme extraction method, voice recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110931000B (en) 2022-08-02
CN110931000A (en) 2020-03-27

Similar Documents

Publication Publication Date Title
WO2020057624A1 (en) Voice recognition method and apparatus
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
WO2021022992A1 (en) Dialog generation model training method and device, and dialog generation method and device, and medium
CN110634507A (en) Speech classification of audio for voice wakeup
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110890093A (en) Intelligent device awakening method and device based on artificial intelligence
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
CN110263131B (en) Reply information generation method, device and storage medium
CN110364156A (en) Voice interactive method, system, terminal and readable storage medium storing program for executing
CN109308900B (en) Earphone device, voice processing system and voice processing method
CN110225386A (en) A kind of display control method, display equipment
CN112912955B (en) Electronic device and system for providing speech recognition based services
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN114360510A (en) Voice recognition method and related device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
JP6448950B2 (en) Spoken dialogue apparatus and electronic device
CN110111795B (en) Voice processing method and terminal equipment
WO2021147417A1 (en) Voice recognition method and apparatus, computer device, and computer-readable storage medium
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN115841814A (en) Voice interaction method and electronic equipment
CN117012202B (en) Voice channel recognition method and device, storage medium and electronic equipment
CN113823278B (en) Speech recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1
