WO2020057624A1 - Method and apparatus for speech recognition - Google Patents

Method and apparatus for speech recognition

Info

Publication number
WO2020057624A1
Authority
WO
WIPO (PCT)
Prior art keywords
phoneme
data
model
text
feature code
Prior art date
Application number
PCT/CN2019/106909
Other languages
English (en)
Chinese (zh)
Inventor
郝婧
陈凯
谢迪
浦世亮
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2020057624A1

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Definitions

  • the present invention relates to the technical field of speech recognition, and in particular, to a method and a device for speech recognition.
  • in the related art, speech recognition processing mainly uses a speech recognition model, obtained through training and learning, to directly convert speech data into text data.
  • embodiments of the present application provide a method and a device for speech recognition.
  • an embodiment of the present application provides a method for speech recognition, where the method includes:
  • the phoneme data is input into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.
  • the method further includes:
  • the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining phoneme data corresponding to the voice data includes:
  • phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the text corresponding to each phoneme unit obtained by the spatial search model is sorted according to the order of the corresponding phoneme units in the phoneme data and combined to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network CNN
  • the decoder model is a convolutional neural network CNN
  • an embodiment of the present application provides a device for voice recognition, where the device includes:
  • a determining module configured to determine phoneme data corresponding to the voice data
  • a conversion module is configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the speech data.
  • the device further includes a training module for:
  • the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining module is configured to:
  • phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the conversion module is configured to:
  • the text corresponding to each phoneme unit obtained by the spatial search model is sorted according to the order of the corresponding phoneme units in the phoneme data and combined to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network CNN
  • the decoder model is a convolutional neural network CNN
  • an embodiment of the present application provides a terminal.
  • the terminal includes a processor and a memory.
  • the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the speech recognition method described above.
  • an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method as described in the first aspect above.
  • the speech data recognition process is divided into two parts. First, the speech data is converted into phoneme data, and then the phoneme text conversion model is used to convert the phoneme data into text data. Compared with directly converting speech data to text data, this conversion method reduces the span of data conversion. The conversion of speech data to phoneme data and the conversion of phoneme data to text data have higher accuracy. Therefore, this scheme has a higher accuracy for speech data recognition.
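  • As an informal illustration only (not part of the patent disclosure), the two-stage pipeline described above can be sketched in Python as follows; the names acoustic_model and phoneme_to_text_model are hypothetical placeholders for the two pre-trained models:

        # Illustrative sketch of the two-stage recognition pipeline (hypothetical model objects).
        def recognize(voice_data, acoustic_model, phoneme_to_text_model):
            # Stage 1: speech data -> phoneme data (e.g. pinyin units such as ["ni3", "hao3"])
            phoneme_data = acoustic_model.predict(voice_data)
            # Stage 2: phoneme data -> text data via the pre-trained phoneme text conversion model
            text_data = phoneme_to_text_model.predict(phoneme_data)
            return text_data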
  • FIG. 1 is a flowchart of a speech recognition method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model provided by an embodiment of the present application;
  • FIG. 3 is a flowchart of a voice recognition method according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
  • An embodiment of the present application provides a method for speech recognition.
  • the method may be implemented by a computer device, and the computer device may be a terminal or a server.
  • the method may be implemented by a terminal, and the terminal may be a device with an audio collection function, such as a mobile phone, a computer, an air conditioner, or a television.
  • the terminal may directly perform voice recognition based on the method.
  • the method can also be implemented by a server, that is, the collected voice data is sent to the server at a terminal with an audio collection function, and the server performs voice recognition based on the method.
  • the following exemplifies a scenario to which the method can be applied.
  • the device implementing the method in this scenario can be a mobile phone with an audio collection function, or a server.
  • An instant messaging application can be installed in a mobile phone with audio capture function, and a voice recognition option can be provided in the dialogue interface of the instant messaging application.
  • the mobile phone can use the speech recognition method provided in the embodiment of the present application to convert the sentence spoken by the user into text and display the text in the text input box of the dialogue interface.
  • the mobile phone can also send the sentence spoken by the user to the server; the server converts the sentence into text through the speech recognition method provided in the embodiment of the present application and returns the text to the mobile phone, which displays it in the text input box of the dialogue interface.
  • the processing flow of the method may include the following steps:
  • In step 101, voice data to be recognized is acquired.
  • an audio collection device for collecting voice data, such as a microphone, may be installed on the terminal.
  • When the user wants to send text to others through the terminal, he can first turn on the terminal's voice recognition function and then speak the content to be sent toward the terminal's audio collection device, and the terminal can obtain the corresponding voice data through the audio collection device.
  • For example, if a user wants to send the text "What are you doing" to a friend in the terminal's instant messaging application, he can turn on the terminal's voice recognition function and say "What are you doing" to the audio collection device. The terminal can then obtain the voice data corresponding to "What are you doing" through the audio collection device.
  • When the user wants to control the terminal by voice, he can first enable the terminal's voice recognition function and then speak the corresponding control words to the terminal. For example, when the user wants the instant messaging application in the mobile phone to send a text message to Zhang San, he can say "open the instant messaging application and send to Zhang San: it is raining today" to the mobile phone; or, when the user wants to control the temperature of the air conditioner, he can say "adjust the temperature to 25 degrees" to the air conditioner.
  • The terminal collects the corresponding voice data through the audio collection device. The voice data obtained from the moment the user starts speaking until the detected volume falls below a preset threshold is the voice data to be recognized. When the user speaks more content, the corresponding duration is longer; in that case, starting from the moment the user begins speaking, the terminal may obtain multiple pieces of voice data according to a preset duration, and each piece can be used as voice data to be recognized.
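  • A minimal sketch of this collection logic, assuming mono PCM samples in a NumPy array and a simple energy threshold (all names and values here are illustrative, not taken from the patent):

        import numpy as np

        def collect_utterance(samples, frame_len=1600, volume_threshold=0.01):
            # Keep frames from the moment the user starts speaking until the detected
            # volume (RMS energy) falls below the preset threshold.
            collected, speaking = [], False
            for start in range(0, len(samples), frame_len):
                frame = samples[start:start + frame_len].astype(float)
                loud = np.sqrt(np.mean(frame ** 2)) >= volume_threshold
                if loud:
                    speaking = True          # the user has started speaking
                elif speaking:
                    break                    # volume dropped below the threshold: stop collecting
                if speaking:
                    collected.append(frame)
            return np.concatenate(collected) if collected else np.array([])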
  • In step 102, phoneme data corresponding to the speech data to be recognized is determined.
  • the phoneme data is data used to indicate the composition of the pronunciation.
  • the phoneme data is the pinyin corresponding to the Chinese characters.
  • the phoneme data can include one or more phoneme units, each phoneme unit corresponds to a word, and each phoneme unit can be composed of one or more pronunciation identifiers.
  • For Chinese, a pronunciation identifier may be the initial or the final (vowel) in a pinyin syllable; for example, the phoneme unit corresponding to "I" (我) is "wǒ".
  • The phoneme unit corresponding to "か" in Japanese is "ka".
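  • For illustration, phoneme data for the Chinese phrase 你好 ("hello") might be represented in Python as follows; the tone-numbered pinyin notation and the dictionary layout are assumptions, not a format prescribed by the patent:

        # One phoneme unit per character; each unit is made of one or more pronunciation identifiers.
        phoneme_data = [
            {"unit": "ni3",  "identifiers": ["n", "i3"]},   # 你
            {"unit": "hao3", "identifiers": ["h", "ao3"]},  # 好
        ]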
  • A machine-trained model may be used for the conversion. Accordingly, the processing in step 102 may be as follows: phoneme data corresponding to the speech data is determined based on a pre-trained speech acoustic model.
  • the speech acoustic model is a model constructed based on CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network).
  • the speech acoustic model needs to be trained in advance.
  • a technician can obtain the voice data and its corresponding phoneme data from an existing database, or obtain the voice data from the Internet, and then obtain the phoneme data manually based on the voice data.
  • the acquired voice data is sample voice data, which is used as sample input data.
  • The phoneme data corresponding to the voice data is sample phoneme data, which is used as sample output data.
  • One piece of sample voice data and its corresponding sample phoneme data can be used as a set of training samples.
  • The initial speech acoustic model is trained with a large number of such training samples to obtain the required speech acoustic model. Because the amount of sample data is large and the training places high demands on the computing and storage performance of the device, the training can be performed in the server.
  • the process of determining the phoneme data corresponding to the voice data may be performed in the terminal.
  • the CNN performs feature extraction on the voice data to obtain the feature vector corresponding to the voice data.
  • the feature vector is then processed by the RNN to obtain the corresponding phoneme data.
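  • A minimal sketch of such a CNN-plus-RNN speech acoustic model, assuming PyTorch and illustrative layer sizes (none of which are specified by the patent):

        import torch
        import torch.nn as nn

        class SpeechAcousticModel(nn.Module):
            """CNN extracts features from the acoustic frames; an RNN maps them to phoneme scores."""

            def __init__(self, feat_dim=80, hidden=256, num_phonemes=1000):
                super().__init__()
                # CNN part: 1-D convolutions over the time axis of the input features
                self.cnn = nn.Sequential(
                    nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
                    nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
                )
                # RNN part: processes the extracted feature vectors frame by frame
                self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, num_phonemes)

            def forward(self, feats):                                # feats: (batch, time, feat_dim)
                x = self.cnn(feats.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
                x, _ = self.rnn(x)
                return self.out(x)                                   # per-frame phoneme scores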
  • In step 103, the phoneme data is input into a pre-trained phoneme text conversion model to obtain text data corresponding to the speech data.
  • the phoneme text conversion model is a machine training model.
  • FIG. 2 is a schematic diagram of a speech acoustic model and a phoneme text conversion model in the embodiment of the present application.
  • The phoneme data obtained by the terminal by converting the speech data through the speech acoustic model is input to the pre-trained phoneme text conversion model to obtain the text data corresponding to the phoneme data.
  • the terminal can display the obtained text data.
  • the user can input text through voice.
  • the terminal displays the text data
  • the user can also edit the displayed text data.
  • the terminal can perform corresponding operations based on the text data.
  • For a voice assistant application, for example, the user can speak an operation instruction to it, such as "call Li Si"; the voice assistant then displays "call Li Si" on the interface and performs the operation.
  • The phoneme text conversion model may be trained in advance. Accordingly, the processing may be as follows: sample phoneme data and corresponding sample text data are obtained; the sample phoneme data is used as sample input data and the sample text data is used as sample output data to train an initial phoneme text conversion model, so as to obtain the phoneme text conversion model.
  • this training process can be performed in the server.
  • a technician can first obtain the text data from the Internet or an existing database as the sample text data.
  • the corresponding phoneme data can be obtained by querying the pronunciation dictionary as the sample phoneme data.
  • the sample phoneme data is used as sample input data, and the sample text data corresponding to the sample phoneme data is used as sample output data to form a training sample.
  • This solution can use the back-propagation algorithm as a preset training algorithm to train the initial phoneme text conversion model.
  • The sample input data is input into the initial phoneme text conversion model to obtain output data; the server then determines the adjustment value of each parameter to be adjusted in the model based on the output data, the sample output data, and the preset training algorithm, and adjusts the parameters accordingly. Each training sample is processed according to this procedure to obtain the final phoneme text conversion model.
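  • A back-propagation training loop of this kind might look as follows in PyTorch; this is a sketch only, and the cross-entropy loss, the Adam optimizer, the call signature model(phoneme_ids, text_ids[:, :-1]) and the pre-padded tensor batches are all assumptions rather than details from the patent:

        import torch
        import torch.nn as nn

        def train_phoneme_text_model(model, training_samples, epochs=10, lr=1e-3):
            optimizer = torch.optim.Adam(model.parameters(), lr=lr)
            criterion = nn.CrossEntropyLoss()
            for _ in range(epochs):
                for phoneme_ids, text_ids in training_samples:      # sample input / sample output pair
                    logits = model(phoneme_ids, text_ids[:, :-1])   # predict each character in turn
                    loss = criterion(logits.reshape(-1, logits.size(-1)),
                                     text_ids[:, 1:].reshape(-1))
                    optimizer.zero_grad()
                    loss.backward()    # back-propagation determines the adjustment of each parameter
                    optimizer.step()   # adjust the parameters
            return model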
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model.
  • the processing in step 103 may be as follows:
  • Step 1031: input the phoneme data into the encoder model to obtain a first feature code corresponding to the phoneme data.
  • Step 1032: input the first feature code into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
  • Step 1033: input the second feature code into the decoder model to obtain the feature code of the text corresponding to the first phoneme unit in the phoneme data, and set the sequence number i of the text corresponding to the phoneme data to 1.
  • Step 1034: input the first feature code and the feature code of the text corresponding to the i-th phoneme unit in the phoneme data into the attention mechanism model to obtain the fusion feature code of the text corresponding to the i-th phoneme unit.
  • Step 1035: input the fusion feature code of the text corresponding to the i-th phoneme unit into the spatial search model to obtain the text corresponding to the i-th phoneme unit.
  • Step 1036: if the i-th phoneme unit is not the last phoneme unit in the phoneme data, input the text corresponding to the i-th phoneme unit and the second feature code into the decoder model to obtain the feature code of the text corresponding to the (i+1)-th phoneme unit in the phoneme data, increase the value of i by 1, and return to step 1034.
  • Step 1037: if the i-th phoneme unit is the last phoneme unit in the phoneme data, sort the text corresponding to each phoneme unit obtained by the spatial search model according to the order of the corresponding phoneme units in the phoneme data and combine them to obtain the text data corresponding to the voice data. (A sketch of this loop in code follows below.)
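  • The loop in steps 1031 to 1037 can be written schematically in Python as below; the four sub-models are assumed to be callables with the interfaces implied by the steps, so this is an illustration rather than the patent's implementation:

        def phoneme_to_text(phoneme_data, encoder, attention, decoder, search):
            first_code = encoder(phoneme_data)                  # step 1031: first feature code
            second_code = attention(first_code)                 # step 1032: second feature code
            char_code = decoder(second_code)                    # step 1033: feature code of the first character
            texts = []
            for i in range(len(phoneme_data)):                  # i-th phoneme unit (0-based here)
                fused_code = attention(first_code, char_code)   # step 1034: fusion feature code
                texts.append(search(fused_code))                # step 1035: text for the i-th unit
                if i + 1 < len(phoneme_data):                   # step 1036: not the last unit yet
                    char_code = decoder(second_code, texts[-1])
            return "".join(texts)                               # step 1037: combine in phoneme order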
  • the above-mentioned encoder model and decoder model can both adopt CNN.
  • the phoneme data obtained from the above-mentioned speech acoustic model is coded in the form of One-Hot coding, and the corresponding input sequence is obtained.
  • the input sequence is then input into the encoder model.
  • The embedding operation maps the input sequence to a unified dimension, so that the relationships between the elements in the input sequence are represented more effectively.
  • Each convolutional layer in the CNN uses residual connections, so it is necessary to perform a linear mapping to change the vector dimension before the output of the encoder model.
  • the encoder model then outputs a first feature code corresponding to the phoneme data, which may be in the form of a feature vector.
  • the first feature code is input into the attention mechanism model to obtain a second feature code corresponding to the phoneme data.
  • the second feature code may also be in the form of a feature vector.
  • the second feature code is input into the decoder model to obtain the feature code of the text corresponding to the first phoneme unit in the phoneme data; this feature code may also be in the form of a feature vector.
  • The first feature code obtained from the encoder model and the feature code of the text corresponding to the first phoneme unit of the phoneme data are input into the attention mechanism model to obtain the fusion feature code of the text corresponding to the first phoneme unit.
  • Then, the fusion feature code of the text corresponding to the first phoneme unit is input to the spatial search model, and the text corresponding to the first phoneme unit can be obtained.
  • The second feature code and the text corresponding to the first phoneme unit are input into the decoder model, and the feature code of the text corresponding to the second phoneme unit can be obtained.
  • The decoder also performs the embedding operation, uses residual connections, and performs a linear mapping operation before output.
  • The first feature code obtained from the encoder model and the feature code of the text corresponding to the second phoneme unit obtained from the decoder model are input into the attention mechanism model to obtain the fusion feature code of the text corresponding to the second phoneme unit.
  • the fusion feature code of the text corresponding to the second phoneme unit is input into the spatial search model, and the text corresponding to the second phoneme unit can be obtained.
  • Similar operations are performed for the subsequent third phoneme unit, fourth phoneme unit, and so on; details are not described here again.
  • This loop operation process is performed until the text corresponding to the last phoneme unit of the phoneme data is output.
  • the obtained characters can be sorted according to the order of their respective phoneme units in the phoneme data, and then combined together to obtain the text data corresponding to the phoneme data.
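  • A sketch of the encoder-side structure described above (embedding of the one-hot/ID input sequence, convolutional layers with residual connections, and a final linear mapping), again assuming PyTorch and illustrative dimensions; the decoder would follow a similar pattern:

        import torch
        import torch.nn as nn

        class ConvEncoder(nn.Module):
            def __init__(self, vocab_size, dim=256, num_layers=4, out_dim=256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, dim)       # embedding to a unified dimension
                self.convs = nn.ModuleList(
                    nn.Conv1d(dim, dim, kernel_size=3, padding=1) for _ in range(num_layers)
                )
                self.linear = nn.Linear(dim, out_dim)            # linear mapping before the output

            def forward(self, phoneme_ids):                      # (batch, seq_len) integer IDs
                x = self.embed(phoneme_ids).transpose(1, 2)      # (batch, dim, seq_len)
                for conv in self.convs:
                    x = torch.relu(conv(x)) + x                  # residual connection around each conv layer
                return self.linear(x.transpose(1, 2))            # first feature code, one vector per phoneme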
  • In addition, the text predicted by the phoneme text conversion model for the preceding phoneme unit is not input to the decoder model; instead, the correct text corresponding to the preceding phoneme unit is input to the decoder model.
  • the speech data recognition process is divided into two parts.
  • the speech data is first converted into phoneme data, and then the phoneme data is converted into text data by using a phoneme text conversion model.
  • this conversion method reduces the span of data conversion.
  • the conversion of speech data to phoneme data and the conversion of phoneme data to text data have higher accuracy. Therefore, this scheme has a higher accuracy for speech data recognition.
  • the speech recognition method described in this embodiment is implemented by a convolution-based sequence learning model, and the conversion speed of speech data to text data is also improved.
  • an embodiment of the present application further provides a device for speech recognition.
  • the device may be a terminal in the foregoing embodiment.
  • the device includes an obtaining module 401, a determining module 402, and a conversion module 403.
  • An acquisition module 401 configured to acquire voice data to be identified
  • a determining module 402 configured to determine phoneme data corresponding to the voice data
  • a conversion module 403 is configured to input the phoneme data into a pre-trained phoneme text conversion model to obtain text data corresponding to the voice data.
  • the device further includes a training module 404, configured to:
  • the initial phoneme text conversion model is trained to obtain the phoneme text conversion model.
  • the determining module 402 is configured to:
  • phoneme data corresponding to the speech data is determined.
  • the phoneme-to-Chinese character conversion model includes an encoder model, a decoder model, an attention mechanism model, and a spatial search model;
  • the conversion module 403 is configured to:
  • the obtained text corresponding to each phoneme unit is combined according to the order of the corresponding phoneme units in the phoneme data to obtain the text data corresponding to the voice data.
  • the encoder model is a convolutional neural network CNN
  • the decoder model is a convolutional neural network CNN
  • When the speech recognition device provided in the above embodiment performs speech recognition, the division into the above functional modules is used only as an example.
  • In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the speech recognition apparatus and the speech recognition method embodiments provided by the foregoing embodiments belong to the same concept. For specific implementation processes, refer to the method embodiments, and details are not described herein again.
  • FIG. 5 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • the terminal 500 may be a portable mobile terminal, such as a smart phone or a tablet computer.
  • the terminal 500 may also be called other names such as user equipment, portable terminal, and the like.
  • the terminal 500 includes a processor 501 and a memory 502.
  • the processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • The processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 501 may also include a main processor and a co-processor.
  • the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit).
  • the co-processor is a low-power processor for processing data in the standby state.
  • the processor 501 may be integrated with a GPU (Graphics Processing Unit), and the GPU is responsible for rendering and drawing content required to be displayed on the display screen.
  • the processor 501 may further include an AI (Artificial Intelligence) processor, and the AI processor is configured to process computing operations related to machine learning.
  • the memory 502 may include one or more computer-readable storage media, which may be tangible and non-transitory.
  • the memory 502 may also include high-speed random access memory, and non-volatile memory, such as one or more disk storage devices, flash storage devices.
  • non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction that is executed by the processor 501 to implement the speech recognition method provided in this application.
  • the terminal 500 may further include a peripheral device interface 503 and at least one peripheral device.
  • the peripheral device includes at least one of a radio frequency circuit 504, a touch display screen 505, an audio circuit 506, and a power source 507.
  • the peripheral device interface 503 may be used to connect at least one peripheral device related to I / O (Input / Output) to the processor 501 and the memory 502.
  • In some embodiments, the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral device interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 504 is used for receiving and transmitting an RF (Radio Frequency) signal, also called an electromagnetic signal.
  • the radio frequency circuit 504 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 504 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.
  • the radio frequency circuit 504 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocols include, but are not limited to, the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks.
  • the radio frequency circuit 504 may further include circuits related to Near Field Communication (NFC), which is not limited in this application.
  • the touch display screen 505 is used to display a UI (User Interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the touch display screen 505 also has the ability to collect touch signals on or above the surface of the touch display screen 505.
  • the touch signal can be input to the processor 501 as a control signal for processing.
  • the touch display screen 505 is used to provide virtual buttons and / or virtual keyboards, which are also called soft buttons and / or soft keyboards.
  • In some embodiments, there may be one touch display screen 505, provided on the front panel of the terminal 500. In other embodiments, there may be at least two touch display screens 505, respectively provided on different surfaces of the terminal 500 or in a folded design.
  • the touch display screen 505 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 500. Furthermore, the touch display screen 505 can also be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the touch display screen 505 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the audio circuit 506 is used to provide an audio interface between the user and the terminal 500.
  • the audio circuit 506 may include a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 501 for processing, or input them to the radio frequency circuit 504 to implement voice communication.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves.
  • the speaker can be a traditional film speaker or a piezoelectric ceramic speaker.
  • When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but can also convert electrical signals into sound waves inaudible to humans for purposes such as ranging.
  • the audio circuit 506 may further include a headphone jack.
  • the power supply 507 is used to supply power to various components in the terminal 500.
  • the power source 507 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery.
  • the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
  • the wired rechargeable battery is a battery charged through a wired line
  • the wireless rechargeable battery is a battery charged through a wireless coil.
  • the rechargeable battery can also be used to support fast charging technology.
  • The structure shown in FIG. 5 does not constitute a limitation on the terminal 500, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • a computer-readable storage medium stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the speech recognition method in the foregoing embodiments.
  • the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
  • FIG. 6 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • The server 600 may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPUs) 601 and one or more memories 602, where at least one instruction is stored in the memory 602, and the at least one instruction is loaded and executed by the processor 601 to implement the speech recognition method described above.
  • the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read-only memory, a magnetic disk or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech recognition method and apparatus, which relate to the field of speech recognition. The method comprises: acquiring voice data to be recognized (101); determining phoneme data corresponding to the voice data (102); and inputting the phoneme data into a pre-trained phoneme-text conversion model so as to obtain text data corresponding to the voice data (103). The method can improve the accuracy of voice data recognition.
PCT/CN2019/106909 2018-09-20 2019-09-20 Method and apparatus for speech recognition WO2020057624A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811099967.4A CN110931000B (zh) 2018-09-20 2018-09-20 Speech recognition method and apparatus
CN201811099967.4 2018-09-20

Publications (1)

Publication Number Publication Date
WO2020057624A1 true WO2020057624A1 (fr) 2020-03-26

Family

ID=69856142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106909 WO2020057624A1 (fr) 2018-09-20 2019-09-20 Method and apparatus for speech recognition

Country Status (2)

Country Link
CN (1) CN110931000B (fr)
WO (1) WO2020057624A1 (fr)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160820B (zh) * 2021-04-28 2024-02-27 百度在线网络技术(北京)有限公司 Speech recognition method, speech recognition model training method, apparatus and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292538A1 (en) * 2008-05-20 2009-11-26 Calabrio, Inc. Systems and methods of improving automated speech recognition accuracy using statistical analysis of search terms
CN104637482A (zh) * 2015-01-19 2015-05-20 孔繁泽 Speech recognition method, apparatus and system, and language exchange system
CN106021249A (zh) * 2015-09-16 2016-10-12 展视网(北京)科技有限公司 Content-based voice file retrieval method and system
CN107731228A (zh) * 2017-09-20 2018-02-23 百度在线网络技术(北京)有限公司 Text conversion method and apparatus for English speech information
CN108417202A (zh) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Speech recognition method and system
CN108492820A (zh) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on a recurrent neural network language model and a deep neural network acoustic model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9159317B2 (en) * 2013-06-14 2015-10-13 Mitsubishi Electric Research Laboratories, Inc. System and method for recognizing speech
KR20180080446A (ko) * 2017-01-04 2018-07-12 삼성전자주식회사 Speech recognition method and speech recognition apparatus
CN108170686B (zh) * 2017-12-29 2020-02-14 科大讯飞股份有限公司 Text translation method and apparatus


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986653A (zh) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 Speech intention recognition method, apparatus and device
CN112507958A (zh) * 2020-12-22 2021-03-16 成都东方天呈智能科技有限公司 Conversion system and method for feature codes of different face recognition models, and readable storage medium
CN112507958B (zh) * 2020-12-22 2024-04-02 成都东方天呈智能科技有限公司 Conversion system for feature codes of different face recognition models, and readable storage medium
CN113782007A (zh) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Speech recognition method and apparatus, speech recognition device and storage medium
CN113838456A (zh) * 2021-09-28 2021-12-24 科大讯飞股份有限公司 Phoneme extraction method, speech recognition method, apparatus, device and storage medium
CN113838456B (zh) * 2021-09-28 2024-05-31 中国科学技术大学 Phoneme extraction method, speech recognition method, apparatus, device and storage medium
CN113889089A (zh) * 2021-09-29 2022-01-04 北京百度网讯科技有限公司 Method and apparatus for acquiring a speech recognition model, electronic device and storage medium

Also Published As

Publication number Publication date
CN110931000A (zh) 2020-03-27
CN110931000B (zh) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2020057624A1 (fr) Method and apparatus for speech recognition
CN110288077B (zh) Artificial-intelligence-based method for synthesizing a speaking expression, and related apparatus
WO2021093449A1 (fr) Wake-up word detection method and apparatus employing artificial intelligence, device, and medium
WO2021022992A1 (fr) Dialogue generation model training method and device, dialogue generation method and device, and medium
CN110634507A (zh) Speech classification of audio for voice wake-up
EP3824462B1 (fr) Electronic apparatus for processing user utterance and controlling method thereof
CN110890093A (zh) Artificial-intelligence-based smart device wake-up method and apparatus
CN110570840B (zh) Artificial-intelligence-based smart device wake-up method and apparatus
CN108922525B (zh) Voice processing method, apparatus, storage medium and electronic device
CN110364156A (zh) Voice interaction method, system, terminal and readable storage medium
CN112912955B (zh) Electronic device and system providing a speech-recognition-based service
CN109308900B (zh) Earphone device, voice processing system and voice processing method
CN110555329A (zh) Sign language translation method, terminal and storage medium
CN114360510A (zh) Speech recognition method and related apparatus
CN114333774A (zh) Speech recognition method, apparatus, computer device and storage medium
JP6448950B2 (ja) Voice dialogue device and electronic apparatus
CN116978359A (zh) Phoneme recognition method, apparatus, electronic device and storage medium
WO2021147417A1 (fr) Speech recognition apparatus and method, computing device, and computer-readable recording medium
CN114708849A (zh) Voice processing method, apparatus, computer device and computer-readable storage medium
CN112712788A (zh) Speech synthesis method, speech synthesis model training method and apparatus
CN115841814A (zh) Voice interaction method and electronic device
CN117012202B (zh) Voice channel recognition method, apparatus, storage medium and electronic device
CN113823278B (zh) Speech recognition method, apparatus, electronic device and storage medium
CN110288999B (zh) Speech recognition method, apparatus, computer device and storage medium
CN115331672B (zh) Device control method, apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19861479

Country of ref document: EP

Kind code of ref document: A1