WO2023193442A1 - Speech recognition method, apparatus, device, and medium (语音识别方法、装置、设备和介质) - Google Patents

Speech recognition method, apparatus, device, and medium (语音识别方法、装置、设备和介质)

Info

Publication number
WO2023193442A1
WO2023193442A1 · PCT/CN2022/132456
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
pinyin
text
score
acoustic
Prior art date
Application number
PCT/CN2022/132456
Other languages
English (en)
French (fr)
Inventor
程强
贾磊
钱胜
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023193442A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/26 Speech to text systems
    • G10L2015/0631 Creating reference templates; Clustering
    • G10L2015/086 Recognition of spelled words

Definitions

  • The present disclosure relates to the field of Internet technology, and to artificial intelligence, natural language processing, speech technology, and deep learning; for example, it relates to a speech recognition method, apparatus, device, medium, and program product.
  • Speech recognition services can convert speech signals into text information and are widely used in many fields.
  • Speech recognition services usually consist of an acoustic model and a language model.
  • The acoustic model classifies the acoustic features of the input speech into units such as phonemes or words and gives the probability of occurrence of a sequence composed of these phonemes or words, i.e., the acoustic score;
  • the language model decodes the words into a complete sentence and determines the probability of occurrence of that sentence, i.e., the language score. The final recognition result of the input speech is then obtained based on the acoustic score and the language score.
  • The present disclosure provides a speech recognition method, apparatus, device, medium, and program product.
  • a speech recognition method, including:
  • using an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data;
  • obtaining, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin;
  • determining the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model;
  • determining the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
  • a speech recognition apparatus, including:
  • a text acoustic score determination module, configured to use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data;
  • a pinyin acoustic score mapping module, configured to obtain, according to the pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin;
  • a language score determination module, configured to determine the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model;
  • a recognition result determination module, configured to determine the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
  • an electronic device, including:
  • at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above speech recognition method.
  • a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above speech recognition method.
  • a computer program product including a computer program that implements the above speech recognition method when executed by a processor.
  • FIG. 1 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure;
  • FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure;
  • FIG. 3 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure;
  • FIG. 4 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present disclosure;
  • FIG. 5 is a schematic structural diagram of a pinyin acoustic score mapping module provided by an embodiment of the present disclosure;
  • FIG. 6 is a schematic structural diagram of another pinyin acoustic score mapping module provided by an embodiment of the present disclosure;
  • FIG. 7 is a schematic structural diagram of a language score determination module provided by an embodiment of the present disclosure;
  • FIG. 8 is a block diagram of an electronic device implementing a speech recognition method provided by an embodiment of the present disclosure.
  • The acoustic scores output by some acoustic models are scores of Chinese characters, and the scores of individual characters are essentially all different; such an acoustic model therefore achieves good recognition accuracy in the scenarios corresponding to its training data. But when the content of the user's speech involves specialized vertical-domain scenarios not seen in the training data, recognition accuracy remains low.
  • The present disclosure can improve the accuracy of speech recognition in vertical-domain scenarios, as described through the following embodiments.
  • FIG. 1 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure. This embodiment is applicable to recognizing input speech and converting it into corresponding text, and relates to the field of Internet technology, and to artificial intelligence, natural language processing, speech technology, and deep learning.
  • The method may be performed by a speech recognition apparatus, implemented in software and/or hardware, which may be configured in an electronic device such as a computer device or a server. As shown in FIG. 1, the method includes the following steps:
  • S101: Use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data.
  • The acoustic model may be pre-trained based on deep learning technology; the embodiments of the present disclosure place no limitation on its network structure.
  • The role of the acoustic model is to output, according to the acoustic features of the input current frame of audio data, at least one first candidate text unit corresponding to the current frame together with its acoustic score.
  • There may be multiple first candidate text units, i.e., all possible first candidate text units that the acoustic model predicts for the current frame of audio data.
  • A first candidate text unit may be, for example, a single Chinese character, or a word or phrase composed of multiple Chinese characters.
  • The text pronunciation dictionary may be, for example, a Chinese character pronunciation dictionary that records the correspondence between Chinese characters and their pinyin, where a pinyin includes a syllable and a tone.
  • The embodiments of the present disclosure use the text pronunciation dictionary to map the acoustic score of the text output by the acoustic model to the acoustic score of the pinyin corresponding to that text.
  • The at least one first candidate text unit corresponding to the current frame of audio data can be mapped, through the text pronunciation dictionary, to the candidate pinyin corresponding to each first candidate text unit; the acoustic score of the corresponding candidate pinyin can then be obtained from the acoustic score of the first candidate text unit.
  • The embodiments of the present disclosure obtain the acoustic score of at least one candidate pinyin through S102 and then obtain the language score of the candidate pinyin through S103.
  • The final speech recognition result can then be determined based on the acoustic score and the language score of the candidate pinyin.
  • The language model in the embodiments of the present disclosure may be pre-trained based on deep learning technology.
  • The embodiments of the present disclosure place no limitation on the network structure of the language model.
  • The pinyin-text graph can be created in advance based on graph technology and is not described in detail here.
  • The pinyin-text graph records the relationship between pinyin and text. Therefore, based on the graph, the text corresponding to a candidate pinyin can be obtained; inputting that text into the language model outputs the predicted language score for that text, i.e., the language score of the corresponding candidate pinyin.
  • A piece of speech to be recognized is divided into multiple frames of audio data. Each frame of audio data is recognized according to the speech recognition method of the embodiments of the present disclosure to obtain a per-frame recognition result, from which the speech recognition result of the complete speech to be recognized is obtained.
  • After the acoustic model outputs the acoustic score of the first candidate text unit, the score of the text output by the acoustic model is applied to pinyin based on the text-pinyin correspondence in the text pronunciation dictionary, yielding the acoustic score of at least one candidate pinyin, rather than being applied to Chinese characters as in the related art. Then, based on the pinyin-text graph, the language model is used to obtain the language score of the at least one candidate pinyin; finally, the acoustic score and the language score of the at least one candidate pinyin are combined to determine the speech recognition result.
  • The embodiments of the present disclosure keep the acoustic model unchanged, ensuring good phonetic discrimination by having the acoustic model output text acoustic scores, and then determine the acoustic scores of the mapped pinyins from the text pronunciation dictionary and the text acoustic scores. That is, the pinyin is decided by the acoustic model, and which text it corresponds to is decided by the language model. This increases the role of the language model overall, allowing it to be better applied in vertical-domain scenarios: on the premise of meeting the accuracy requirements of speech recognition in the basic general domain, it solves the vertical-domain-specific keyword recognition errors of the related art and improves the accuracy of speech recognition in vertical-domain scenarios.
  • FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure. This embodiment is described on the basis of the above embodiment. As shown in FIG. 2, the method includes the following steps:
  • S201: Use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data.
  • The target text can serve as the intermediate speech recognition result of the preceding frame of audio data. For example, if the current frame is the third frame and the top three recognition results over the first and second frames are "我们" (wǒmen, "we"), "他们" (tāmen, "they"), and "小明" (Xiǎo Míng), then the target candidate pinyins corresponding to the second frame's recognition results include "mén" and "míng". Reverse mapping through the text pronunciation dictionary finds the texts corresponding to these two pinyins, and the target text with the highest acoustic score among them is the intermediate speech recognition result of the second frame of audio data.
  • The preset condition may include, for example, having the highest acoustic score; that is, among the acoustic scores of the first candidate text units corresponding to each candidate pinyin, the highest score is selected as that candidate pinyin's acoustic score.
  • Alternatively, the acoustic scores of the first candidate text units corresponding to each candidate pinyin are summed, and the resulting sum is used as that candidate pinyin's acoustic score.
  • Either of these two ways of determining candidate-pinyin acoustic scores is acceptable; in an implementation, the choice can be configured according to the needs of the actual scenario.
  • S205: Determine the language score of the at least one candidate pinyin based on the pre-established pinyin-text graph and by using a language model.
  • FIG. 3 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure. This embodiment is described on the basis of the above embodiments. As shown in FIG. 3, the method includes the following steps:
  • The pinyin-text graph is built from the text pronunciation dictionary. Since the graph records the correspondence between characters, words, and pinyins, the second candidate text unit obtained from the graph may be a character corresponding to the current candidate pinyin, or a word composed of the current candidate pinyin together with the candidate pinyin corresponding to the previous frame of audio data.
  • S305: Perform a weighted summation of the acoustic score and the language score of each of the at least one candidate pinyin, and determine the speech recognition result of the current frame of audio data according to the weighted summation result.
  • In one implementation, the weight of the acoustic score may be smaller than the weight of the language score.
  • The technical solution of the disclosed embodiments not only guarantees a certain recognition accuracy in the general domain, but also, by enhancing the role of the language model, improves recognition accuracy in vertical-domain scenarios and solves the vertical-domain-specific problem of keyword recognition errors.
  • FIG. 4 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present disclosure. This embodiment is applicable to recognizing input speech and converting it into corresponding text, and relates to the field of Internet technology, and to artificial intelligence, natural language processing, speech technology, and deep learning.
  • The apparatus can implement the speech recognition method described in any embodiment of the present disclosure. As shown in FIG. 4, the apparatus 400 includes:
  • the pinyin acoustic score mapping module 420 includes:
  • a second candidate pinyin acoustic score determination unit 423, configured to compute the sum of the acoustic scores of the first candidate text units corresponding to each candidate pinyin and take the resulting sum as the acoustic score of each candidate pinyin.
  • the language score determination module 430 includes:
  • the weight of the acoustic score is smaller than the weight of the language score.
  • the language model is trained using training corpora of a vertical-domain scenario.
  • The above products can perform the method provided by any embodiment of the present disclosure and have the corresponding functional modules and effects for performing the method.
  • The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved all comply with relevant laws and regulations and do not violate public order and good morals.
  • Electronic device 500 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the disclosure described and/or claimed herein.
  • The I/O interface 505 is connected to: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and speakers; a storage unit 508, such as a magnetic disk or an optical disc; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver.
  • The communication unit 509 allows the electronic device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 501 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like.
  • The computing unit 501 performs the methods and processes described above, such as the speech recognition method. For example, in some embodiments, the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508.
  • Part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509.
  • When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the speech recognition method described above may be performed.
  • Alternatively, the computing unit 501 may be configured to perform the speech recognition method in any other suitable manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • Various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
  • A machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • The systems and techniques described herein may be implemented on a computer having: a display device configured to display information to the user (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor); and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer.
  • Other kinds of devices may also be configured to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, speech, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes a back-end component (e.g., as a data server), or a computing system that includes a middleware component (e.g., an application server), or a computing system that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with an implementation of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
  • Artificial intelligence is the discipline of using computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), involving both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include major directions such as computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technology.
  • Cloud computing refers to a technical system that accesses an elastic and scalable shared pool of physical or virtual resources over a network.
  • Resources may include servers, operating systems, networks, software, applications, and storage devices, and may be deployed and managed on demand in a self-service manner.
  • Steps may be reordered, added, or removed using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order; as long as the desired results of the technical solutions provided by the present disclosure can be achieved, no limitation is imposed here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a speech recognition method, apparatus, device, medium, and program product. The speech recognition method includes: using an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data; obtaining, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin; determining the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model; and determining the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.

Description

Speech Recognition Method, Apparatus, Device, and Medium
This application claims priority to Chinese patent application No. 202210357646.X, filed with the Chinese Patent Office on April 6, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of Internet technology, and to artificial intelligence, natural language processing, speech technology, and deep learning; for example, it relates to a speech recognition method, apparatus, device, medium, and program product.
Background
Speech recognition services can convert speech signals into text information and are widely used in many fields. A speech recognition service usually consists of an acoustic model and a language model. The acoustic model classifies the acoustic features of the input speech into units such as phonemes or words and gives the probability of occurrence of a sequence composed of these phonemes or words, i.e., the acoustic score; the language model decodes the words into a complete sentence and determines the probability of occurrence of that sentence, i.e., the language score. The final recognition result of the input speech is then obtained based on the acoustic score and the language score.
Summary
The present disclosure provides a speech recognition method, apparatus, device, medium, and program product.
According to an aspect of the present disclosure, a speech recognition method is provided, including:
using an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data;
obtaining, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin;
determining the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model;
determining the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
According to another aspect of the present disclosure, a speech recognition apparatus is provided, including:
a text acoustic score determination module, configured to use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data;
a pinyin acoustic score mapping module, configured to obtain, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin;
a language score determination module, configured to determine the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model;
a recognition result determination module, configured to determine the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above speech recognition method.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the above speech recognition method.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the above speech recognition method.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a pinyin acoustic score mapping module provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of another pinyin acoustic score mapping module provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a language score determination module provided by an embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device implementing a speech recognition method provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. For clarity and conciseness, descriptions of well-known functions and structures, and of functions and structures of low relevance to the embodiments below, are omitted.
In speech recognition, the acoustic scores output by some acoustic models are scores of Chinese characters, and the scores of individual characters are essentially all different; such an acoustic model therefore achieves good recognition accuracy in the scenarios corresponding to its training data. But when the content of the user's speech involves specialized vertical-domain scenarios not seen in the training data, recognition accuracy remains low. The present disclosure can improve the accuracy of speech recognition in vertical-domain scenarios, as described through the following embodiments.
FIG. 1 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure. This embodiment is applicable to recognizing input speech and converting it into corresponding text, and relates to the field of Internet technology, and to artificial intelligence, natural language processing, speech technology, and deep learning. The method may be performed by a speech recognition apparatus, implemented in software and/or hardware, which may be configured in an electronic device such as a computer device or a server. As shown in FIG. 1, the method includes the following steps:
S101: Use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data.
The acoustic model may be pre-trained based on deep learning technology; the embodiments of the present disclosure place no limitation on its network structure. The role of the acoustic model is to output, according to the acoustic features of the input current frame of audio data, at least one first candidate text unit corresponding to the current frame together with its acoustic score. There may be multiple first candidate text units, i.e., all possible first candidate text units that the acoustic model predicts for the current frame of audio data; a first candidate text unit may be, for example, a single Chinese character, or a word or phrase composed of multiple Chinese characters.
The terms "first", "second", and the like in the embodiments of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
S102: Obtain, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin.
The text pronunciation dictionary may be, for example, a Chinese character pronunciation dictionary that records the correspondence between Chinese characters and their pinyin, where a pinyin includes a syllable and a tone. The embodiments of the present disclosure use the text pronunciation dictionary to map the acoustic score of the text output by the acoustic model to the acoustic score of the pinyin corresponding to that text.
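As a concrete illustration of such a dictionary, the following is a minimal Python sketch; the use of the open-source pypinyin library is an assumption made here for demonstration, since the patent does not name any particular tool or data source.

```python
# A minimal sketch of building a text pronunciation dictionary
# (text unit -> per-character pinyin readings, tones included).
# pypinyin is an illustrative assumption, not part of the patent.
from pypinyin import pinyin, Style

def build_pronunciation_dict(vocab):
    """Map each text unit to its possible pinyin readings (with tones)."""
    pron = {}
    for unit in vocab:
        # heteronym=True keeps all readings of polyphonic characters
        pron[unit] = pinyin(unit, style=Style.TONE, heteronym=True)
    return pron

print(build_pronunciation_dict(["门", "明", "我们"]))
# e.g. {'门': [['mén']], '明': [['míng']], '我们': [['wǒ'], ['men']]}
```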
Since the text pronunciation dictionary records the correspondence between text and pinyin, the at least one first candidate text unit corresponding to the current frame of audio data can be mapped, through the dictionary, to the candidate pinyin corresponding to each first candidate text unit; the acoustic score of each corresponding candidate pinyin can then be obtained from the acoustic score of the first candidate text unit.
S103: Determine the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model.
The embodiments of the present disclosure obtain the acoustic score of at least one candidate pinyin through S102 and then obtain the language score of the candidate pinyin through S103, after which the final speech recognition result can be determined based on the acoustic score and the language score of the candidate pinyin.
The language model in the embodiments of the present disclosure may be pre-trained based on deep learning technology; the embodiments place no limitation on its network structure. The pinyin-text graph may be built in advance based on graph technology and is not described in detail here. The graph records the relationship between pinyin and text; therefore, based on the graph, the text corresponding to a candidate pinyin can be obtained, and inputting that text into the language model yields the predicted language score for that text, i.e., the language score of the corresponding candidate pinyin.
S104: Determine the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
Usually, a piece of speech to be recognized is divided into multiple frames of audio data; each frame of audio data is recognized according to the speech recognition method of the embodiments of the present disclosure to obtain a per-frame recognition result, from which the speech recognition result of the complete speech to be recognized is obtained.
Acoustic models in related speech recognition services usually output acoustic scores corresponding to Chinese characters. Although recognition accuracy is high in the general domain, in vertical-domain scenarios unseen in the training data this easily causes near-homophone or homophone errors, for example misrecognizing "肠鸣音" (bowel sounds) in a medical scenario as "长鸣音" (a long beep), degrading recognition accuracy. In the technical solution of the embodiments of the present disclosure, after the acoustic model outputs the acoustic score of the first candidate text unit, the score of the text output by the acoustic model is applied to pinyin based on the text-pinyin correspondence in the text pronunciation dictionary, yielding the acoustic score of at least one candidate pinyin, rather than being applied to Chinese characters as in the related art; then, based on the pinyin-text graph, the language model is used to obtain the language score of the at least one candidate pinyin; finally, the acoustic score and the language score of the at least one candidate pinyin are combined to determine the speech recognition result.
In this process, the embodiments of the present disclosure keep the acoustic model unchanged, ensuring good phonetic discrimination by having the acoustic model output text acoustic scores, and then determine the acoustic scores of the mapped pinyins from the text pronunciation dictionary and the text acoustic scores. That is, the pinyin is decided by the acoustic model, and which text it corresponds to is decided by the language model. This increases the role of the language model overall, allowing it to be better applied in vertical-domain scenarios: on the premise of meeting the accuracy requirements of speech recognition in the basic general domain, it solves the vertical-domain-specific keyword recognition errors of the related art and improves the accuracy of speech recognition in vertical-domain scenarios.
FIG. 2 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure. This embodiment is described on the basis of the above embodiment. As shown in FIG. 2, the method includes the following steps:
S201: Use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data.
The acoustic score of the first candidate text unit is determined according to the intermediate speech recognition result of the frame of audio data preceding the current frame; the intermediate speech recognition result of the preceding frame is the target text with the highest acoustic score, obtained by reverse-mapping, according to the text pronunciation dictionary, the target candidate pinyins corresponding to the speech recognition results of the preceding frame.
To improve the accuracy of the acoustic model's prediction, the prediction for the current frame of audio data needs to incorporate the intermediate speech recognition result of the preceding frame. Because the acoustic model outputs text acoustic scores, and the embodiments of the present disclosure map those scores to the acoustic scores of the corresponding pinyins and then use the graph and the language model to obtain language scores, the target candidate pinyins corresponding to the recognition results of the preceding frame must first be determined; then, among the texts obtained by reverse-mapping the target candidate pinyins through the text pronunciation dictionary, the text with the highest acoustic score is selected as the target text, so that this target text can serve as the intermediate speech recognition result of the preceding frame. For example, if the current frame is the third frame and the top three recognition results over the first and second frames are "我们" (wǒmen, "we"), "他们" (tāmen, "they"), and "小明" (Xiǎo Míng), then the target candidate pinyins corresponding to the second frame's recognition results include "mén" and "míng"; reverse mapping through the text pronunciation dictionary finds the texts corresponding to these two pinyins, and the text with the highest acoustic score among them is the intermediate speech recognition result of the second frame of audio data.
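A minimal sketch of this reverse mapping follows; the dict-based structures and the toy scores are illustrative assumptions, not the patent's actual data structures.

```python
# A minimal sketch of the reverse mapping described above: from the
# previous frame's target candidate pinyins back to texts, then picking
# the text with the highest acoustic score as the intermediate result.
def intermediate_result(target_pinyins, pron_dict, acoustic_scores):
    """pron_dict: {text: pinyin}; acoustic_scores: {text: acoustic score}."""
    # reverse-map: collect every text whose pinyin is one of the targets
    texts = [t for t, py in pron_dict.items() if py in target_pinyins]
    # the target text is the reverse-mapped text with the highest score
    return max(texts, key=lambda t: acoustic_scores.get(t, float("-inf")))

pron_dict = {"们": "mén", "门": "mén", "明": "míng", "名": "míng"}
scores = {"们": 0.52, "门": 0.31, "明": 0.12, "名": 0.05}
print(intermediate_result({"mén", "míng"}, pron_dict, scores))  # '们'
```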
S202: Obtain, from the text pronunciation dictionary, at least one candidate pinyin corresponding to the at least one first candidate text unit.
Each first candidate text unit is mapped in the text pronunciation dictionary to obtain its corresponding candidate pinyin. Since different texts may correspond to the same pinyin, the number of candidate pinyins is less than or equal to the number of first candidate text units.
S203: Take, among the acoustic scores of the first candidate text units corresponding to each candidate pinyin, the acoustic score that satisfies a preset condition as the acoustic score of that candidate pinyin.
S204: Compute the sum of the acoustic scores of the first candidate text units corresponding to each candidate pinyin, and take the resulting sum as the acoustic score of that candidate pinyin.
S203 and S204 provide two different ways to obtain the acoustic score of a candidate pinyin. In S203, the preset condition may include, for example, having the highest acoustic score; that is, among the acoustic scores of the first candidate text units corresponding to each candidate pinyin, the highest score is selected as that candidate pinyin's acoustic score. In S204, the acoustic scores of the first candidate text units corresponding to each candidate pinyin are summed, and the sum is used as that candidate pinyin's acoustic score. Either way of determining candidate-pinyin acoustic scores is acceptable; in an implementation, the choice can be configured according to the needs of the actual scenario. A sketch of both variants is shown below.
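The following is a minimal Python sketch of this text-to-pinyin score mapping (S202 through S204); the dict-based structures and the `how` switch are illustrative assumptions, not the patent's actual implementation.

```python
# A minimal sketch of S202-S204: mapping text acoustic scores to pinyin
# acoustic scores, with both aggregation variants.
from collections import defaultdict

def pinyin_acoustic_scores(text_scores, pron_dict, how="max"):
    """text_scores: {text unit: acoustic score from the acoustic model};
    pron_dict: {text unit: list of candidate pinyins};
    how: 'max' keeps the best-scoring text per pinyin (S203),
         'sum' accumulates all texts sharing a pinyin (S204)."""
    grouped = defaultdict(list)
    for unit, score in text_scores.items():
        for py in pron_dict[unit]:
            grouped[py].append(score)
    agg = max if how == "max" else sum
    return {py: agg(scores) for py, scores in grouped.items()}

text_scores = {"们": 0.52, "门": 0.31, "明": 0.12}   # toy acoustic scores
pron_dict = {"们": ["mén"], "门": ["mén"], "明": ["míng"]}
print(pinyin_acoustic_scores(text_scores, pron_dict, how="max"))
# {'mén': 0.52, 'míng': 0.12}
print(pinyin_acoustic_scores(text_scores, pron_dict, how="sum"))
# {'mén': 0.83, 'míng': 0.12}
```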
S205: Determine the language score of the at least one candidate pinyin based on the pre-established pinyin-text graph and by using a language model.
The language model is trained using training corpora of the vertical-domain scenario, so as to improve recognition accuracy in that scenario.
S206: Determine the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
The technical solution of the embodiments of the present disclosure maps text acoustic scores to pinyin acoustic scores, then determines language scores based on the pinyin-text graph and the language model, and finally combines the acoustic scores and the language scores of the candidate pinyins to obtain the final recognition result, which increases the role of the language model overall and improves recognition accuracy in vertical-domain scenarios. Moreover, on the one hand, after the language model is trained with training samples from a vertical-domain scenario, a speech recognition service built on that language model is better suited to the vertical-domain scenario and achieves higher recognition accuracy there; on the other hand, compared with the acoustic model, training the language model is very convenient and fast, so even if the language model needs to be retrained for different vertical-domain scenarios, this does not add much training cost. The scheme of the embodiments of the present disclosure is therefore applicable to a wider range of scenarios, is easier to implement, and is of great help in extending a general-purpose acoustic model to a variety of vertical-domain recognition scenarios.
FIG. 3 is a schematic flowchart of another speech recognition method provided by an embodiment of the present disclosure. This embodiment is described on the basis of the above embodiments. As shown in FIG. 3, the method includes the following steps:
S301: Use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data.
S302: Obtain, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin.
S303: Obtain, according to the pinyin-text graph, the second candidate text unit corresponding to each candidate pinyin.
The pinyin-text graph is built from the text pronunciation dictionary. Since the graph records the correspondence between characters, words, and pinyins, the second candidate text unit obtained from the graph may be a character corresponding to the current candidate pinyin, or a word composed of the current candidate pinyin together with the candidate pinyin corresponding to the previous frame of audio data.
S304: Input the second candidate text unit into the language model to obtain the language score of each candidate pinyin.
Each candidate pinyin has corresponding second candidate text units, of which there may be one or more. The language model scores these second candidate text units, and the language score of the corresponding candidate pinyin can be determined from the language scores of these second candidate text units.
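A minimal sketch of S303 and S304 follows; the graph is simplified to a plain dict and lm_score is a hypothetical stand-in for the patent's (unspecified) language model, both being assumptions for illustration.

```python
# A minimal sketch of S303-S304: looking up the second candidate text
# units for each candidate pinyin in a pinyin-text graph, then scoring
# them with a language model. The dict-based "graph" and the lm_score
# function are hypothetical stand-ins; the patent does not specify the
# graph format or the language model's network structure.
graph = {  # pinyin -> second candidate text units (toy example)
    "mén": ["门", "我们", "他们"],
    "míng": ["明", "小明"],
}

def lm_score(text):
    """Hypothetical language model: returns a log-probability-like score."""
    toy_lm = {"我们": -0.5, "他们": -1.2, "小明": -2.0, "门": -3.0, "明": -3.5}
    return toy_lm.get(text, -10.0)

def pinyin_language_scores(candidate_pinyins):
    """Language score of each pinyin, taken from its best-scoring text."""
    return {
        py: max((lm_score(t) for t in graph.get(py, [])), default=float("-inf"))
        for py in candidate_pinyins
    }

print(pinyin_language_scores(["mén", "míng"]))  # {'mén': -0.5, 'míng': -2.0}
```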
S305: Perform a weighted summation of the acoustic score and the language score of each of the at least one candidate pinyin, and determine the speech recognition result of the current frame of audio data according to the weighted summation result.
The weighted summation results are ranked, and the text corresponding, in the language model, to the candidate pinyin with the largest value is selected as the speech recognition result of the current frame of audio data.
In one implementation, the weight of the acoustic score may be smaller than the weight of the language score. The benefit of doing so is that it offsets part of the risk that mapping the text's acoustic score onto pinyin slightly degrades recognition in the general domain.
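A minimal sketch of this weighted fusion (S305) under assumed toy scores; the 0.3/0.7 weights are illustrative, the patent only requiring, in one implementation, that the acoustic weight be smaller than the language weight.

```python
# A minimal sketch of S305: weighted fusion of acoustic and language
# scores per candidate pinyin, then picking the top candidate. The toy
# scores and weights are illustrative assumptions.
def decode_frame(acoustic, language, best_text, w_ac=0.3, w_lm=0.7):
    """acoustic/language: {pinyin: score}; best_text: {pinyin: text}."""
    fused = {py: w_ac * acoustic[py] + w_lm * language[py] for py in acoustic}
    best_py = max(fused, key=fused.get)  # rank the weighted sums, take the top
    return best_text[best_py], fused[best_py]

acoustic = {"mén": 0.52, "míng": 0.12}
language = {"mén": 0.80, "míng": 0.30}
best_text = {"mén": "我们", "míng": "小明"}  # best text per pinyin (from the LM)
print(decode_frame(acoustic, language, best_text))  # ('我们', 0.716)
```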
The technical solution of the embodiments of the present disclosure not only guarantees a certain recognition accuracy in the general domain, but also, by enhancing the role of the language model, improves recognition accuracy in vertical-domain scenarios and solves the vertical-domain-specific problem of keyword recognition errors.
FIG. 4 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present disclosure. This embodiment is applicable to recognizing input speech and converting it into corresponding text, and relates to the field of Internet technology, and to artificial intelligence, natural language processing, speech technology, and deep learning. The apparatus can implement the speech recognition method described in any embodiment of the present disclosure. As shown in FIG. 4, the apparatus 400 includes:
a text acoustic score determination module 410, configured to use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data; a pinyin acoustic score mapping module 420, configured to obtain, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, where the text pronunciation dictionary records the correspondence between text and pinyin; a language score determination module 430, configured to determine the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model; and a recognition result determination module 440, configured to determine the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
As shown in FIG. 5, in one embodiment, the pinyin acoustic score mapping module 420 includes:
a candidate pinyin acquisition unit 421, configured to obtain, from the text pronunciation dictionary, at least one candidate pinyin corresponding to the at least one first candidate text unit; and a first candidate pinyin acoustic score determination unit 422, configured to take, among the acoustic scores of the first candidate text units corresponding to each candidate pinyin, the acoustic score satisfying a preset condition as the acoustic score of that candidate pinyin.
As shown in FIG. 6, in one embodiment, the pinyin acoustic score mapping module 420 further includes:
a second candidate pinyin acoustic score determination unit 423, configured to compute the sum of the acoustic scores of the first candidate text units corresponding to each candidate pinyin and take the resulting sum as the acoustic score of that candidate pinyin.
As shown in FIG. 7, in one embodiment, the language score determination module 430 includes:
a second candidate text unit acquisition unit 431, configured to obtain, according to the pinyin-text graph, the second candidate text unit corresponding to each candidate pinyin; and a candidate pinyin language score determination unit 432, configured to input the second candidate text unit into the language model to obtain the language score of each candidate pinyin.
In one embodiment, the recognition result determination module 440 is configured to:
perform a weighted summation of the acoustic score and the language score of each of the at least one candidate pinyin, and determine the speech recognition result of the current frame of audio data according to the weighted summation result.
In one embodiment, the weight of the acoustic score is smaller than the weight of the language score.
In one embodiment, the acoustic score of the first candidate text unit is determined according to the intermediate speech recognition result of the frame of audio data preceding the current frame; the intermediate speech recognition result of the preceding frame is the target text with the highest acoustic score, obtained by reverse-mapping, according to the text pronunciation dictionary, the target candidate pinyins corresponding to the speech recognition results of the preceding frame.
In one embodiment, the language model is trained using training corpora of a vertical-domain scenario.
In one embodiment, the pinyin-text graph is built from the text pronunciation dictionary.
The above products can perform the method provided by any embodiment of the present disclosure and have the corresponding functional modules and effects for performing the method.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 8 shows a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic device 500 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 8, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the electronic device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Multiple components of the electronic device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and speakers; a storage unit 508, such as a magnetic disk or an optical disc; and a communication unit 509, such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 501 performs the methods and processes described above, such as the speech recognition method. For example, in some embodiments, the speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the speech recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the speech recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. The various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device configured to display information to the user (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor); and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can also be configured to provide interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user can be received in any form (including acoustic, speech, or tactile input).
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or a computing system that includes a middleware component (e.g., an application server), or a computing system that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which the user can interact with an implementation of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), blockchain network, and the Internet.
A computer system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak business scalability found in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
Artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), involving both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include major directions such as computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technology.
Cloud computing refers to a technical system that accesses an elastic and scalable shared pool of physical or virtual resources over a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications such as artificial intelligence and blockchain, and for model training.
Steps may be reordered, added, or removed using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order; as long as the desired results of the technical solutions provided by the present disclosure can be achieved, no limitation is imposed herein.

Claims (13)

  1. A speech recognition method, comprising:
    using an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data;
    obtaining, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, wherein the text pronunciation dictionary records the correspondence between text and pinyin;
    determining the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model; and
    determining the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
  2. The method according to claim 1, wherein obtaining, according to the pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of the at least one candidate pinyin corresponding to the current frame of audio data comprises:
    obtaining, from the text pronunciation dictionary, the at least one candidate pinyin corresponding to the at least one first candidate text unit; and
    taking, among the acoustic scores of the first candidate text units corresponding to each candidate pinyin, the acoustic score satisfying a preset condition as the acoustic score of that candidate pinyin.
  3. The method according to claim 1, wherein obtaining, according to the pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of the at least one candidate pinyin corresponding to the current frame of audio data comprises:
    obtaining, from the text pronunciation dictionary, the at least one candidate pinyin corresponding to the at least one first candidate text unit; and
    computing the sum of the acoustic scores of the first candidate text units corresponding to each candidate pinyin, and taking the resulting sum as the acoustic score of that candidate pinyin.
  4. The method according to claim 1, wherein determining the language score of the at least one candidate pinyin based on the pre-established pinyin-text graph and by using the language model comprises:
    obtaining, according to the pinyin-text graph, the second candidate text unit corresponding to each candidate pinyin; and
    inputting the second candidate text unit into the language model to obtain the language score of each candidate pinyin.
  5. The method according to claim 1, wherein determining the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin comprises:
    performing a weighted summation of the acoustic score and the language score of each of the at least one candidate pinyin, and determining the speech recognition result of the current frame of audio data according to the weighted summation result.
  6. The method according to claim 5, wherein the weight of the acoustic score is smaller than the weight of the language score.
  7. The method according to claim 1, wherein the acoustic score of the first candidate text unit is determined according to the intermediate speech recognition result of the frame of audio data preceding the current frame;
    wherein the intermediate speech recognition result of the preceding frame is the target text with the highest acoustic score, obtained by reverse-mapping, according to the text pronunciation dictionary, the target candidate pinyins corresponding to the speech recognition results of the preceding frame.
  8. The method according to claim 1, wherein the language model is trained using training corpora of a vertical-domain scenario.
  9. The method according to claim 1, wherein the pinyin-text graph is built from the text pronunciation dictionary.
  10. A speech recognition apparatus, comprising:
    a text acoustic score determination module, configured to use an acoustic model to determine the acoustic score of at least one first candidate text unit corresponding to the current frame of audio data;
    a pinyin acoustic score mapping module, configured to obtain, according to a pre-established text pronunciation dictionary and the acoustic score of the at least one first candidate text unit, the acoustic score of at least one candidate pinyin corresponding to the current frame of audio data, wherein the text pronunciation dictionary records the correspondence between text and pinyin;
    a language score determination module, configured to determine the language score of the at least one candidate pinyin based on a pre-established pinyin-text graph and by using a language model; and
    a recognition result determination module, configured to determine the speech recognition result of the current frame of audio data according to the acoustic score and the language score of the at least one candidate pinyin.
  11. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the speech recognition method of any one of claims 1-9.
  12. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the speech recognition method of any one of claims 1-9.
  13. A computer program product, comprising a computer program that, when executed by a processor, implements the speech recognition method of any one of claims 1-9.
PCT/CN2022/132456 2022-04-06 2022-11-17 Speech recognition method, apparatus, device, and medium (语音识别方法、装置、设备和介质) WO2023193442A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210357646.X 2022-04-06
CN202210357646.XA CN114758649B (zh) 2022-04-06 2022-07-15 Speech recognition method, apparatus, device, and medium (一种语音识别方法、装置、设备和介质)

Publications (1)

Publication Number Publication Date
WO2023193442A1 (zh)

Family

ID=82328912

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/132456 WO2023193442A1 (zh) 2022-04-06 2022-11-17 语音识别方法、装置、设备和介质

Country Status (2)

Country Link
CN (1) CN114758649B (zh)
WO (1) WO2023193442A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758649B (zh) * 2022-04-06 2024-04-19 北京百度网讯科技有限公司 一种语音识别方法、装置、设备和介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1895509A1 (de) * 2006-09-04 2008-03-05 Siemens VDO Automotive AG Verfahren zur Spracherkennung
CN106843523A (zh) * 2016-12-12 2017-06-13 百度在线网络技术(北京)有限公司 基于人工智能的文字输入方法和装置
CN107016994A (zh) * 2016-01-27 2017-08-04 阿里巴巴集团控股有限公司 语音识别的方法及装置
CN113782030A (zh) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 基于多模态语音识别结果纠错方法及相关设备
CN114758649A (zh) * 2022-04-06 2022-07-15 北京百度网讯科技有限公司 一种语音识别方法、装置、设备和介质

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104238991B (zh) 2013-06-21 2018-05-25 腾讯科技(深圳)有限公司 Speech input matching method and apparatus (语音输入匹配方法及装置)
CN103578464B (zh) 2013-10-18 2017-01-11 威盛电子股份有限公司 Language model building method, speech recognition method, and electronic apparatus (语言模型的建立方法、语音辨识方法及电子装置)
CN107657947B (zh) 2017-09-20 2020-11-24 百度在线网络技术(北京)有限公司 Artificial-intelligence-based speech processing method and apparatus (基于人工智能的语音处理方法及其装置)
CN108932941B (zh) 2017-10-13 2020-07-03 北京猎户星空科技有限公司 Speech recognition method and apparatus, computer device, storage medium, and program product (语音识别方法、装置及计算机设备、存储介质及程序产品)
CN108417202B (zh) 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Speech recognition method and system (语音识别方法及系统)
CN111435592B (zh) 2018-12-25 2023-12-01 Tcl科技集团股份有限公司 Speech recognition method and apparatus, and terminal device (一种语音识别方法、装置及终端设备)
CN111063336A (zh) 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end speech recognition system based on deep learning (一种基于深度学习的端对端语音识别系统)
CN111554297B (zh) 2020-05-15 2023-08-22 阿波罗智联(北京)科技有限公司 Speech recognition method, apparatus, device, and readable storage medium (语音识别方法、装置、设备及可读存储介质)
CN111627445B (zh) 2020-05-26 2023-07-07 福建省海峡智汇科技有限公司 Matching method and system for venues or personnel (一种用于场地或人员的匹配方法和系统)
CN112466288B (zh) 2020-12-18 2022-05-31 北京百度网讯科技有限公司 Speech recognition method and apparatus, electronic device, and storage medium (语音识别方法、装置、电子设备及存储介质)


Also Published As

Publication number Publication date
CN114758649B (zh) 2024-04-19
CN114758649A (zh) 2022-07-15

Similar Documents

Publication Publication Date Title
EP4060565A1 (en) Method and apparatus for acquiring pre-trained model
CN108170749B (zh) 基于人工智能的对话方法、装置及计算机可读介质
US20220004811A1 (en) Method and apparatus of training model, device, medium, and program product
CN113205817B (zh) 语音语义识别方法、系统、设备及介质
WO2020001458A1 (zh) 语音识别方法、装置及系统
US20230058437A1 (en) Method for human-computer interaction, apparatus for human-computer interaction, device, and storage medium
KR20220064940A (ko) 음성 생성 방법, 장치, 전자기기 및 저장매체
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
US20230004798A1 (en) Intent recognition model training and intent recognition method and apparatus
US20230178067A1 (en) Method of training speech synthesis model and method of synthesizing speech
CN113674746B (zh) 人机交互方法、装置、设备以及存储介质
WO2023142454A1 (zh) 语音翻译和模型训练方法、装置、电子设备以及存储介质
WO2023193442A1 (zh) 语音识别方法、装置、设备和介质
JP2023002690A (ja) セマンティックス認識方法、装置、電子機器及び記憶媒体
WO2023045186A1 (zh) 意图识别方法、装置、电子设备和存储介质
WO2021051564A1 (zh) 语音识别方法、装置、计算设备和存储介质
US20230410794A1 (en) Audio recognition method, method of training audio recognition model, and electronic device
CN117043856A (zh) 高效流式非递归设备上的端到端模型
US20230070966A1 (en) Method for processing question, electronic device and storage medium
EP4254256A1 (en) Spoken language processing method and apparatus, electronic device, and storage medium
US20230075339A1 (en) Method of training information generation model, method of generating information, and device
JP7349523B2 (ja) 音声認識方法、音声認識装置、電子機器、記憶媒体コンピュータプログラム製品及びコンピュータプログラム
JP2023078411A (ja) 情報処理方法、モデルトレーニング方法、装置、機器、媒体及びプログラム製品
US20220122586A1 (en) Fast Emit Low-latency Streaming ASR with Sequence-level Emission Regularization
KR20240065125A (ko) 희귀 단어 스피치 인식을 위한 대규모 언어 모델 데이터 선택

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22936370

Country of ref document: EP

Kind code of ref document: A1