WO2022166218A1 - Method for adding punctuation marks in speech recognition and speech recognition device - Google Patents

Method for adding punctuation marks in speech recognition and speech recognition device

Info

Publication number
WO2022166218A1
WO2022166218A1, PCT/CN2021/120413, CN2021120413W
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
punctuation
symbol
words
Prior art date
Application number
PCT/CN2021/120413
Other languages
English (en)
French (fr)
Inventor
陈文明
尚天赐
邓高锋
张世明
吕周谨
Original Assignee
虫洞创新平台(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 虫洞创新平台(深圳)有限公司
Publication of WO2022166218A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/26: Speech to text systems

Definitions

  • the present invention relates to the technical field of audio, and in particular, to the technical field of speech recognition.
  • the direct output of traditional speech recognition technology is usually a long, unpunctuated string of characters or words. As the speech grows longer, the text becomes harder to read. The output of a speech recognition system therefore needs punctuation added automatically, to improve intelligibility and efficiency.
  • the present application provides a method for adding punctuation marks in speech recognition, and a speech recognition device, which can automatically add punctuation marks to the text information output by a speech recognition device.
  • a method for adding punctuation marks in speech recognition includes: a discriminator discriminates and extracts speech features of a speech signal and obtains a speech data stream; a speech decoder decodes the speech data stream and determines first symbols according to a searchable state space and the speech features; the state space includes a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations; a deep neural network classifier further discriminates the first symbols according to context and outputs text information marked with second symbols; the deep neural network classifier is a pre-trained fast deep neural network classifier.
  • a speech recognition device comprising: a discriminator for discriminating and extracting speech features of a speech signal and obtaining a speech data stream; a speech decoder for decoding the speech data stream and determining first symbols according to a searchable state space and the speech features, the state space including a pronunciation dictionary, an acoustic model and a language model, and the pronunciation dictionary containing a set of words and their corresponding pronunciations; and a deep neural network classifier for further discriminating the first symbols according to context and outputting text information marked with second symbols, the deep neural network classifier being a pre-trained fast deep neural network classifier.
  • a speech recognition apparatus, which includes a processor and a memory; the processor invokes a program in the memory to execute any one of the above-mentioned methods for adding punctuation marks in speech recognition.
  • a computer-readable storage medium on which a program of a method for adding punctuation marks in speech recognition is stored; when the program is executed by a processor, any one of the above methods for adding punctuation marks in speech recognition is implemented.
  • the beneficial effect of the present application is that it starts simultaneously from three parts, namely the speech features of the voice signal, the language model and the DNN classifier, to solve the problem of automatically adding punctuation marks to speech recognition results: after punctuation is preliminarily assigned from the speech features and the language model, the DNN classifier further optimizes it and outputs text information containing the optimized punctuation marks. The accuracy of punctuation insertion is improved, which in turn improves the readability and legibility of the text output by speech recognition and enhances the user experience.
  • FIG. 1 is a schematic diagram of a system architecture to which an embodiment of the present application is applied.
  • FIG. 2 is a flowchart of a method for adding punctuation marks in speech recognition according to Embodiment 1 of the present application.
  • FIG. 3 is a flowchart of training a pronunciation dictionary in Embodiment 1 of the present application.
  • FIG. 4 is a flowchart of training a language model in Embodiment 1 of the present application.
  • FIG. 5 is a flowchart of training a DNN classifier in Embodiment 1 of the present application.
  • FIG. 6 is a schematic block diagram of a speech recognition apparatus according to Embodiment 2 of the present application.
  • FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to Embodiment 3 of the present application.
  • FIG. 1 is a schematic diagram of the speech recognition system architecture 100 to which an embodiment of the present application is applied.
  • the speech recognition system architecture 100 includes: an acoustic model 110, a pronunciation dictionary 120, a language model 130, a discriminator 140 that analyzes and captures the characteristics of the signal itself, a speech decoder 150, and a DNN (Deep Neural Network) classifier 160.
  • the speech recognition system architecture 100 contains a complete speech recognition process.
  • the acoustic model 110, the pronunciation dictionary 120 and the language model 130 together constitute the main body of the speech recognition system.
  • the pronunciation dictionary 120 contains a set of words that can be processed by the speech recognition system architecture 100, and indicates their pronunciations.
  • the mapping relationship between the modeling units of the acoustic model 110 and of the language model 130 is obtained through the pronunciation dictionary 120, so that the acoustic model 110 and the language model 130 are linked; together with the pronunciation dictionary 120 they form a searchable state space in which the speech decoder 150 performs its decoding work.
  • the input speech signal passes through the discriminator 140, and the discriminator 140 discriminates and extracts the speech features of the speech signal, and obtains a speech data stream.
  • the voice decoder 150 decodes the voice data stream, and determines the first symbol of the voice information according to the state space and the voice feature.
  • the DNN classifier 160 is a pre-trained fast DNN classifier; it further discriminates the punctuation marks preliminarily marked by the speech decoder 150 and, after optimizing the first symbols, outputs text information containing the second symbols. Specifically, the DNN classifier 160 further discriminates the first symbols from the speech decoder 150 by combining context-recognized text feature vectors with speech feature vectors. The function of automatically adding punctuation marks to speech recognition results is thus realized, and the accuracy of punctuation recognition is improved.
  • the embodiments of the present application can be applied to various apparatuses with a speech recognition function.
  • for example: a voice recorder, an audio conference terminal, an intelligent conference recording device, or intelligent electronic equipment with a voice recognition function, etc.
  • a method for adding punctuation marks in speech recognition provided by Embodiment 1 of the present application.
  • the method includes:
  • S210: the discriminator discriminates and extracts the speech features of the speech signal and obtains a speech data stream; the discriminator analyzes the characteristics of the signal itself; optionally, the speech features extracted by the discriminator include the durations of unvoiced speech segments and the timestamps of those segments;
  • S220: the speech decoder decodes the speech data stream and determines the first symbols according to a searchable state space and the speech features; the state space includes a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations;
  • S230: the deep neural network classifier further discriminates the first symbols according to context and outputs text information marked with the second symbols; the deep neural network classifier is a pre-trained fast deep neural network classifier.
  • optionally, the contextual text features and the speech features are both expressed as feature vectors; S230 may then specifically be: the deep neural network classifier classifies the first symbols according to context-recognized text feature vectors and speech feature vectors.
  • the DNN classifier is a separately trained text classifier whose input is the speech-recognized text bearing the first symbols and whose output is the text bearing the second symbols.
  • optionally, S210, in which the discriminator discriminates and extracts the speech features of the speech signal and obtains the speech data stream, includes:
  • S211: after receiving the speech information, determine the duration of the unvoiced speech segment using human-voice recognition technology;
  • S212: establish the timestamp of the unvoiced speech segment; specifically, the timestamp information is vectorized into the feature vector of the unvoiced speech segment and converted into a WFST (Weighted Finite-State Transducer, a weighted finite-state machine) for computation.
  • in this way, the process of external logical judgment can be omitted, which helps simplify the punctuation-adding flow and the computation. Meanwhile, since the speech duration information is added to the timestamps, the information judged by speech recognition technology within consecutive timestamps can assist the aforementioned acoustic model, pronunciation dictionary and language model in their judgments, which further improves the accuracy of punctuation insertion during speech recognition.
  • determining the first symbols according to a searchable state space and the speech features in S220 includes:
  • S221: determine, according to the duration of the unvoiced speech segment, whether the segment is a punctuation mark or a meaningless silence segment;
  • S222: identify the preliminary symbols in the speech data stream according to the state space;
  • S223: according to the timestamps, confirm that the punctuation marks in the preliminary symbols that correspond to punctuation marks of unvoiced speech segments are the first symbols. That is, a punctuation mark is retained only when the speech information corresponding to the timestamp is recognized as punctuation according to the state space and its duration also indicates that it is punctuation.
  • the pronunciation dictionary includes silence words, and the silence words include: the first silence words correspond to mid-sentence punctuation marks, the second silence words correspond to sentence-ending punctuation marks, and the third silence words correspond to meaningless silence words.
  • the pronunciation dictionary contains a collection of words that can be processed by the speech recognition device, and indicates the pronunciation of the word.
  • the pronunciation dictionary in speech recognition technology does not contain punctuation marks because punctuation marks are not pronounced.
  • in Embodiment 1 of the present application, so that punctuation can be added automatically during speech recognition, the pronunciation dictionary is improved: the words corresponding to unvoiced speech segments are defined as silence words, the silence words are divided into the above three categories, and the three categories of silence words correspond to different symbols in the pronunciation dictionary.
  • the first symbols are symbols used to indicate the position and type of punctuation in the speech information, where different first symbols correspond to different punctuation types. For example, the first symbols can mark the three categories of silence words from the pronunciation dictionary in the speech information: "^^" represents sentence-ending punctuation, "^" represents mid-sentence punctuation, and meaningless silence words have no special representation.
  • the second symbols are the concrete punctuation marks; for example, in the final output text, "^^" is replaced with a period and "^" with a comma.
  • the mapping relationship between the modeling units of the acoustic model and of the language model is obtained through the pronunciation dictionary, thereby connecting the acoustic model and the language model; together with the acoustic model and the language model, the dictionary forms a searchable state space used by the decoder for its decoding work.
  • the method for adding punctuation marks in speech recognition further includes: S240, training the pronunciation dictionary, which specifically includes:
  • S241: prepare the CMUdict (Carnegie Mellon University dictionary) pronunciation dictionary;
  • S242: train a G2P (Grapheme-to-Phoneme) model based on the CMUdict pronunciation dictionary;
  • S243: use the trained G2P model to automatically generate pronunciations for words that are in the training vocabulary of the language model but not in the CMUdict pronunciation dictionary;
  • S244: define the pronunciations of the silence words and add these words to the CMUdict pronunciation dictionary to form the pronunciation dictionary described herein; the silence words correspond to unvoiced speech segments and fall into three categories: mid-sentence punctuation, sentence-ending punctuation, and meaningless silence segments.
  • the language model is a pre-trained model.
  • the method for adding punctuation marks in speech recognition further includes: S250, training the language model, which specifically includes:
  • S251: based on the normalized text corpus, count the M most frequent words and the punctuation marks within the N target recognition ranges; the normalization of the text corpus may include at least one of the following: deleting punctuation outside the target recognition range, such as dashes, book-title marks and other uncommon punctuation; normalizing non-standard words, such as converting Roman numerals to decimal notation; converting non-ASCII characters to the nearest ASCII equivalents; and splitting the original text and correcting possible erroneous normalization;
  • S252: construct a training vocabulary from the M most frequent words and the punctuation marks within the N target recognition ranges; M and N are both positive integers greater than or equal to 1;
  • S253: train the language model according to the training vocabulary.
  • it should be noted that steps S251 to S253 do not process speech information; rather, during the construction and training of the language model, they use existing information and corpora to train it.
  • in Embodiment 1 of the present application, definitions of punctuation marks corresponding to the pronunciation dictionary are introduced into the existing N-Gram language model. The prediction of punctuation is defined more as a simple statistical prediction of punctuation from contextual text information; for example, there is usually a punctuation mark before the subjects "you", "me" and "he".
  • the method for adding punctuation marks in speech recognition further includes: S260, training the deep neural network classifier, which specifically includes:
  • S261: classify the target punctuation marks in the normalized text corpus; optionally, the classification includes determining which punctuation marks are mid-sentence punctuation and which are sentence-ending punctuation; optionally, normalizing the text corpus may include at least one of the following: removing punctuation marks outside the target recognition range, such as dashes, book-title marks and other uncommon punctuation; normalizing non-standard words, such as converting Roman numerals to decimal notation; and converting non-ASCII characters to the nearest ASCII equivalents.
  • target punctuation refers to punctuation within the target recognition range. For example, the target recognition range covers common punctuation such as commas, enumeration commas and periods, whereas dashes and book-title marks are uncommon and fall outside the target recognition range.
  • S262: feed the classified text corpus into a Long Short-Term Memory (LSTM) neural network for context feature extraction training to obtain a discriminant model; the LSTM neural network is a special kind of RNN model.
  • the training method used for the DNN classifier is to pre-classify punctuation in the training text with a general DNN classifier, then feed the processed training text into an LSTM network structure for context feature extraction and further training, obtaining more accurate punctuation judgments. The final result is a DNN classifier that performs fine-grained punctuation classification on the text bearing the first symbols described above.
  • solutions for automatically adding punctuation during speech recognition fall into three categories. The first uses the text content generated by speech recognition: all words in the text are converted into word vectors through word2vector, the word vectors are fed into a deep neural network (DNN) that computes the probability of punctuation appearing after each word, and the highest-probability insertion is taken as the final scheme.
  • word2vector is an algorithm that converts the words in the corpus into vectors so that various calculations can be performed on the basis of the word vectors.
  • This type of method is separated from the speech signal itself, and determines the addition of punctuation marks based solely on the text content, without taking into account information such as silence at the speech level, which may result in some relatively long terms and proper nouns being separated by punctuation.
  • the second type uses the speech information itself: whether the duration of a silence in the speech signal exceeds a threshold decides whether punctuation should be added at that position; if so, the speech information before and after the position is fed into a neural-network-trained classifier to decide which punctuation mark to add.
  • the third category works with the language model of speech recognition itself, modeling the gaps between words in the language model and exploiting the properties of weighted finite-state transducers to add punctuation automatically. All three categories of punctuation insertion during speech recognition have certain limitations.
  • the acoustic model, the pronunciation dictionary and the language model together constitute the main body of the speech recognition system.
  • silence words are introduced into the pronunciation dictionary; the timestamp information is vectorized into the feature vector and converted into a WFST for computation; and, using the properties of the WFST, unvoiced phonemes are subdivided into three categories in the language model.
  • a searchable state space is formed by the pronunciation dictionary, the language model and the acoustic model. Therefore, the pronunciation dictionary, the language model and the acoustic model can be combined to realize the preliminary addition of punctuation marks.
  • the discriminator extracts speech features in the speech information, and the speech features include the duration of the unvoiced speech segment and its corresponding timestamp.
  • while decoding the speech data stream, the speech decoder determines, according to the state space, the places in the stream that need punctuation marks, and at the same time determines which of the unvoiced speech segments extracted by the discriminator correspond to punctuation marks; the two are linked by their timestamps, and wherever they coincide a first symbol is marked.
  • the DNN classifier then further discriminates the first symbols and, after optimization, outputs text information containing the second symbols. The function of automatically adding punctuation marks to speech recognition results is thus realized, and the accuracy of punctuation recognition is improved.
  • in Embodiment 1 of the present application, under the premise of hardly affecting speech recognition accuracy, the approach starts simultaneously from three parts, namely the speech features of the speech signal, the language model and the DNN classifier, to solve the problem of automatically adding punctuation marks to speech recognition results: after punctuation is preliminarily assigned from the speech features and the language model, the DNN classifier further optimizes it and outputs text information containing the optimized punctuation. The accuracy of punctuation insertion is improved, which improves the readability and legibility of the text output by speech recognition and can improve the user experience.
  • the voice recognition device 300 includes, but is not limited to, a voice recorder, an audio conference terminal, an intelligent conference recording device, or an intelligent electronic device with a voice recognition function, which is not limited in the second embodiment.
  • the voice recognition device 300 includes:
  • the discriminator 310 is used to discriminate and extract the speech feature of the speech signal, and obtain the speech data stream;
  • the speech decoder 320 is used for decoding the speech data stream and determining the first symbol according to a searchable state space and the speech feature;
  • the state space includes a pronunciation dictionary, an acoustic model and a language model;
  • the pronunciation dictionary contains a set of words and their corresponding pronunciations;
  • the deep neural network classifier 330 is configured to further discriminate the first symbol according to the context, and output text information marked with the second symbol; wherein, the deep neural network classifier is a pre-trained fast deep neural network classifier.
  • the speech feature includes the duration of the unvoiced speech segment and the timestamp of the unvoiced speech segment.
  • the discriminator 310 is specifically configured to: after receiving the speech information, determine the duration of the unvoiced speech segment using human-voice recognition technology; vectorize the timestamp information into the feature vector of the unvoiced speech segment and convert it into a weighted finite-state machine for computation; and obtain the speech data stream.
  • the speech decoder 320 is specifically configured to: decode the speech data stream; determine, according to the duration of the unvoiced speech segment, whether the segment is a punctuation mark or a meaningless silence segment; identify the preliminary symbols in the speech data stream according to the state space; and, according to the timestamps, confirm that the punctuation marks in the preliminary symbols corresponding to punctuation marks of unvoiced speech segments are the first symbols.
  • the pronunciation dictionary further includes the following three types of silent words: the first silent word corresponds to mid-sentence punctuation, the second silent word corresponds to sentence-ending punctuation, and the third silent word corresponds to meaningless silent words.
  • the language model is a pre-trained model.
  • the speech recognition device 300 further includes: a language model training unit 340, configured to count, based on the normalized text corpus, the M most frequent words and the punctuation marks within the N target recognition ranges; construct a training vocabulary from the M most frequent words and the punctuation marks within the N target recognition ranges, M and N both being positive integers greater than or equal to 1; and train the language model according to the training vocabulary.
  • the speech recognition device 300 further includes: a deep neural network classifier training unit 350, configured to classify the target punctuation marks in the normalized text corpus and to feed the classified text corpus into a long short-term memory neural network for context feature extraction training to obtain a discriminant model.
  • FIG. 7 is a schematic structural diagram of a speech recognition apparatus 400 according to Embodiment 3 of the present application.
  • the speech recognition apparatus 400 includes: a processor 410, a memory 420 and a communication interface 430.
  • the processor 410, the memory 420 and the communication interface 430 are connected to each other through a bus system.
  • the processor 410 invokes the program in the memory 420, executes any one of the speech analysis methods provided in Embodiment 1 above, and outputs the result through the communication interface 430, wirelessly or by wire, to other electronic devices that can display text information, such as printers, computers and smart devices.
  • the processor 410 may be an independent component, or may be a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one digital signal processor (DSP), or at least one field-programmable gate array (FPGA), etc.
  • the memory 420 is a computer-readable storage medium on which programs executable on the processor 410 are stored.
  • the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • when implemented in software, the functions may be implemented by a processor executing software instructions.
  • the software instructions may consist of corresponding software modules.
  • the software modules may be stored in a computer-readable storage medium, which may be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that includes an integration of one or more available media.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc (DVD)), or semiconductor media (e.g., Solid State Disk (SSD)), etc.
  • the computer-readable storage medium includes but is not limited to random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), Erasable Programmable Read-Only Memory (Erasable Programmable ROM, EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disks, removable hard disks, compact discs (CD-ROM), or any other form of storage medium known in the art.
  • An exemplary computer-readable storage medium is coupled to the processor, such that the processor can read information from, and write information to, the computer-readable storage medium.
  • the computer-readable storage medium can also be an integral part of the processor.
  • the processor and computer-readable storage medium may reside in an ASIC. Additionally, the ASIC may reside in access network equipment, target network equipment or core network equipment.
  • the processor and the computer-readable storage medium may also exist as discrete components in the access network device, the target network device or the core network device. When implemented in software, it can also be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer program instructions may be stored in a computer-readable storage medium as described above, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave).

Abstract

A method for adding punctuation marks in speech recognition and a speech recognition device. The method comprises: a discriminator discriminates and extracts speech features of a speech signal and obtains a speech data stream; a speech decoder decodes the speech data stream and determines first symbols according to a searchable state space and the speech features; the state space comprises a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations; a deep neural network classifier further discriminates the first symbols according to context and outputs text information marked with second symbols, wherein the deep neural network classifier is a pre-trained fast deep neural network classifier. The solution can improve the accuracy of punctuation insertion in speech recognition.

Description

Method for adding punctuation marks in speech recognition and speech recognition device
Technical Field
The present invention relates to the technical field of audio, and in particular to the technical field of speech recognition.
Background Art
With the rapid development of communication and information processing technologies and the growing abundance of computing power, speech recognition is being applied ever more widely, for example in simultaneous interpretation, speech transcription, human-computer interaction, and voice control.
However, traditional speech recognition technology models and analyzes only the actual text content and its corresponding sounds; punctuation marks, unlike voiced text, are difficult to model and are therefore usually ignored. As a result, the direct output of traditional speech recognition is usually a long string of characters or words. As the speech grows longer, so does the difficulty of reading the text. The output of a speech recognition system therefore needs punctuation added automatically, to improve intelligibility and efficiency.
Summary of the Invention
The present application provides a method for adding punctuation marks in speech recognition, and a speech recognition device, which can automatically add punctuation marks to the text information output by a speech recognition device.
The present application provides the following technical solutions:
In one aspect, a method for adding punctuation marks in speech recognition is provided, comprising: a discriminator discriminates and extracts speech features of a speech signal and obtains a speech data stream; a speech decoder decodes the speech data stream and determines first symbols according to a searchable state space and the speech features; the state space comprises a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations; a deep neural network classifier further discriminates the first symbols according to context and outputs text information marked with second symbols; the deep neural network classifier is a pre-trained fast deep neural network classifier.
In another aspect, a speech recognition device is provided, comprising: a discriminator for discriminating and extracting speech features of a speech signal and obtaining a speech data stream; a speech decoder for decoding the speech data stream and determining first symbols according to a searchable state space and the speech features, the state space comprising a pronunciation dictionary, an acoustic model and a language model, and the pronunciation dictionary containing a set of words and their corresponding pronunciations; and a deep neural network classifier for further discriminating the first symbols according to context and outputting text information marked with second symbols, the deep neural network classifier being a pre-trained fast deep neural network classifier.
In yet another aspect, a speech recognition apparatus is provided, comprising a processor and a memory; the processor invokes a program in the memory to execute any one of the above methods for adding punctuation marks in speech recognition.
In yet another aspect, a computer-readable storage medium is provided, on which a program of a method for adding punctuation marks in speech recognition is stored; when executed by a processor, the program implements any one of the above methods for adding punctuation marks in speech recognition.
The beneficial effect of the present application is that it starts simultaneously from three parts, namely the speech features of the speech signal, the language model and the DNN classifier, to solve the problem of automatically adding punctuation marks to speech recognition results: after punctuation is preliminarily assigned from the speech features and the language model, the DNN classifier further optimizes it and outputs text information containing the optimized punctuation marks. The accuracy of punctuation insertion is improved, which in turn improves the readability and legibility of the text output by speech recognition and enhances the user experience.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the system architecture to which an embodiment of the present application is applied.
FIG. 2 is a flowchart of a method for adding punctuation marks in speech recognition provided by Embodiment 1 of the present application.
FIG. 3 is a flowchart of training the pronunciation dictionary in Embodiment 1 of the present application.
FIG. 4 is a flowchart of training the language model in Embodiment 1 of the present application.
FIG. 5 is a flowchart of training the DNN classifier in Embodiment 1 of the present application.
FIG. 6 is a schematic block diagram of a speech recognition device provided by Embodiment 2 of the present application.
FIG. 7 is a schematic structural diagram of a speech recognition apparatus provided by Embodiment 3 of the present application.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described here serve only to explain the present application and are not intended to limit it. The present application may be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present application. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the present application.
It should be understood that the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
Please refer to FIG. 1, which is a schematic diagram of the speech recognition system architecture 100 to which an embodiment of the present application is applied. The speech recognition system architecture 100 includes: an acoustic model 110, a pronunciation dictionary 120, a language model 130, a discriminator 140 that analyzes and captures the characteristics of the signal itself, a speech decoder 150, and a DNN (Deep Neural Network) classifier 160. The speech recognition system architecture 100 covers the complete speech recognition process.
The acoustic model 110, the pronunciation dictionary 120 and the language model 130 together constitute the main body of the speech recognition system. The pronunciation dictionary 120 contains the set of words that the speech recognition system architecture 100 can process and indicates their pronunciations. The mapping relationship between the modeling units of the acoustic model 110 and of the language model 130 is obtained through the pronunciation dictionary 120, thereby linking the acoustic model 110 and the language model 130; together with the pronunciation dictionary 120 they form a searchable state space in which the speech decoder 150 performs its decoding work.
The input speech signal passes through the discriminator 140, which discriminates and extracts the speech features of the speech signal and obtains a speech data stream. The speech decoder 150 decodes the speech data stream and determines the first symbols of the speech information according to the state space and the speech features. The DNN classifier 160 is a pre-trained fast DNN classifier; it further discriminates the punctuation marks preliminarily marked by the speech decoder 150 and, after optimizing the first symbols, outputs text information containing the second symbols. Specifically, the DNN classifier 160 further discriminates the first symbols from the speech decoder 150 by combining context-recognized text feature vectors with speech feature vectors. The function of automatically adding punctuation marks to speech recognition results is thus realized, and the accuracy of punctuation recognition is improved.
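For orientation, the following minimal Python sketch traces the FIG. 1 data flow. The class and method names (Discriminator, SpeechDecoder, PunctuationDNN, extract, decode, refine) and the feature layout are illustrative assumptions for this description only; the patent does not define such an interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SpeechFeatures:
    frames: List[List[float]]             # acoustic feature vectors per frame
    silences: List[Tuple[float, float]]   # (timestamp_s, duration_s) of unvoiced segments

def recognize_with_punctuation(signal, discriminator, decoder, dnn_classifier) -> str:
    # Discriminator 140: analyze the raw signal, extract features and silences.
    features = discriminator.extract(signal)
    # Speech decoder 150: search the state space (pronunciation dictionary +
    # acoustic model + language model) and emit text carrying first symbols
    # such as "^" and "^^".
    text_with_first_symbols = decoder.decode(features)
    # DNN classifier 160: refine the first symbols using context and emit
    # the final text carrying second symbols (concrete punctuation).
    return dnn_classifier.refine(text_with_first_symbols, features)
```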
The embodiments of the present application can be applied to various apparatuses with a speech recognition function, for example: voice recorders, audio conference terminals, intelligent conference recording devices, or intelligent electronic equipment with a voice recognition function. The technical solutions of the present application are set forth below through specific embodiments.
Embodiment 1
Please refer to FIG. 2, which shows a method for adding punctuation marks in speech recognition provided by Embodiment 1 of the present application. The method includes:
S210: a discriminator discriminates and extracts the speech features of a speech signal and obtains a speech data stream; the discriminator analyzes the characteristics of the signal itself; optionally, the speech features extracted by the discriminator include the durations of unvoiced speech segments and the timestamps of those segments;
S220: a speech decoder decodes the speech data stream and determines first symbols according to a searchable state space and the speech features; the state space includes a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations;
S230: a deep neural network classifier further discriminates the first symbols according to context and outputs text information marked with second symbols; the deep neural network classifier is a pre-trained fast deep neural network classifier. Optionally, the contextual text features and the speech features are both expressed as feature vectors; S230 may then specifically be: the deep neural network classifier classifies the first symbols according to context-recognized text feature vectors and speech feature vectors.
Optionally, the DNN classifier is a separately trained text classifier whose input is speech-recognized text bearing the first symbols and whose output is text bearing the second symbols.
Optionally, S210, in which the discriminator discriminates and extracts the speech features of the speech signal and obtains the speech data stream, includes:
S211: after receiving the speech information, determining the durations of unvoiced speech segments using human-voice recognition technology;
S212: establishing the timestamp of each unvoiced speech segment; specifically, the timestamp information is vectorized into the feature vector of the unvoiced speech segment and converted into a WFST (Weighted Finite-State Transducer) for computation (an illustrative sketch follows the next paragraph).
In this way, the process of external logical judgment can be omitted, which helps simplify the punctuation-adding flow and the computation. Meanwhile, since the speech duration information is added to the timestamps, the information judged by speech recognition technology within consecutive timestamps can assist the aforementioned acoustic model, pronunciation dictionary and language model in their judgments, which further improves the accuracy of punctuation insertion during speech recognition.
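As an illustration of S211 and S212, the sketch below extracts (timestamp, duration) pairs for unvoiced stretches of audio. The patent does not specify the human-voice recognition technology; a simple frame-energy threshold stands in for it here, so the frame size and threshold are assumptions.

```python
import numpy as np

def unvoiced_segments(samples: np.ndarray, sr: int,
                      frame_ms: int = 25, energy_thresh: float = 1e-4):
    """Return (timestamp_s, duration_s) for each unvoiced stretch."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame
    voiced = [float(np.mean(samples[i * frame:(i + 1) * frame] ** 2)) > energy_thresh
              for i in range(n_frames)]
    segments, start = [], None
    for i, v in enumerate(voiced + [True]):    # sentinel closes a trailing run
        if not v and start is None:
            start = i                          # an unvoiced run begins
        elif v and start is not None:
            t0 = start * frame / sr            # timestamp of the segment
            segments.append((t0, (i - start) * frame / sr))
            start = None
    return segments
```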
Optionally, determining the first symbols according to a searchable state space and the speech features in S220 includes:
S221: determining, according to the duration of an unvoiced speech segment, whether the segment is a punctuation mark or a meaningless silence segment;
S222: identifying preliminary symbols in the speech data stream according to the state space;
S223: confirming, according to the timestamps, that the punctuation marks in the preliminary symbols that correspond to punctuation marks of unvoiced speech segments are the first symbols. That is, a punctuation mark is retained only when the speech information corresponding to a timestamp is recognized as punctuation according to the state space and its duration also indicates that it is punctuation.
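The retention rule of S221 to S223 can be read as an intersection of two evidence sources: the state-space search and the discriminator's silence measurements. The sketch below is an assumed formalization; the overlap test and the minimum pause length are illustrative, not values given in the patent.

```python
def confirm_first_symbols(preliminary, silences, min_pause: float = 0.15):
    """preliminary: (timestamp_s, symbol) pairs from the state-space search;
    silences: (timestamp_s, duration_s) pairs from the discriminator.
    A symbol survives only where both sources agree (S223)."""
    confirmed = []
    for ts, symbol in preliminary:
        for s0, dur in silences:
            # S221: the pause must be long enough to count as punctuation.
            # S223: the preliminary symbol must fall inside the pause.
            if dur >= min_pause and s0 <= ts <= s0 + dur:
                confirmed.append((ts, symbol))
                break
    return confirmed
```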
Optionally, the pronunciation dictionary includes silence words: a first silence word corresponds to mid-sentence punctuation, a second silence word corresponds to sentence-ending punctuation, and a third silence word corresponds to meaningless silence. The pronunciation dictionary contains the set of words that the speech recognition device can process and indicates their pronunciations. In general, the pronunciation dictionary in speech recognition technology contains no punctuation marks, because punctuation is not pronounced. In Embodiment 1 of the present application, so that punctuation can be added automatically during speech recognition, the pronunciation dictionary is improved: the words corresponding to unvoiced speech segments are defined as silence words, the silence words are divided into the above three categories, and the three categories of silence words correspond to different symbols in the pronunciation dictionary.
Optionally, the first symbols are symbols used to indicate the position and type of punctuation in the speech information, where different first symbols correspond to different punctuation types. For example, the first symbols can mark the three categories of silence words from the pronunciation dictionary in the speech information, e.g., "^^" represents sentence-ending punctuation, "^" represents mid-sentence punctuation, and meaningless silence words have no special representation. The second symbols are the concrete punctuation marks; for example, in the final output text, "^^" is replaced with a period and "^" with a comma.
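The replacement of first symbols by second symbols can then be as small as the sketch below, which follows the "^^"/"^" example above; mapping the two classes to a period and a comma mirrors the example in the text, while the DNN classifier would pick finer-grained marks in practice.

```python
def to_second_symbols(text: str) -> str:
    # Replace "^^" before "^" so a sentence-ending marker is not consumed
    # as two mid-sentence markers.
    return text.replace("^^", "。").replace("^", ",")

print(to_second_symbols("今天天气很好^我们出去走走^^"))
# -> 今天天气很好,我们出去走走。
```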
Optionally, the mapping relationship between the modeling units of the acoustic model and of the language model is obtained through the pronunciation dictionary, thereby connecting the acoustic model and the language model; together with the acoustic model and the language model, the dictionary forms a searchable state space used by the decoder for its decoding work.
Optionally, referring to FIG. 3, the method for adding punctuation marks in speech recognition further includes S240, training the pronunciation dictionary, which specifically includes:
S241: preparing the CMUdict (Carnegie Mellon University dictionary) pronunciation dictionary;
S242: training a G2P (Grapheme-to-Phoneme) model based on the CMUdict pronunciation dictionary;
S243: using the trained G2P model to automatically generate pronunciations for words that are in the training vocabulary of the language model but not in the CMUdict pronunciation dictionary;
S244: defining the pronunciations of the silence words and adding these words to the CMUdict pronunciation dictionary to form the pronunciation dictionary described herein; the silence words correspond to unvoiced speech segments and fall into three categories: mid-sentence punctuation, sentence-ending punctuation, and meaningless silence segments.
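A sketch of S244 follows. The silence-word spellings (<sil-mid>, <sil-end>, <sil-none>) and the phone label SIL are illustrative assumptions; CMUdict defines neither, and the patent does not disclose its internal labels.

```python
# Three categories of silence words, mapped to a single silence phone.
SILENCE_WORDS = {
    "<sil-mid>":  "SIL",   # first silence word: mid-sentence punctuation
    "<sil-end>":  "SIL",   # second silence word: sentence-ending punctuation
    "<sil-none>": "SIL",   # third silence word: meaningless silence
}

def extend_lexicon(cmudict_path: str, out_path: str) -> None:
    """Copy a CMUdict-style lexicon and append the silence-word entries."""
    with open(cmudict_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write(src.read())
        for word, phones in SILENCE_WORDS.items():
            dst.write(f"{word} {phones}\n")
```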
Optionally, the language model is a pre-trained model. Referring to FIG. 4, the method for adding punctuation marks in speech recognition further includes S250, training the language model, which specifically includes:
S251: based on a normalized text corpus, counting the M most frequent words and the punctuation marks within the N target recognition ranges; optionally, normalizing the text corpus may include at least one of the following: deleting punctuation outside the target recognition range, such as dashes, book-title marks (《》) and other uncommon punctuation; normalizing non-standard words, for example converting Roman numerals to decimal notation; converting non-ASCII characters to the nearest ASCII equivalents; and splitting the original text and correcting possible erroneous normalization (a normalization sketch follows the note after S253);
S252: constructing a training vocabulary from the M most frequent words and the punctuation marks within the N target recognition ranges, where M and N are both positive integers greater than or equal to 1;
S253: training the language model according to the training vocabulary.
It should be noted that steps S251 to S253 do not process speech information; rather, during the construction and training of the language model, they use existing information and corpora to train it.
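The following sketch covers S251 and S252 under two stated assumptions: the corpus is already word-segmented into whitespace-delimited tokens, and the target punctuation set shown is only an example of the N target marks.

```python
import re
from collections import Counter

TARGET_PUNCT = {",", "。", "、"}   # assumed example of the N target marks

def normalize(line: str) -> str:
    # Delete punctuation outside the target range (dashes, book-title marks);
    # non-standard-word normalization and ASCII folding would go here too.
    return re.sub(r"[—《》]", "", line)

def build_vocab(corpus_lines, m: int):
    """Return the training vocabulary of S252: top-M words plus target marks."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(normalize(line).split())
    top_m = [word for word, _ in counts.most_common(m)]
    return top_m + sorted(TARGET_PUNCT)
```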
In Embodiment 1 of the present application, definitions of punctuation marks corresponding to the pronunciation dictionary are introduced into an existing N-Gram language model. The prediction of punctuation is defined more as a simple statistical prediction of punctuation from contextual text information; for example, a punctuation mark usually precedes the subject words "你" (you), "我" (I) and "他" (he).
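That statistical view amounts to treating punctuation marks as ordinary tokens in the n-gram counts; the toy counter below only illustrates the idea of measuring how often a punctuation token immediately precedes a given word, and is not the patent's language model.

```python
from collections import Counter, defaultdict

def punct_before_word(tokenized_sentences, punct=frozenset((",", "。"))):
    """Count, for each word, which punctuation tokens directly precede it."""
    before = defaultdict(Counter)
    for sent in tokenized_sentences:
        for prev, cur in zip(sent, sent[1:]):
            if prev in punct:
                before[cur][prev] += 1
    return before

stats = punct_before_word([["我", "来", "了", ",", "你", "呢"]])
print(stats["你"])   # Counter({',': 1})
```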
Referring to FIG. 5, the method for adding punctuation marks in speech recognition further includes S260, training the deep neural network classifier, which specifically includes:
S261: classifying the target punctuation marks in the normalized text corpus; optionally, the classification includes determining which punctuation marks are mid-sentence punctuation and which are sentence-ending punctuation; optionally, normalizing the text corpus may include at least one of the following: deleting punctuation outside the target recognition range, such as dashes, book-title marks and other uncommon punctuation; normalizing non-standard words, for example converting Roman numerals to decimal notation; and converting non-ASCII characters to the nearest ASCII equivalents. Target punctuation refers to punctuation within the target recognition range; for example, the target recognition range covers common punctuation such as commas, enumeration commas (、) and periods, whereas dashes and book-title marks are uncommon and fall outside it.
S262: feeding the classified text corpus into a Long Short-Term Memory (LSTM) neural network for context feature extraction training to obtain a discriminant model. The LSTM neural network is a special kind of RNN model.
The training method used for the DNN classifier is to pre-classify punctuation in the training text with a general DNN classifier, then feed the processed training text into an LSTM network structure for context feature extraction and further training, obtaining more accurate punctuation judgments. The final result is a DNN classifier that performs fine-grained punctuation classification on the text bearing the first symbols described above.
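A minimal PyTorch sketch of such a per-token punctuation classifier is shown below. The layer sizes, the single bidirectional LSTM layer and the three-class output (none, mid-sentence, sentence-ending) are assumptions for illustration; the patent does not disclose network dimensions.

```python
import torch
import torch.nn as nn

class PunctLSTM(nn.Module):
    def __init__(self, vocab_size: int, emb: int = 128,
                 hidden: int = 256, n_classes: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)   # (batch, seq, emb)
        h, _ = self.lstm(x)         # (batch, seq, 2 * hidden)
        return self.head(h)         # per-token punctuation-class logits

model = PunctLSTM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (1, 12)))   # shape: (1, 12, 3)
```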
Solutions for automatically adding punctuation during speech recognition fall into three categories. The first uses the text content generated by speech recognition: all words in the text are converted into word vectors by word2vector, the word vectors are fed into a deep neural network (DNN) that computes the probability of punctuation appearing after each word, and the highest-probability insertion is taken as the final scheme. Here, word2vector is an algorithm that converts the words of a corpus into vectors so that subsequent computations can be carried out on the word vectors. This kind of method is divorced from the speech signal itself and determines punctuation purely from the text content, taking no account of speech-level information such as silence; as a result, relatively long terms and proper nouns may be split apart by punctuation. At the same time, the complexity of such a system increases resource and time consumption, and when an update is needed, for example adding punctuation marks or corpus data, retraining the neural network classification model also takes longer. The second category uses the speech information itself: whether the duration of a silence in the speech signal exceeds a threshold decides whether punctuation should be added at that position; if so, the speech information before and after the position is fed into a neural-network-trained classifier to decide which punctuation mark to add. Judging by silence duration alone, this kind of method cannot cope with a speaker who pauses abruptly mid-sentence out of hesitation, nor with fast speech, and easily inserts punctuation erroneously. The third category works with the language model of speech recognition itself, modeling the gaps between words in the language model and exploiting the properties of weighted finite-state transducers to add punctuation automatically. All three categories of punctuation insertion during speech recognition have certain limitations.
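For the first category above, the word2vector front end can be sketched with gensim's Word2Vec; the toy corpus and the vector size are placeholders, and the downstream DNN that scores punctuation probabilities is omitted.

```python
from gensim.models import Word2Vec

# Toy pre-segmented corpus; a real system would use large text corpora.
corpus = [["我", "来", "了"], ["你", "好"], ["他", "走", "了"]]
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)
vec = w2v.wv["我"]   # a 50-dimensional word vector
# These vectors would then feed a DNN that outputs, for each word, the
# probability that a punctuation mark follows it, as described above.
```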
In Embodiment 1 of the present application, the acoustic model, the pronunciation dictionary and the language model together constitute the main body of the speech recognition system. Silence words are introduced into the pronunciation dictionary; the timestamp information is vectorized into the feature vector and converted into a WFST for computation; and, exploiting the properties of the WFST, unvoiced phonemes are subdivided into three categories in the language model. The pronunciation dictionary, the language model and the acoustic model together form a searchable state space, so that the three can jointly realize the preliminary insertion of punctuation marks. The discriminator extracts speech features from the speech information, including the durations of unvoiced speech segments and their corresponding timestamps. While decoding the acquired speech data stream, the speech decoder determines, according to the state space, the places in the stream that need punctuation marks, and at the same time determines which of the unvoiced segments extracted by the discriminator correspond to punctuation marks; the two are linked by their timestamps, and wherever they coincide a first symbol is marked. Thus, after the speech decoder has decoded the speech data, first symbols can be added at the speech pauses identified by the pronunciation dictionary and the language model in the state space. The DNN classifier then further discriminates the first symbols and, after optimization, outputs text information containing the second symbols. The function of automatically adding punctuation marks to speech recognition results is thereby realized, and the accuracy of punctuation recognition is improved.
In Embodiment 1 of the present application, under the premise of hardly affecting speech recognition accuracy, the approach starts simultaneously from three parts, namely the speech features of the speech signal, the language model and the DNN classifier, to solve the problem of automatically adding punctuation marks to speech recognition results: after punctuation is preliminarily assigned from the speech features and the language model, the DNN classifier further optimizes it and outputs text information containing the optimized punctuation. This improves the accuracy of punctuation insertion and thus the readability and legibility of the text output by speech recognition, which can improve the user experience.
Embodiment 2
Please refer to FIG. 6, which shows a speech recognition device 300 provided by Embodiment 2 of the present application. The speech recognition device 300 may be, but is not limited to, a voice recorder, an audio conference terminal, an intelligent conference recording device, or intelligent electronic equipment with a voice recognition function; Embodiment 2 places no limitation on this. The speech recognition device 300 includes:
a discriminator 310 for discriminating and extracting the speech features of a speech signal and obtaining a speech data stream;
a speech decoder 320 for decoding the speech data stream and determining first symbols according to a searchable state space and the speech features, the state space including a pronunciation dictionary, an acoustic model and a language model, and the pronunciation dictionary containing a set of words and their corresponding pronunciations; and
a deep neural network classifier 330 for further discriminating the first symbols according to context and outputting text information marked with second symbols, the deep neural network classifier being a pre-trained fast deep neural network classifier.
Optionally, the speech features include the durations of unvoiced speech segments and the timestamps of those segments.
Optionally, the discriminator 310 is specifically configured to: after receiving the speech information, determine the durations of unvoiced speech segments using human-voice recognition technology; vectorize the timestamp information into the feature vectors of the unvoiced speech segments and convert them into a weighted finite-state machine for computation; and obtain the speech data stream.
Optionally, the speech decoder 320 is specifically configured to: decode the speech data stream; determine, according to the duration of an unvoiced speech segment, whether the segment is a punctuation mark or a meaningless silence segment; identify preliminary symbols in the speech data stream according to the state space; and confirm, according to the timestamps, that the punctuation marks in the preliminary symbols corresponding to punctuation marks of unvoiced speech segments are the first symbols.
Optionally, the pronunciation dictionary further includes the following three categories of silence words: a first silence word corresponding to mid-sentence punctuation, a second silence word corresponding to sentence-ending punctuation, and a third silence word corresponding to meaningless silence.
Optionally, the language model is a pre-trained model, and the speech recognition device 300 further includes: a language model training unit 340, configured to count, based on a normalized text corpus, the M most frequent words and the punctuation marks within the N target recognition ranges; construct a training vocabulary from the M most frequent words and the punctuation marks within the N target recognition ranges, M and N both being positive integers greater than or equal to 1; and train the language model according to the training vocabulary.
Optionally, the speech recognition device 300 further includes: a deep neural network classifier training unit 350, configured to classify the target punctuation marks in the normalized text corpus and to feed the classified text corpus into a long short-term memory neural network for context feature extraction training to obtain a discriminant model.
For anything not covered exhaustively in Embodiment 2, and for optimizations and specific examples, refer to the same or corresponding parts of Embodiment 1 above; they are not repeated here.
Embodiment 3
Please refer to FIG. 7, a schematic structural diagram of a speech recognition apparatus 400 provided by Embodiment 3 of the present application. The speech recognition apparatus 400 includes a processor 410, a memory 420 and a communication interface 430, which are communicatively connected to one another through a bus system. The processor 410 invokes a program in the memory 420, executes any one of the speech analysis methods provided in Embodiment 1 above, and outputs the result through the communication interface 430, wirelessly or by wire, to other electronic devices that can display text information, such as printers, computers and smart electronic devices.
The processor 410 may be an independent component or a collective term for multiple processing elements. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above methods, such as at least one digital signal processor (DSP) or at least one field-programmable gate array (FPGA). The memory 420 is a computer-readable storage medium on which a program executable on the processor 410 is stored.
For anything not covered exhaustively in Embodiment 3, refer to the same or corresponding parts of Embodiment 1 above; it is not repeated here.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules. The software modules may be stored in a computer-readable storage medium, which may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Video Disc (DVD)) or semiconductor media (e.g., Solid State Disk (SSD)), etc. The computer-readable storage medium includes but is not limited to random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, removable hard disks, compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary computer-readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the computer-readable storage medium; of course, the computer-readable storage medium may also be an integral part of the processor. The processor and the computer-readable storage medium may reside in an ASIC, and the ASIC may reside in access network equipment, target network equipment or core network equipment; of course, the processor and the computer-readable storage medium may also exist as discrete components in the access network equipment, target network equipment or core network equipment. When implemented in software, the functions may also be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer or a chip, which may contain a processor, the flows or functions described in the specific embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer program instructions may be stored in the above computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
The above embodiments illustrate but do not limit the present invention, and those skilled in the art can devise multiple alternative examples within the scope of the claims. Those skilled in the art should appreciate that the present application is not limited to the precise structures described above and shown in the drawings, and that appropriate adjustments, modifications, equivalent substitutions and improvements may be made to specific implementations without departing from the scope of the invention as defined by the appended claims. Therefore, any modifications and variations made in accordance with the concept and principles of the present invention fall within the scope of the invention as defined by the appended claims.

Claims (17)

  1. A method for adding punctuation marks in speech recognition, characterized in that the method comprises:
    a discriminator discriminates and extracts speech features of a speech signal and obtains a speech data stream;
    a speech decoder decodes the speech data stream and determines first symbols according to a searchable state space and the speech features; the state space comprises a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations;
    a deep neural network classifier further discriminates the first symbols according to context and outputs text information marked with second symbols; wherein the deep neural network classifier is a pre-trained fast deep neural network classifier.
  2. The method according to claim 1, characterized in that the speech features comprise the duration of an unvoiced speech segment and the timestamp of the unvoiced speech segment.
  3. The method according to claim 2, characterized in that the discriminator discriminating and extracting the speech features of the speech signal comprises:
    after receiving the speech information, determining the duration of the unvoiced speech segment using human-voice recognition technology;
    vectorizing the timestamp information into the feature vector of the unvoiced speech segment and converting it into a weighted finite-state machine for computation.
  4. The method according to claim 2, characterized in that determining the first symbols according to a searchable state space and the speech features comprises:
    determining, according to the duration of the unvoiced speech segment, whether the unvoiced speech segment is a punctuation mark or a meaningless silence segment;
    identifying preliminary symbols in the speech data stream according to the state space;
    confirming, according to the timestamp, that the punctuation mark in the preliminary symbols corresponding to the punctuation mark of the unvoiced speech segment is the first symbol.
  5. The method according to claim 1, characterized in that the pronunciation dictionary further comprises the following three categories of silence words: a first silence word corresponding to mid-sentence punctuation, a second silence word corresponding to sentence-ending punctuation, and a third silence word corresponding to meaningless silence; the first symbols are used to mark the silence words in the speech information.
  6. The method according to claim 1, characterized in that the language model is a pre-trained model obtained by the following training method:
    based on a normalized text corpus, counting the M most frequent words and the punctuation marks within the N target recognition ranges;
    constructing a training vocabulary from the M most frequent words and the punctuation marks within the N target recognition ranges, M and N both being positive integers greater than or equal to 1;
    training the language model according to the training vocabulary.
  7. The method according to claim 1, characterized in that the deep neural network classifier is obtained through the following training:
    classifying target punctuation marks in a normalized text corpus;
    feeding the classified text corpus into a long short-term memory neural network for context feature extraction training to obtain a discriminant model.
  8. The method according to any one of claims 1 to 7, characterized in that there is a mapping relationship between the pronunciation dictionary, the modeling unit of the acoustic model and the modeling unit of the language model.
  9. A speech recognition device, characterized in that the speech recognition device comprises:
    a discriminator for discriminating and extracting speech features of a speech signal and obtaining a speech data stream;
    a speech decoder for decoding the speech data stream and determining first symbols according to a searchable state space and the speech features; the state space comprises a pronunciation dictionary, an acoustic model and a language model; the pronunciation dictionary contains a set of words and their corresponding pronunciations;
    a deep neural network classifier for further discriminating the first symbols according to context and outputting text information marked with second symbols; wherein the deep neural network classifier is a pre-trained fast deep neural network classifier.
  10. The speech recognition device according to claim 9, characterized in that the speech features comprise the duration of an unvoiced speech segment and the timestamp of the unvoiced speech segment.
  11. The speech recognition device according to claim 10, characterized in that the discriminator is specifically configured to: after receiving the speech information, determine the duration of the unvoiced speech segment using human-voice recognition technology; vectorize the timestamp information into the feature vector of the unvoiced speech segment and convert it into a weighted finite-state machine for computation; and obtain the speech data stream.
  12. The speech recognition device according to claim 10, characterized in that the speech decoder is specifically configured to: decode the speech data stream; determine, according to the duration of the unvoiced speech segment, whether the unvoiced speech segment is a punctuation mark or a meaningless silence segment; identify preliminary symbols in the speech data stream according to the state space; and confirm, according to the timestamp, that the punctuation mark in the preliminary symbols corresponding to the punctuation mark of the unvoiced speech segment is the first symbol.
  13. The speech recognition device according to claim 9, characterized in that the pronunciation dictionary further comprises the following three categories of silence words: a first silence word corresponding to mid-sentence punctuation, a second silence word corresponding to sentence-ending punctuation, and a third silence word corresponding to meaningless silence; the first symbols are used to mark the silence words in the speech information.
  14. The speech recognition device according to claim 9, characterized in that the speech recognition device further comprises:
    a language model training unit for: counting, based on a normalized text corpus, the M most frequent words and the punctuation marks within the N target recognition ranges; constructing a training vocabulary from the M most frequent words and the punctuation marks within the N target recognition ranges, M and N both being positive integers greater than or equal to 1; and training the language model according to the training vocabulary.
  15. The speech recognition device according to claim 9, characterized in that the speech recognition device further comprises:
    a deep neural network classifier training unit for classifying target punctuation marks in a normalized text corpus, and for feeding the classified text corpus into a long short-term memory neural network for context feature extraction training to obtain a discriminant model.
  16. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises a processor and a memory; the processor invokes a program in the memory to execute the method for adding punctuation marks in speech recognition according to any one of claims 1 to 8.
  17. A computer-readable storage medium, characterized in that a program of a method for adding punctuation marks in speech recognition is stored on the computer-readable storage medium; when executed by a processor, the program implements the method for adding punctuation marks in speech recognition according to any one of claims 1 to 8.
PCT/CN2021/120413 2021-02-07 2021-09-24 Method for adding punctuation marks in speech recognition and speech recognition device WO2022166218A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110168975.5 2021-02-07
CN202110168975.5A CN112927679B (zh) 2021-02-07 2021-02-07 Method for adding punctuation marks in speech recognition and speech recognition device

Publications (1)

Publication Number Publication Date
WO2022166218A1 true WO2022166218A1 (zh) 2022-08-11

Family

ID=76171060

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120413 WO2022166218A1 (zh) 2021-02-07 2021-09-24 Method for adding punctuation marks in speech recognition and speech recognition device

Country Status (2)

Country Link
CN (1) CN112927679B (zh)
WO (1) WO2022166218A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392985A (zh) * 2023-12-11 2024-01-12 飞狐信息技术(天津)有限公司 Speech processing method and apparatus, terminal, and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927679B (zh) * 2021-02-07 2023-08-15 虫洞创新平台(深圳)有限公司 一种语音识别中添加标点符号的方法及语音识别装置
CN113362811B (zh) * 2021-06-30 2023-03-24 北京有竹居网络技术有限公司 语音识别模型的训练方法、语音识别方法和装置
CN113782010B (zh) * 2021-11-10 2022-02-15 北京沃丰时代数据科技有限公司 机器人响应方法、装置、电子设备及存储介质
WO2024029152A1 (ja) * 2022-08-05 2024-02-08 株式会社Nttドコモ 区切り記号挿入装置及び音声認識システム

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001175275A (ja) * 1999-12-16 2001-06-29 Seiko Epson Corp Subword acoustic model generation method and speech recognition device
CN103164399A (zh) * 2013-02-26 2013-06-19 北京捷通华声语音技术有限公司 Punctuation addition method and device in speech recognition
CN106653030A (zh) * 2016-12-02 2017-05-10 北京云知声信息技术有限公司 Punctuation adding method and device
CN108831481A (zh) * 2018-08-01 2018-11-16 平安科技(深圳)有限公司 Method and device for adding symbols in speech recognition, computer equipment, and storage medium
CN111709242A (zh) * 2020-06-01 2020-09-25 广州多益网络股份有限公司 Chinese punctuation mark adding method based on named entity recognition
CN112927679A (zh) * 2021-02-07 2021-06-08 虫洞创新平台(深圳)有限公司 Method for adding punctuation marks in speech recognition and speech recognition device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3232289B2 (ja) * 1999-08-30 2001-11-26 インターナショナル・ビジネス・マシーンズ・コーポレーション Symbol insertion device and method thereof
US7043431B2 (en) * 2001-08-31 2006-05-09 Nokia Corporation Multilingual speech recognition system using text derived recognition models
CA2680304C (en) * 2008-09-25 2017-08-22 Multimodal Technologies, Inc. Decoding-time prediction of non-verbalized tokens
US9135231B1 (en) * 2012-10-04 2015-09-15 Google Inc. Training punctuation models
KR102450853B1 (ko) * 2015-11-30 2022-10-04 삼성전자주식회사 Speech recognition apparatus and method
JP6495850B2 (ja) * 2016-03-14 2019-04-03 株式会社東芝 Information processing device, information processing method, program, and recognition system
CN109448704A (zh) * 2018-11-20 2019-03-08 北京智能管家科技有限公司 Construction method and device of a speech decoding graph, server, and storage medium
CN110688822A (zh) * 2019-09-27 2020-01-14 上海智臻智能网络科技股份有限公司 Punctuation mark adding method, device, and medium
CN111261162B (zh) * 2020-03-09 2023-04-18 北京达佳互联信息技术有限公司 Speech recognition method, speech recognition device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001175275A (ja) * 1999-12-16 2001-06-29 Seiko Epson Corp Subword acoustic model generation method and speech recognition device
CN103164399A (zh) * 2013-02-26 2013-06-19 北京捷通华声语音技术有限公司 Punctuation addition method and device in speech recognition
CN106653030A (zh) * 2016-12-02 2017-05-10 北京云知声信息技术有限公司 Punctuation adding method and device
CN108831481A (zh) * 2018-08-01 2018-11-16 平安科技(深圳)有限公司 Method and device for adding symbols in speech recognition, computer equipment, and storage medium
CN111709242A (zh) * 2020-06-01 2020-09-25 广州多益网络股份有限公司 Chinese punctuation mark adding method based on named entity recognition
CN112927679A (zh) * 2021-02-07 2021-06-08 虫洞创新平台(深圳)有限公司 Method for adding punctuation marks in speech recognition and speech recognition device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392985A (zh) * 2023-12-11 2024-01-12 飞狐信息技术(天津)有限公司 Speech processing method and apparatus, terminal, and storage medium

Also Published As

Publication number Publication date
CN112927679A (zh) 2021-06-08
CN112927679B (zh) 2023-08-15

Similar Documents

Publication Publication Date Title
WO2022166218A1 (zh) Method for adding punctuation marks in speech recognition and speech recognition device
JP6550068B2 (ja) Pronunciation prediction in speech recognition
US10621975B2 (en) Machine training for native language and fluency identification
US8972260B2 (en) Speech recognition using multiple language models
US9613621B2 (en) Speech recognition method and electronic apparatus
US9734820B2 (en) System and method for translating real-time speech using segmentation based on conjunction locations
US7974844B2 (en) Apparatus, method and computer program product for recognizing speech
WO2017071182A1 (zh) Voice wake-up method, apparatus and system
CN109754809B (zh) Speech recognition method and apparatus, electronic device, and storage medium
US20150112674A1 (en) Method for building acoustic model, speech recognition method and electronic apparatus
WO2020186712A1 (zh) Speech recognition method and apparatus, and terminal
WO2018192186A1 (zh) Speech recognition method and device
TW201517018A (zh) Speech recognition method and electronic apparatus thereof
WO2020024620A1 (zh) Voice information processing method and apparatus, device, and storage medium
US20130030794A1 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
US20120221335A1 (en) Method and apparatus for creating voice tag
JP6875819B2 (ja) Apparatus and method for normalizing acoustic model input data, and speech recognition apparatus
WO2023048746A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN114330371A (zh) Conversation intention recognition method and apparatus based on prompt learning, and electronic device
CN114783464A (zh) Cognitive detection method and related apparatus, electronic device, and storage medium
Kumar et al. Machine learning based speech emotions recognition system
WO2021051564A1 (zh) Speech recognition method and apparatus, computing device, and storage medium
JP2000172294A (ja) Speech recognition method, apparatus therefor, and program recording medium
CN116052655A (zh) Audio processing method and apparatus, electronic device, and readable storage medium
WO2023035529A1 (zh) Intent-recognition-based intelligent information query method and apparatus, device, and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21924215

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21924215

Country of ref document: EP

Kind code of ref document: A1