WO2015149543A1 - Speech recognition method and apparatus - Google Patents

Speech recognition method and apparatus - Download PDF

Info

Publication number
WO2015149543A1
WO2015149543A1 (PCT/CN2014/094277)
Authority
WO
WIPO (PCT)
Prior art keywords
feature information
acoustic model
information
data
data compression
Prior art date
Application number
PCT/CN2014/094277
Other languages
English (en)
French (fr)
Inventor
李博
王志谦
胡娜
穆向禹
贾磊
魏伟
Original Assignee
百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.)
Priority to US14/896,588 (US9805712B2)
Publication of WO2015149543A1

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/01: Assessment or evaluation of speech recognition systems
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02: Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204: Speech or audio signals analysis-synthesis techniques using spectral analysis, using subband decomposition
    • G10L19/0208: Subband vocoders
    • G10L19/022: Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring

Definitions

  • the present invention relates to the field of intelligent processing technologies, and in particular, to a voice recognition method and apparatus.
  • Speech recognition is one of the important technologies in the field of information technology.
  • The goal of speech recognition is to make machines understand human natural language.
  • the recognized speech can be applied to different fields as a control signal.
  • At present, voice recognition usually works online: voice information input by a user is transmitted over the network to the cloud, recognized by a cloud server, and the result is sent back to the user.
  • the present invention aims to solve at least one of the technical problems in the related art to some extent.
  • Another object of the present invention is to provide a speech recognition apparatus.
  • The voice recognition method includes: collecting voice information input by a user; performing feature extraction on the voice information to obtain feature information; and decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
  • The voice recognition method provided by the embodiment of the first aspect of the present invention performs voice recognition offline, so recognition can be implemented without relying on a network, which is convenient for the user. Moreover, by compressing the acoustic model in advance, the acoustic model can fit on a mobile device, so voice recognition can be completed offline on the mobile device.
  • The voice recognition device of the second aspect of the present invention includes: an acquisition module configured to collect voice information input by a user; an extraction module configured to perform feature extraction on the voice information to obtain feature information; and a decoding module configured to decode the feature information according to a pre-acquired acoustic model and language model to obtain recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
  • The voice recognition device provided by the embodiment of the second aspect of the present invention performs voice recognition offline, so recognition can be implemented without relying on a network, which is convenient for the user. Moreover, by compressing the acoustic model in advance, the acoustic model can fit on a mobile device, so voice recognition can be completed offline on the mobile device.
  • A mobile device includes a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is disposed inside the space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the mobile device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, the program being configured to perform the following steps: collecting voice information input by the user; performing feature extraction on the voice information to obtain feature information; and decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
  • The mobile device proposed by the embodiment of the third aspect of the present invention performs voice recognition offline, so recognition can be implemented without relying on a network, which is convenient for the user. Moreover, by compressing the acoustic model in advance, the acoustic model can fit on a mobile device, so voice recognition can be completed offline on the mobile device.
  • FIG. 1 is a schematic flowchart of a voice recognition method according to an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of voice recognition in an offline mode according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a voice recognition method according to another embodiment of the present invention.
  • FIG. 4 is a schematic diagram of filtering feature information in an embodiment of the present invention.
  • FIG. 5 is a schematic flowchart of processing performed by using an acoustic model according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a voice recognition apparatus according to another embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a voice recognition apparatus according to another embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of a voice recognition method according to an embodiment of the present invention, where the method includes:
  • S11: The mobile device collects voice information input by the user.
  • The mobile device can be a mobile phone, a tablet computer, or the like.
  • In the related art, after receiving the voice information input by the user, the mobile device sends it over the network to a server in the cloud, which performs recognition and returns the result.
  • In this embodiment, to avoid speech recognition being limited by the network, the mobile device itself may complete the recognition, implementing offline voice recognition.
  • S12: The mobile device performs feature extraction on the voice information input by the user to obtain feature information.
  • Referring to FIG. 2, the analog voice information input by the user may first be converted into digital voice information. After that, the voice start point and end point can be determined with a Voice Activity Detector (VAD), and feature extraction is then performed.
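As a concrete illustration of this front end, the Python sketch below pairs a minimal energy-based endpoint detector with a toy spectral feature extractor. It is a sketch under stated assumptions, not the patent's implementation: the 16 kHz rate, 25 ms/10 ms framing, energy threshold, and log-spectrum features are all choices made here for clarity.

```python
import numpy as np

def frame_signal(samples, frame_len=400, hop=160):
    """Split 16 kHz samples into 25 ms frames with a 10 ms hop (assumed sizes)."""
    n = 1 + max(0, (len(samples) - frame_len) // hop)
    return np.stack([samples[i * hop : i * hop + frame_len] for i in range(n)])

def detect_endpoints(frames, energy_ratio=0.1):
    """Toy VAD: the voiced region is where frame energy exceeds a fraction of
    the peak energy. Real VADs are more elaborate; this is a stand-in."""
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    voiced = np.where(energy > energy_ratio * energy.max())[0]
    return (voiced[0], voiced[-1] + 1) if voiced.size else (0, len(frames))

def extract_features(frames, n_coeffs=40):
    """Toy feature extractor: log magnitude of the first n_coeffs FFT bins."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1))[:, :n_coeffs]
    return np.log(spectrum + 1e-8).astype(np.float32)

# Usage: crop to the voiced region found by the VAD, then extract features.
samples = np.random.randn(16000)            # one second of stand-in audio
frames = frame_signal(samples)
start, end = detect_endpoints(frames)
features = extract_features(frames[start:end])
```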
  • S13: The mobile device decodes the feature information according to the pre-acquired acoustic model and language model to obtain recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
  • As shown in FIG. 2, the decoding step often occupies most of the recognition time.
  • The decoder maps speech feature values to text strings by matching against the acoustic model and processing with the language model; the acoustic model is much more complex than the language model, so optimizing the acoustic model yields large efficiency gains for the whole speech recognition system.
  • This embodiment compresses the acoustic model so that the otherwise huge model can run on a mobile device.
  • In this embodiment, voice recognition is performed offline, so it can be implemented without relying on a network, which is convenient for the user.
  • By compressing the acoustic model in advance, the model can fit on the mobile device, enabling offline voice recognition there.
  • FIG. 3 is a schematic flowchart of a voice recognition method according to another embodiment of the present invention, where the method includes:
  • S31: The mobile device collects voice information input by the user.
  • S32: The mobile device performs feature extraction on the voice information to obtain feature information.
  • S33: The mobile device filters the feature information to obtain filtered feature information.
  • As shown in FIG. 4, a segment of voice information carries much useless information because of the pauses between words; filtering it out ensures that the feature information fed to the decoder is compact and effective.
  • The filtering includes, but is not limited to, frame skipping.
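A minimal sketch of the frame-skipping filter mentioned above; the skip rate of one-in-two is an assumption, since the patent does not specify one.

```python
import numpy as np

def skip_frames(features, keep_every=2):
    """Frame skipping: keep every keep_every-th feature frame. Adjacent frames
    are highly correlated, so dropping some loses little information while
    shrinking the decoder's input. The skip rate here is an assumption."""
    return features[::keep_every]

features = np.random.randn(98, 40).astype(np.float32)  # stand-in feature frames
filtered = skip_frames(features)                        # 49 frames remain
```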
  • S34: The mobile device computes over the input filtered feature information according to the compressed acoustic model to obtain an acoustic-model score.
  • Specifically, the processing flow using the acoustic model, shown in FIG. 5, includes:
  • S51: Compress the input data, where the input data is the filtered feature information.
  • S52: Using the compressed acoustic model, perform parallel computation on the compressed input data to obtain output data, where the output data is the compressed acoustic-model score.
  • The algorithm used to compress the input data is the same as the one used to compress the acoustic model, so that the two match.
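One plausible realization of steps S51-S53 is uniform 8-bit quantization shared by the input and the model, with scoring done in the integer domain. The patent does not disclose its actual compression algorithm, so the linear quantizer, the shared scale, and the single-matrix scoring model below are all assumptions made for illustration.

```python
import numpy as np

SCALE = 0.05  # assumed quantization step, shared by model and input (matching)

def quantize(x, scale=SCALE):
    """Uniform 8-bit quantization; the same scheme compresses model and input."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical compressed acoustic model: one int8 weight matrix mapping
# 40-dim feature frames to 3000 acoustic-state scores.
W_Q = quantize(np.random.randn(40, 3000) * 0.5)

def acoustic_scores(features):
    """S51: compress the input frames; S52: score in the integer domain
    (a vectorized matmul, which BLAS parallelizes); S53: decompress."""
    q_in = quantize(features)                              # S51
    q_out = q_in.astype(np.int32) @ W_Q.astype(np.int32)   # S52
    return dequantize(q_out, SCALE * SCALE)                # S53
```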
  • the optimization in this embodiment may include data structure optimization and computation mode optimization, wherein data structure optimization refers to compression of both input data and acoustic models.
  • Computational mode optimization refers to the use of parallel operations.
  • In the decoder module, this embodiment uses data compression to shrink the otherwise very large acoustic model to a scale suitable for a mobile device, while keeping the overall recognition rate from degrading.
  • Since all operands in the decoding process are compressed data, this embodiment adds compression and decompression steps relative to a typical decoding flow; however, because the compressed data volume is an order of magnitude smaller than the original, and the cost of input compression and output decompression is far smaller than the cost of model scoring, the overall decoding time is much shorter than decoding without compression.
  • While compressing the data, this embodiment also fully exploits the parallelism of the computation. When decoding a large amount of input data, different inputs have no data dependence on one another, and even the computation for a single input contains several unrelated steps.
  • This embodiment therefore applies various parallel techniques, including but not limited to data parallelism, instruction parallelism, and thread parallelism, to optimize the whole decoding process in parallel, yielding a huge time saving.
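Of the parallel techniques named above, the sketch below illustrates thread parallelism only: independent batches of frames carry no data dependence, so they can be scored concurrently (NumPy's matrix multiply releases the GIL, so threads overlap usefully). Data and instruction parallelism would come from SIMD and vectorized kernels instead; the batch size here is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def score_in_parallel(score_fn, features, batch=64):
    """Thread-parallel scoring: split the frames into independent batches and
    score them concurrently; results are concatenated in order."""
    batches = [features[i:i + batch] for i in range(0, len(features), batch)]
    with ThreadPoolExecutor() as pool:
        return np.concatenate(list(pool.map(score_fn, batches)))

# Usage with the acoustic_scores() sketch above:
# all_scores = score_in_parallel(acoustic_scores, filtered)
```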
  • With these two optimizations, the proportion of decoding time in the overall recognition time drops from more than 95% to less than 20%, a speedup of more than 150 times, and the model shrinks to 15% of its original size.
  • These figures make the approach well suited to mobile devices.
  • S53: Decompress the output data to obtain the acoustic-model score.
  • After the acoustic-model score is obtained, a language-model score can also be obtained, and the recognized voice information is finally derived from both. That is, the method of this embodiment further includes:
  • S35: The mobile device performs language-model scoring on the data matched by the acoustic model to obtain a language-model score.
  • After acoustic-model processing, a score for each word in the acoustic model is obtained; the corresponding words can then be scored by the language model according to these scores.
  • Because the language model is much simpler than the acoustic model, the language model currently used on the server can be applied on the mobile device, following the existing language-model processing flow.
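To illustrate why the language model is cheap enough to reuse unchanged on the device, the sketch below scores a word sequence with a toy bigram model. The probabilities and the flat back-off penalty are made-up values, not the patent's model.

```python
# Hypothetical bigram log-probabilities (natural log); "<s>" marks sentence start.
BIGRAM_LOGP = {("<s>", "open"): -1.2, ("open", "the"): -0.9, ("the", "map"): -0.7}

def lm_score(words, backoff=-5.0):
    """Sum bigram log-probabilities, with a crude constant back-off for
    unseen word pairs (a simplification of real smoothing)."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += BIGRAM_LOGP.get((prev, w), backoff)
        prev = w
    return score

print(lm_score(["open", "the", "map"]))   # -2.8
```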
  • S36: The mobile device combines the acoustic-model score and the language-model score to obtain a combined score.
  • The final score combines the acoustic-model score and the language-model score, including but not limited to a weighted sum:
  • score = W_am · score_am + W_lm · score_lm
  • where score is the final score, W_am and W_lm are the weights of the acoustic model and the language model respectively, and score_am and score_lm are the acoustic-model and language-model scores respectively.
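A direct transcription of the weighted sum, followed by a usage line that picks the hypothesis with the highest combined score (as step S37 below does). The weights and the per-hypothesis scores are illustrative only.

```python
def combined_score(score_am, score_lm, w_am=0.6, w_lm=0.4):
    """score = W_am * score_am + W_lm * score_lm; the weights are assumptions."""
    return w_am * score_am + w_lm * score_lm

# Hypothetical candidate text segments with (acoustic, language) score pairs:
hypotheses = {"open the map": (-10.2, -2.8), "open the mat": (-10.0, -7.5)}
best = max(hypotheses, key=lambda h: combined_score(*hypotheses[h]))
print(best)  # "open the map" wins: -7.24 vs -9.0
```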
  • S37: The mobile device determines the text segment corresponding to the highest combined score as the text segment obtained from speech recognition.
  • Offline voice recognition can thus be implemented on the mobile device and applied in software such as map navigation and mobile phone input, so the user does not need to enter information by hand and can perform the corresponding control directly by voice, improving the user experience.
  • By optimizing the computation and data of the acoustic model in the decoding stage, the recognition rate is maintained while the occupied system resources stay within a reasonable range, which makes the method well suited to deployment on a mobile device.
  • FIG. 6 is a schematic structural diagram of a voice recognition apparatus according to another embodiment of the present invention.
  • the apparatus 60 includes an acquisition module 61, an extraction module 62, and a decoding module 63.
  • The acquisition module 61 is configured to collect voice information input by the user.
  • the device may be specifically a mobile device, and the mobile device may be a mobile phone, a tablet computer, or the like.
  • In the related art, after receiving the voice information input by the user, the mobile device sends it over the network to a server in the cloud, which performs recognition and returns the result.
  • In this embodiment, to avoid speech recognition being limited by the network, the mobile device itself may complete the recognition, implementing offline voice recognition.
  • the extracting module 62 is configured to perform feature extraction on the voice information to obtain feature information.
  • The analog voice information input by the user may first be converted into digital voice information. After that, the voice start point and end point can be determined with a Voice Activity Detector (VAD), and feature extraction is then performed.
  • the decoding module 63 is configured to decode the feature information according to the acoustic model and the language model acquired in advance, and obtain the recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
  • the decoding step often occupies most of the time.
  • The decoder maps speech feature values to text strings by matching against the acoustic model and processing with the language model; the acoustic model is much more complex than the language model, so optimizing the acoustic model yields large efficiency gains for the whole speech recognition system.
  • This embodiment compresses the acoustic model so that the otherwise huge model can run on a mobile device.
  • In this embodiment, voice recognition is performed offline, so it can be implemented without relying on a network, which is convenient for the user. Moreover, by compressing the acoustic model in advance, the model can fit on the mobile device, enabling offline voice recognition there.
  • FIG. 7 is a schematic structural diagram of a voice recognition apparatus according to another embodiment of the present invention.
  • the apparatus 60 further includes a filtering module 64.
  • The filtering module 64 is configured to filter the feature information to obtain filtered feature information, so that the filtered feature information is decoded.
  • As shown in FIG. 4, a segment of voice information carries much useless information because of the pauses between words; this embodiment filters out the useless information to ensure that the feature information input to the decoder is compact and effective.
  • In an embodiment, the filtering module 64 is specifically configured to perform frame-skipping extraction on the feature information.
  • In an embodiment, the decoding module 63 is specifically configured to: compress the feature information, and compute over the compressed feature information according to the compressed acoustic model to obtain an acoustic-model score; operate on the acoustic-model-scored data according to the language model to obtain a language-model score; and obtain the recognized voice information according to the acoustic-model score and the language-model score.
  • In an embodiment, the computation that the decoding module 63 performs on the compressed feature information includes parallel computation on the compressed feature information.
  • In an embodiment, the parallel computation performed by the decoding module 63 includes at least one of: data-parallel computation, instruction-parallel computation, and thread-parallel computation.
  • The algorithm used to compress the input data is the same as the one used to compress the acoustic model, so that the two match.
  • the optimization in this embodiment may include data structure optimization and computation mode optimization, wherein data structure optimization refers to compression of both input data and acoustic models.
  • Computational mode optimization refers to the use of parallel operations.
  • In the decoder module, this embodiment uses data compression to shrink the otherwise very large acoustic model to a scale suitable for a mobile device, while keeping the overall recognition rate from degrading.
  • Since all operands in the decoding process are compressed data, this embodiment adds compression and decompression steps relative to a typical decoding flow; however, because the compressed data volume is an order of magnitude smaller than the original, and the cost of input compression and output decompression is far smaller than the cost of model scoring, the overall decoding time is much shorter than decoding without compression.
  • While compressing the data, this embodiment also fully exploits the parallelism of the computation. When decoding a large amount of input data, different inputs have no data dependence on one another, and even the computation for a single input contains several unrelated steps.
  • This embodiment therefore applies various parallel techniques, including but not limited to data parallelism, instruction parallelism, and thread parallelism, to optimize the whole decoding process in parallel, yielding a huge time saving.
  • With these two optimizations, the proportion of decoding time in the overall recognition time drops from more than 95% to less than 20%, a speedup of more than 150 times, and the model shrinks to 15% of its original size.
  • These figures make the approach well suited to mobile devices.
  • After the acoustic-model score is obtained, the language-model score can also be obtained; finally, the recognized voice information is derived from the acoustic-model score and the language-model score.
  • The final score combines the acoustic-model score and the language-model score, including but not limited to a weighted sum:
  • score = W_am · score_am + W_lm · score_lm
  • where score is the final score, W_am and W_lm are the weights of the acoustic model and the language model respectively, and score_am and score_lm are the acoustic-model and language-model scores respectively.
  • The text segment corresponding to the highest combined score can be determined as the text segment obtained from speech recognition.
  • Offline voice recognition can thus be implemented on the mobile device and applied in software such as map navigation and mobile phone input, so the user does not need to enter information by hand and can perform the corresponding control directly by voice, improving the user experience.
  • By optimizing the computation and data of the acoustic model in the decoding stage, the recognition rate is maintained while the occupied system resources stay within a reasonable range, which makes the method well suited to deployment on a mobile device.
  • An embodiment of the present invention further provides a mobile device, including a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is disposed inside the space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the mobile device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, the program being configured to perform the following steps:
  • S11': Collect voice information input by the user.
  • the mobile device can be a mobile phone, a tablet computer, or the like.
  • In the related art, after receiving the voice information input by the user, the mobile device sends it over the network to a server in the cloud, which performs recognition and returns the result.
  • In this embodiment, to avoid speech recognition being limited by the network, the mobile device itself may complete the recognition, implementing offline voice recognition.
  • S12': Feature extraction is performed on the voice information input by the user to obtain feature information.
  • The analog voice information input by the user may first be converted into digital voice information. After that, the voice start point and end point can be determined with a Voice Activity Detector (VAD), and feature extraction is then performed.
  • S13': Decode the feature information according to the pre-acquired acoustic model and language model to obtain recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
  • The decoding step often occupies most of the recognition time.
  • The decoder maps speech feature values to text strings by matching against the acoustic model and processing with the language model; the acoustic model is much more complex than the language model, so optimizing the acoustic model yields large efficiency gains for the whole speech recognition system.
  • This embodiment compresses the acoustic model so that the otherwise huge model can run on a mobile device.
  • In this embodiment, voice recognition is performed offline, so it can be implemented without relying on a network, which is convenient for the user.
  • In another embodiment, the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, the program being configured to perform the following steps:
  • S31': Collect voice information input by the user.
  • S32': Perform feature extraction on the voice information to obtain feature information.
  • S33': Filter the feature information to obtain filtered feature information. The filtering includes, but is not limited to, frame skipping.
  • S34': Compute over the input filtered feature information according to the compressed acoustic model to obtain an acoustic-model score.
  • Specifically, obtaining the acoustic-model score may include:
  • S51': Compress the input data, where the input data is the filtered feature information.
  • S52': Using the compressed acoustic model, perform parallel computation on the compressed input data to obtain output data, where the output data is the compressed acoustic-model score.
  • The algorithm used to compress the input data is the same as the one used to compress the acoustic model, so that the two match.
  • the optimization in this embodiment may include data structure optimization and computation mode optimization, wherein data structure optimization refers to compression of both input data and acoustic models.
  • Computational mode optimization refers to the use of parallel operations.
  • In the decoder module, this embodiment uses data compression to shrink the otherwise very large acoustic model to a scale suitable for a mobile device, while keeping the overall recognition rate from degrading.
  • Since all operands in the decoding process are compressed data, this embodiment adds compression and decompression steps relative to a typical decoding flow; however, because the compressed data volume is an order of magnitude smaller than the original, and the cost of input compression and output decompression is far smaller than the cost of model scoring, the overall decoding time is much shorter than decoding without compression.
  • While compressing the data, this embodiment also fully exploits the parallelism of the computation. When decoding a large amount of input data, different inputs have no data dependence on one another, and even the computation for a single input contains several unrelated steps.
  • This embodiment therefore applies various parallel techniques, including but not limited to data parallelism, instruction parallelism, and thread parallelism, to optimize the whole decoding process in parallel, yielding a huge time saving.
  • With these two optimizations, the proportion of decoding time in the overall recognition time drops from more than 95% to less than 20%, a speedup of more than 150 times, and the model shrinks to 15% of its original size.
  • These figures make the approach well suited to mobile devices.
  • S53': Decompress the output data to obtain the acoustic-model score.
  • After the acoustic-model score is obtained, a language-model score can also be obtained, and the recognized voice information is finally derived from both. That is, the method of this embodiment further includes:
  • S35': Perform language-model scoring on the data matched by the acoustic model to obtain a language-model score.
  • After acoustic-model processing, a score for each word in the acoustic model is obtained; the corresponding words can then be scored by the language model according to these scores.
  • Because the language model is much simpler than the acoustic model, the language model currently used on the server can be applied on the mobile device, following the existing language-model processing flow.
  • S36': Combine the acoustic-model score and the language-model score to obtain a combined score.
  • The final score combines the acoustic-model score and the language-model score, including but not limited to a weighted sum:
  • score = W_am · score_am + W_lm · score_lm
  • where score is the final score, W_am and W_lm are the weights of the acoustic model and the language model respectively, and score_am and score_lm are the acoustic-model and language-model scores respectively.
  • S37': The text segment corresponding to the highest combined score is determined as the text segment obtained from speech recognition.
  • Offline voice recognition can thus be implemented on the mobile device and applied in software such as map navigation and mobile phone input, so the user does not need to enter information by hand and can perform the corresponding control directly by voice, improving the user experience.
  • By optimizing the computation and data of the acoustic model in the decoding stage, the recognition rate is maintained while the occupied system resources stay within a reasonable range, which makes the method well suited to deployment on a mobile device.
  • An embodiment of the present invention further provides a mobile device, the mobile device including:
  • one or more processors;
  • a memory; and
  • one or more programs, stored in the memory, which, when executed by the one or more processors, perform the following operations: collecting voice information input by the user; performing feature extraction on the voice information to obtain feature information; and decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized voice information, wherein the acoustic model is obtained by performing data compression in advance.
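Tying the steps together, a minimal offline pipeline under the assumptions of the earlier sketches might look like the following. Every helper it calls (frame_signal, detect_endpoints, extract_features, skip_frames, acoustic_scores, lm_score, combined_score) is one of the hypothetical sketches above, not the patent's code.

```python
def recognize(samples, hypotheses):
    """End-to-end offline sketch: VAD -> features -> frame skipping ->
    compressed acoustic scoring -> LM scoring -> weighted combination.
    `hypotheses` maps each candidate text to its acoustic alignment score,
    which a real decoder would derive from the per-frame scores below."""
    frames = frame_signal(samples)
    start, end = detect_endpoints(frames)
    feats = skip_frames(extract_features(frames[start:end]))
    frame_scores = acoustic_scores(feats)  # consumed by a real decoder's search
    scored = {text: combined_score(am, lm_score(text.split()))
              for text, am in hypotheses.items()}
    return max(scored, key=scored.get)
```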
  • portions of the invention may be implemented in hardware, software, firmware or a combination thereof.
  • Multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
  • each functional unit in each embodiment of the present invention may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.


Abstract

A speech recognition method and apparatus. The method includes: collecting voice information input by a user; performing feature extraction on the voice information to obtain feature information; and decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, wherein the acoustic model is obtained by performing data compression in advance. The method can perform speech recognition without relying on a network.

Description

Speech recognition method and apparatus
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to Chinese Patent Application No. 201410129541.4, entitled "Speech recognition method and apparatus", filed on April 1, 2014 by 百度在线网络技术(北京)有限公司 (Baidu Online Network Technology (Beijing) Co., Ltd.).
TECHNICAL FIELD
The present invention relates to the field of intelligent processing technologies, and in particular to a speech recognition method and apparatus.
BACKGROUND
Speech recognition is one of the important technologies in the field of information technology. Its goal is to make machines understand human natural language, and the recognized speech can be applied in different fields as a control signal.
At present, speech recognition usually works online: voice information input by the user is transmitted over the network to the cloud, recognized by a cloud server, and the result is sent back to the user.
However, this online mode depends on the network.
SUMMARY
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose a speech recognition method that can perform speech recognition without relying on a network.
Another object of the present invention is to propose a speech recognition apparatus.
To achieve the above objects, the speech recognition method proposed by embodiments of the first aspect of the present invention includes: collecting voice information input by a user; performing feature extraction on the voice information to obtain feature information; and decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
The speech recognition method proposed by embodiments of the first aspect performs speech recognition offline, so recognition does not depend on a network connection, which is convenient for users. Moreover, compressing the acoustic model in advance lets the model fit on a mobile device, so speech recognition can be completed offline on the device.
To achieve the above objects, the speech recognition apparatus proposed by embodiments of the second aspect of the present invention includes: an acquisition module configured to collect voice information input by a user; an extraction module configured to perform feature extraction on the voice information to obtain feature information; and a decoding module configured to decode the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
The speech recognition apparatus proposed by embodiments of the second aspect performs speech recognition offline, so recognition does not depend on a network connection, which is convenient for users. Moreover, compressing the acoustic model in advance lets the model fit on a mobile device, so speech recognition can be completed offline on the device.
To achieve the above objects, the mobile device proposed by embodiments of the third aspect of the present invention includes a housing, a processor, a memory, a circuit board, and a power supply circuit, where the circuit board is disposed inside the space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the mobile device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the following steps: collecting voice information input by a user; performing feature extraction on the voice information to obtain feature information; and decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
The mobile device proposed by embodiments of the third aspect performs speech recognition offline, so recognition does not depend on a network connection, which is convenient for users. Moreover, compressing the acoustic model in advance lets the model fit on a mobile device, so speech recognition can be completed offline on the device.
Additional aspects and advantages of the present invention will be given in part in the following description, will become apparent in part from the following description, or will be learned through practice of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of offline speech recognition in an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a speech recognition method according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of filtering feature information in an embodiment of the present invention;
FIG. 5 is a schematic flowchart of processing with an acoustic model in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a speech recognition apparatus according to another embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to another embodiment of the present invention.
DETAILED DESCRIPTION
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
FIG. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present invention. The method includes:
S11: The mobile device collects voice information input by the user.
The mobile device may be a mobile phone, a tablet computer, or the like.
In the related art, after receiving voice information input by the user, the mobile device sends it over the network to a cloud server, which performs recognition and returns the result.
In this embodiment, to avoid speech recognition being limited by the network, the mobile device itself may complete the recognition, implementing offline speech recognition.
S12: The mobile device performs feature extraction on the voice information input by the user to obtain feature information.
Referring to FIG. 2, the analog voice information input by the user may first be converted into digital voice information. After that, the voice start point and end point can be determined with a Voice Activity Detector (VAD), and feature extraction is then performed.
S13: The mobile device decodes the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
As shown in FIG. 2, the decoding step often occupies most of the time. The decoder maps speech feature values to text strings by matching against the acoustic model and processing with the language model, and the acoustic model is much more complex than the language model. Optimizing the acoustic model therefore brings large efficiency gains to the whole speech recognition system. This embodiment compresses the acoustic model so that the otherwise huge model can run on a mobile device.
This embodiment performs speech recognition offline, so recognition does not depend on a network connection, which is convenient for users. Moreover, compressing the acoustic model in advance lets the model fit on a mobile device, so speech recognition can be completed offline on the device.
FIG. 3 is a schematic flowchart of a speech recognition method according to another embodiment of the present invention. The method includes:
S31: The mobile device collects voice information input by the user.
S32: The mobile device performs feature extraction on the voice information to obtain feature information.
S33: The mobile device filters the feature information to obtain filtered feature information.
As shown in FIG. 4, a segment of voice information carries much useless information because of the pauses between words. This embodiment filters out the useless information to ensure that the feature information fed to the decoder is compact and effective.
The filtering includes, but is not limited to, frame skipping.
S34: The mobile device computes over the input filtered feature information according to the compressed acoustic model to obtain an acoustic-model score.
Specifically, the processing flow using the acoustic model, shown in FIG. 5, includes:
S51: Compress the input data.
The input data is the filtered feature information.
S52: Using the compressed acoustic model, perform parallel computation on the compressed input data to obtain output data, where the output data is the compressed acoustic-model score.
The compression algorithm applied to the input data is the same as the one applied to the acoustic model, so that the two match.
The optimization in this embodiment may include data-structure optimization and computation-mode optimization: data-structure optimization means that both the input data and the acoustic model are compressed, and computation-mode optimization means that parallel computation is used.
In the decoder module, this embodiment uses data compression to shrink the otherwise very large acoustic model to a scale suitable for a mobile device while keeping the overall recognition rate from degrading. Furthermore, since all operands in the decoding process are compressed data, the embodiment adds compression and decompression steps relative to a typical decoding flow; but because the compressed data volume is an order of magnitude smaller than the original, and the cost of input compression and output decompression is far smaller than the cost of model scoring, the overall decoding time is much shorter than decoding without compression.
While compressing the data, this embodiment also fully exploits the parallelism of the computation. When decoding a large amount of input data, different inputs have no data dependence on one another, and even the computation for a single input contains several unrelated steps. This embodiment therefore applies various parallel techniques, including but not limited to data parallelism, instruction parallelism, and thread parallelism, to optimize the whole decoding process in parallel, yielding a huge time saving.
With these two optimizations, the proportion of decoding time in the overall recognition time drops from more than 95% to less than 20%, a speedup of more than 150 times, and the model shrinks to 15% of its original size. These figures make the approach fully suitable for mobile devices.
S53: Decompress the output data to obtain the acoustic-model score.
After the acoustic-model score is obtained, a language-model score can also be obtained, and the recognized speech information is finally derived from both. That is, the method of this embodiment further includes:
S35: The mobile device performs language-model scoring on the data matched by the acoustic model to obtain a language-model score.
After acoustic-model processing, a score for each word in the acoustic model is obtained; the corresponding words can then be scored by the language model according to these scores.
Because the language model is much simpler than the acoustic model, the language model currently used on the server can be applied on the mobile device, following the existing language-model processing flow.
S36: The mobile device combines the acoustic-model score and the language-model score to obtain a combined score.
The final score combines the acoustic-model score and the language-model score, including but not limited to a weighted sum:
score = W_am · score_am + W_lm · score_lm
where score is the final score, W_am and W_lm are the weights of the acoustic model and the language model respectively, and score_am and score_lm are the acoustic-model and language-model scores respectively.
S37: The mobile device determines the text segment corresponding to the highest combined score as the text segment obtained from speech recognition.
This embodiment implements offline speech recognition on a mobile device and can be applied in software such as map navigation and mobile phone input, so that the user does not need to enter information by hand and can perform the corresponding control directly by voice, improving the user experience. By optimizing the computation and data of the acoustic model in the decoding stage, this embodiment guarantees the recognition rate while keeping the occupied system resources within a reasonable range, making it very suitable for deployment on mobile devices.
FIG. 6 is a schematic structural diagram of a speech recognition apparatus according to another embodiment of the present invention. The apparatus 60 includes an acquisition module 61, an extraction module 62, and a decoding module 63.
The acquisition module 61 is configured to collect voice information input by the user.
The apparatus may specifically be a mobile device, such as a mobile phone or a tablet computer.
In the related art, after receiving voice information input by the user, the mobile device sends it over the network to a cloud server, which performs recognition and returns the result.
In this embodiment, to avoid speech recognition being limited by the network, the mobile device itself may complete the recognition, implementing offline speech recognition.
The extraction module 62 is configured to perform feature extraction on the voice information to obtain feature information.
The analog voice information input by the user may first be converted into digital voice information. After that, the voice start point and end point can be determined with a Voice Activity Detector (VAD), and feature extraction is then performed.
The decoding module 63 is configured to decode the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
The decoding step often occupies most of the time. The decoder maps speech feature values to text strings by matching against the acoustic model and processing with the language model, and the acoustic model is much more complex than the language model. Optimizing the acoustic model therefore brings large efficiency gains to the whole speech recognition system. This embodiment compresses the acoustic model so that the otherwise huge model can run on a mobile device.
This embodiment performs speech recognition offline, so recognition does not depend on a network connection, which is convenient for users. Moreover, compressing the acoustic model in advance lets the model fit on a mobile device, so speech recognition can be completed offline on the device.
FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to another embodiment of the present invention. The apparatus 60 further includes a filtering module 64.
The filtering module 64 is configured to filter the feature information to obtain filtered feature information, so that the filtered feature information is decoded.
As shown in FIG. 4, a segment of voice information carries much useless information because of the pauses between words. This embodiment filters out the useless information to ensure that the feature information fed to the decoder is compact and effective.
In one embodiment, the filtering module 64 is specifically configured to perform frame-skipping extraction on the feature information.
In one embodiment, the decoding module 63 is specifically configured to:
compress the feature information, and compute over the compressed feature information according to the compressed acoustic model to obtain an acoustic-model score;
operate on the acoustic-model-scored data according to the language model to obtain a language-model score; and
obtain the recognized speech information according to the acoustic-model score and the language-model score.
In one embodiment, the computation that the decoding module 63 performs on the compressed feature information includes:
parallel computation on the compressed feature information.
In one embodiment, the parallel computation performed by the decoding module 63 includes at least one of the following:
data-parallel computation, instruction-parallel computation, and thread-parallel computation.
The compression algorithm applied to the input data is the same as the one applied to the acoustic model, so that the two match.
The optimization in this embodiment may include data-structure optimization and computation-mode optimization: data-structure optimization means that both the input data and the acoustic model are compressed, and computation-mode optimization means that parallel computation is used.
In the decoder module, this embodiment uses data compression to shrink the otherwise very large acoustic model to a scale suitable for a mobile device while keeping the overall recognition rate from degrading. Furthermore, since all operands in the decoding process are compressed data, the embodiment adds compression and decompression steps relative to a typical decoding flow; but because the compressed data volume is an order of magnitude smaller than the original, and the cost of input compression and output decompression is far smaller than the cost of model scoring, the overall decoding time is much shorter than decoding without compression.
While compressing the data, this embodiment also fully exploits the parallelism of the computation. When decoding a large amount of input data, different inputs have no data dependence on one another, and even the computation for a single input contains several unrelated steps. This embodiment therefore applies various parallel techniques, including but not limited to data parallelism, instruction parallelism, and thread parallelism, to optimize the whole decoding process in parallel, yielding a huge time saving.
With these two optimizations, the proportion of decoding time in the overall recognition time drops from more than 95% to less than 20%, a speedup of more than 150 times, and the model shrinks to 15% of its original size. These figures make the approach fully suitable for mobile devices.
After the acoustic-model score is obtained, a language-model score can also be obtained, and the recognized speech information is finally derived from both.
The final score combines the acoustic-model score and the language-model score, including but not limited to a weighted sum:
score = W_am · score_am + W_lm · score_lm
where score is the final score, W_am and W_lm are the weights of the acoustic model and the language model respectively, and score_am and score_lm are the acoustic-model and language-model scores respectively.
The text segment corresponding to the highest combined score may be determined as the text segment obtained from speech recognition.
This embodiment implements offline speech recognition on a mobile device and can be applied in software such as map navigation and mobile phone input, so that the user does not need to enter information by hand and can perform the corresponding control directly by voice, improving the user experience. By optimizing the computation and data of the acoustic model in the decoding stage, this embodiment guarantees the recognition rate while keeping the occupied system resources within a reasonable range, making it very suitable for deployment on mobile devices.
An embodiment of the present invention further provides a mobile device including a housing, a processor, a memory, a circuit board, and a power supply circuit, wherein the circuit board is disposed inside the space enclosed by the housing, and the processor and the memory are disposed on the circuit board; the power supply circuit is configured to supply power to each circuit or component of the mobile device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the following steps:
S11': Collect voice information input by the user.
The mobile device may be a mobile phone, a tablet computer, or the like.
In the related art, after receiving voice information input by the user, the mobile device sends it over the network to a cloud server, which performs recognition and returns the result.
In this embodiment, to avoid speech recognition being limited by the network, the mobile device itself may complete the recognition, implementing offline speech recognition.
S12': Perform feature extraction on the voice information input by the user to obtain feature information.
The analog voice information input by the user may first be converted into digital voice information. After that, the voice start point and end point can be determined with a Voice Activity Detector (VAD), and feature extraction is then performed.
S13': Decode the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
The decoding step often occupies most of the time. The decoder maps speech feature values to text strings by matching against the acoustic model and processing with the language model, and the acoustic model is much more complex than the language model. Optimizing the acoustic model therefore brings large efficiency gains to the whole speech recognition system. This embodiment compresses the acoustic model so that the otherwise huge model can run on a mobile device.
This embodiment performs speech recognition offline, so recognition does not depend on a network connection, which is convenient for users.
In another embodiment, the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the following steps:
S31': Collect voice information input by the user.
S32': Perform feature extraction on the voice information to obtain feature information.
S33': Filter the feature information to obtain filtered feature information.
As shown in FIG. 4, a segment of voice information carries much useless information because of the pauses between words. This embodiment filters out the useless information to ensure that the feature information fed to the decoder is compact and effective.
The filtering includes, but is not limited to, frame skipping.
S34': Compute over the input filtered feature information according to the compressed acoustic model to obtain an acoustic-model score.
Specifically, obtaining the acoustic-model score may include:
S51': Compress the input data.
The input data is the filtered feature information.
S52': Using the compressed acoustic model, perform parallel computation on the compressed input data to obtain output data, where the output data is the compressed acoustic-model score.
The compression algorithm applied to the input data is the same as the one applied to the acoustic model, so that the two match.
The optimization in this embodiment may include data-structure optimization and computation-mode optimization: data-structure optimization means that both the input data and the acoustic model are compressed, and computation-mode optimization means that parallel computation is used.
In the decoder module, this embodiment uses data compression to shrink the otherwise very large acoustic model to a scale suitable for a mobile device while keeping the overall recognition rate from degrading. Furthermore, since all operands in the decoding process are compressed data, the embodiment adds compression and decompression steps relative to a typical decoding flow; but because the compressed data volume is an order of magnitude smaller than the original, and the cost of input compression and output decompression is far smaller than the cost of model scoring, the overall decoding time is much shorter than decoding without compression.
While compressing the data, this embodiment also fully exploits the parallelism of the computation. When decoding a large amount of input data, different inputs have no data dependence on one another, and even the computation for a single input contains several unrelated steps. This embodiment therefore applies various parallel techniques, including but not limited to data parallelism, instruction parallelism, and thread parallelism, to optimize the whole decoding process in parallel, yielding a huge time saving.
With these two optimizations, the proportion of decoding time in the overall recognition time drops from more than 95% to less than 20%, a speedup of more than 150 times, and the model shrinks to 15% of its original size. These figures make the approach fully suitable for mobile devices.
S53': Decompress the output data to obtain the acoustic-model score.
After the acoustic-model score is obtained, a language-model score can also be obtained, and the recognized speech information is finally derived from both. That is, the method of this embodiment further includes:
S35': Perform language-model scoring on the data matched by the acoustic model to obtain a language-model score.
After acoustic-model processing, a score for each word in the acoustic model is obtained; the corresponding words can then be scored by the language model according to these scores.
Because the language model is much simpler than the acoustic model, the language model currently used on the server can be applied on the mobile device, following the existing language-model processing flow.
S36': Combine the acoustic-model score and the language-model score to obtain a combined score.
The final score combines the acoustic-model score and the language-model score, including but not limited to a weighted sum:
score = W_am · score_am + W_lm · score_lm
where score is the final score, W_am and W_lm are the weights of the acoustic model and the language model respectively, and score_am and score_lm are the acoustic-model and language-model scores respectively.
S37': The text segment corresponding to the highest combined score is determined as the text segment obtained from speech recognition.
This embodiment implements offline speech recognition on a mobile device and can be applied in software such as map navigation and mobile phone input, so that the user does not need to enter information by hand and can perform the corresponding control directly by voice, improving the user experience. By optimizing the computation and data of the acoustic model in the decoding stage, this embodiment guarantees the recognition rate while keeping the occupied system resources within a reasonable range, making it very suitable for deployment on mobile devices.
An embodiment of the present invention further provides a mobile device, the mobile device including:
one or more processors;
a memory; and
one or more programs, stored in the memory, which, when executed by the one or more processors, perform the following operations:
collecting voice information input by a user;
performing feature extraction on the voice information to obtain feature information; and
decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, where the acoustic model is obtained by performing data compression in advance.
It should be noted that, in the description of the present invention, the terms "first", "second", and the like are used for descriptive purposes only and should not be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise stated, "a plurality of" means two or more.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be completed by instructing relevant hardware through a program, and the program may be stored in a computer-readable storage medium; when executed, the program performs one of, or a combination of, the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (13)

  1. A speech recognition method, comprising:
    collecting voice information input by a user;
    performing feature extraction on the voice information to obtain feature information; and
    decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, wherein the acoustic model is obtained by performing data compression in advance.
  2. The method according to claim 1, wherein after the feature information is obtained, the method further comprises:
    filtering the feature information to obtain filtered feature information, so that the filtered feature information is decoded.
  3. The method according to claim 2, wherein the filtering of the feature information comprises:
    performing frame-skipping extraction on the feature information.
  4. The method according to any one of claims 1 to 3, wherein the decoding of the feature information according to the pre-acquired acoustic model and language model to obtain the recognized speech information comprises:
    compressing the feature information, and computing over the compressed feature information according to the compressed acoustic model to obtain an acoustic-model score;
    operating on the acoustic-model-scored data according to the language model to obtain a language-model score; and
    obtaining the recognized speech information according to the acoustic-model score and the language-model score.
  5. The method according to claim 4, wherein the computing over the compressed feature information comprises:
    performing parallel computation on the compressed feature information.
  6. The method according to claim 5, wherein the parallel computation comprises at least one of:
    data-parallel computation, instruction-parallel computation, and thread-parallel computation.
  7. A speech recognition apparatus, comprising:
    an acquisition module configured to collect voice information input by a user;
    an extraction module configured to perform feature extraction on the voice information to obtain feature information; and
    a decoding module configured to decode the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, wherein the acoustic model is obtained by performing data compression in advance.
  8. The apparatus according to claim 7, further comprising:
    a filtering module configured to filter the feature information to obtain filtered feature information, so that the filtered feature information is decoded.
  9. The apparatus according to claim 8, wherein the filtering module is specifically configured to:
    perform frame-skipping extraction on the feature information.
  10. The apparatus according to any one of claims 7 to 9, wherein the decoding module is specifically configured to:
    compress the feature information, and compute over the compressed feature information according to the compressed acoustic model to obtain an acoustic-model score;
    operate on the acoustic-model-scored data according to the language model to obtain a language-model score; and
    obtain the recognized speech information according to the acoustic-model score and the language-model score.
  11. The apparatus according to claim 10, wherein the computation performed by the decoding module on the compressed feature information comprises:
    performing parallel computation on the compressed feature information.
  12. The apparatus according to claim 11, wherein the parallel computation performed by the decoding module comprises at least one of:
    data-parallel computation, instruction-parallel computation, and thread-parallel computation.
  13. A mobile device, comprising:
    one or more processors;
    a memory; and
    one or more programs, stored in the memory, which, when executed by the one or more processors, perform the following operations:
    collecting voice information input by a user;
    performing feature extraction on the voice information to obtain feature information; and
    decoding the feature information according to a pre-acquired acoustic model and language model to obtain recognized speech information, wherein the acoustic model is obtained by performing data compression in advance.
PCT/CN2014/094277 2014-04-01 2014-12-18 Speech recognition method and apparatus WO2015149543A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/896,588 US9805712B2 (en) 2014-04-01 2014-12-18 Method and device for recognizing voice

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410129541.4 2014-04-01
CN201410129541.4A CN103915092B (zh) 2014-04-01 2014-04-01 Speech recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2015149543A1 true WO2015149543A1 (zh) 2015-10-08

Family

ID=51040722

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/094277 WO2015149543A1 (zh) 2014-04-01 2014-12-18 语音识别方法和装置

Country Status (3)

Country Link
US (1) US9805712B2 (zh)
CN (1) CN103915092B (zh)
WO (1) WO2015149543A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105810190A (zh) * 2015-01-20 2016-07-27 哈曼国际工业有限公司 Automatic transcription of music content and real-time music accompaniment

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103915092B (zh) 2014-04-01 2019-01-25 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN105489222B (zh) 2015-12-11 2018-03-09 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN105913840A (zh) * 2016-06-20 2016-08-31 西可通信技术设备(河源)有限公司 Speech recognition apparatus and mobile terminal
CN106992007B (zh) * 2017-03-28 2020-07-28 百度在线网络技术(北京)有限公司 Data processing method and apparatus based on a speech recognition scoring system
KR20200059703A (ko) 2018-11-21 2020-05-29 삼성전자주식회사 Speech recognition method and speech recognition apparatus
CN109524017A (zh) * 2018-11-27 2019-03-26 北京分音塔科技有限公司 Speech recognition enhancement method and apparatus for user-defined words
CN110164416B (zh) * 2018-12-07 2023-05-09 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device, and storage medium
CN111583906B (zh) * 2019-02-18 2023-08-15 中国移动通信有限公司研究院 Role recognition method, apparatus, and terminal for voice conversations
US11295726B2 (en) 2019-04-08 2022-04-05 International Business Machines Corporation Synthetic narrowband data generation for narrowband automatic speech recognition systems
US20210224078A1 (en) * 2020-01-17 2021-07-22 Syntiant Systems and Methods for Generating Wake Signals from Known Users
CN113223500B (zh) * 2021-04-12 2022-02-25 北京百度网讯科技有限公司 Speech recognition method, method for training a speech recognition model, and corresponding apparatus
CN113611296A (zh) * 2021-08-20 2021-11-05 天津讯飞极智科技有限公司 Speech recognition apparatus and sound pickup device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088402A1 (en) * 1999-10-01 2003-05-08 Ibm Corp. Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope
CN1455387A (zh) * 2002-11-15 2003-11-12 中国科学院声学研究所 一种语音识别系统中的快速解码方法
US20040024599A1 (en) * 2002-07-31 2004-02-05 Intel Corporation Audio search conducted through statistical pattern matching
CN1607576A (zh) * 2002-11-15 2005-04-20 中国科学院声学研究所 一种语音识别系统
CN1920948A (zh) * 2005-08-24 2007-02-28 富士通株式会社 语音识别系统及语音处理系统
CN102063900A (zh) * 2010-11-26 2011-05-18 北京交通大学 克服混淆发音的语音识别方法及系统
CN103915092A (zh) * 2014-04-01 2014-07-09 百度在线网络技术(北京)有限公司 语音识别方法和装置

Family Cites Families (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ZA948426B (en) * 1993-12-22 1995-06-30 Qualcomm Inc Distributed voice recognition system
US5899973A (en) * 1995-11-04 1999-05-04 International Business Machines Corporation Method and apparatus for adapting the language model's size in a speech recognition system
US6223155B1 (en) * 1998-08-14 2001-04-24 Conexant Systems, Inc. Method of independently creating and using a garbage model for improved rejection in a limited-training speaker-dependent speech recognition system
JP3728177B2 (ja) * 2000-05-24 2005-12-21 キヤノン株式会社 Speech processing system, apparatus, method, and storage medium
US20050038647A1 (en) * 2003-08-11 2005-02-17 Aurilab, Llc Program product, method and system for detecting reduced speech
KR100679042B1 (ko) * 2004-10-27 2007-02-06 삼성전자주식회사 Speech recognition method and apparatus, and navigation system using the same
US8005675B2 (en) * 2005-03-17 2011-08-23 Nice Systems, Ltd. Apparatus and method for audio analysis
CA2654960A1 (en) * 2006-04-10 2008-12-24 Avaworks Incorporated Do-it-yourself photo realistic talking head creation system and method
ES2359430T3 (es) * 2006-04-27 2011-05-23 Mobiter Dicta Oy Method, system and device for voice conversion.
US7966183B1 (en) * 2006-05-04 2011-06-21 Texas Instruments Incorporated Multiplying confidence scores for utterance verification in a mobile telephone
US7822605B2 (en) * 2006-10-19 2010-10-26 Nice Systems Ltd. Method and apparatus for large population speaker identification in telephone interactions
US8635243B2 (en) * 2007-03-07 2014-01-21 Research In Motion Limited Sending a communications header with voice recording to send metadata for use in speech recognition, formatting, and search mobile search application
US10056077B2 (en) * 2007-03-07 2018-08-21 Nuance Communications, Inc. Using speech recognition results based on an unstructured language model with a music system
US8886545B2 (en) * 2007-03-07 2014-11-11 Vlingo Corporation Dealing with switch latency in speech recognition
US20090018826A1 (en) * 2007-07-13 2009-01-15 Berlin Andrew A Methods, Systems and Devices for Speech Transduction
US20090030676A1 (en) * 2007-07-26 2009-01-29 Creative Technology Ltd Method of deriving a compressed acoustic model for speech recognition
US8219404B2 (en) * 2007-08-09 2012-07-10 Nice Systems, Ltd. Method and apparatus for recognizing a speaker in lawful interception systems
CN101281745B (zh) * 2008-05-23 2011-08-10 深圳市北科瑞声科技有限公司 In-vehicle voice interaction system
CN101650886B (zh) * 2008-12-26 2011-05-18 中国科学院声学研究所 Method for automatically detecting reading errors of language learners
CN101604520A (zh) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 Spoken speech recognition method based on statistical models and grammar rules
US8190420B2 (en) * 2009-08-04 2012-05-29 Autonomy Corporation Ltd. Automatic spoken language identification based on phoneme sequence patterns
CN102231277A (zh) * 2011-06-29 2011-11-02 电子科技大学 Mobile terminal privacy protection method based on voiceprint recognition
CN102436816A (zh) * 2011-09-20 2012-05-02 安徽科大讯飞信息科技股份有限公司 Speech data decoding method and apparatus
CN102539154A (zh) * 2011-10-16 2012-07-04 浙江吉利汽车研究院有限公司 Engine fault diagnosis method and apparatus based on vector quantization analysis of exhaust noise
CN103365849B (zh) * 2012-03-27 2016-06-15 富士通株式会社 Keyword search method and device
US8909517B2 (en) * 2012-08-03 2014-12-09 Palo Alto Research Center Incorporated Voice-coded in-band data for interactive calls
US8990076B1 (en) * 2012-09-10 2015-03-24 Amazon Technologies, Inc. Front-end difference coding for distributed speech recognition
CN102982799A (zh) * 2012-12-20 2013-03-20 中国科学院自动化研究所 Speech recognition optimized decoding method incorporating guide probabilities
CN103325370B (zh) * 2013-07-01 2015-11-25 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition system
US9355636B1 (en) * 2013-09-16 2016-05-31 Amazon Technologies, Inc. Selective speech recognition scoring using articulatory features

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088402A1 (en) * 1999-10-01 2003-05-08 Ibm Corp. Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope
US20040024599A1 (en) * 2002-07-31 2004-02-05 Intel Corporation Audio search conducted through statistical pattern matching
CN1455387A (zh) * 2002-11-15 2003-11-12 中国科学院声学研究所 一种语音识别系统中的快速解码方法
CN1607576A (zh) * 2002-11-15 2005-04-20 中国科学院声学研究所 一种语音识别系统
CN1920948A (zh) * 2005-08-24 2007-02-28 富士通株式会社 语音识别系统及语音处理系统
CN102063900A (zh) * 2010-11-26 2011-05-18 北京交通大学 克服混淆发音的语音识别方法及系统
CN103915092A (zh) * 2014-04-01 2014-07-09 百度在线网络技术(北京)有限公司 语音识别方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105810190A (zh) * 2015-01-20 2016-07-27 哈曼国际工业有限公司 Automatic transcription of music content and real-time music accompaniment
CN105810190B (zh) * 2015-01-20 2021-02-12 哈曼国际工业有限公司 Automatic transcription of music content and real-time music accompaniment

Also Published As

Publication number Publication date
US20170011736A1 (en) 2017-01-12
CN103915092A (zh) 2014-07-09
US9805712B2 (en) 2017-10-31
CN103915092B (zh) 2019-01-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14887821

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14896588

Country of ref document: US

NENP Non-entry into the national phase
122 Ep: pct application non-entry in european phase

Ref document number: 14887821

Country of ref document: EP

Kind code of ref document: A1