WO2022134025A1 - An offline speech recognition method and apparatus, electronic device and readable storage medium - Google Patents

An offline speech recognition method and apparatus, electronic device and readable storage medium

Info

Publication number
WO2022134025A1
WO2022134025A1 (PCT/CN2020/139507; CN2020139507W)
Authority
WO
WIPO (PCT)
Prior art keywords
intent, text data, target, information, preset
Prior art date
Application number
PCT/CN2020/139507
Other languages
English (en)
French (fr)
Inventor
郝吉芳
宿绍勋
王炳乾
Original Assignee
京东方科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司
Priority to CN202080003684.4A (published as CN115104151A)
Priority to PCT/CN2020/139507 (published as WO2022134025A1)
Publication of WO2022134025A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present disclosure relates to the technical field of speech recognition, and in particular, to an offline speech recognition method and apparatus, an electronic device and a readable storage medium.
  • Speech recognition refers to the process of analyzing the input speech signal to obtain the meaning expressed by the speech signal.
  • In the related art, speech recognition relies on a network: the electronic device needs to establish a communication connection with a background server through the network, so that the speech recognition function is realized by the background server.
  • an embodiment of the present disclosure provides an offline speech recognition method, including the following steps:
  • the control instruction corresponding to the voice signal is determined according to the key information and the target intention.
  • the identifying the target intent of the text data includes:
  • the preset intent with the highest degree of matching with the semantic information is used as the target intent corresponding to the text data.
  • the preset intent includes at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
  • the extracting key information associated with the target intent in the text data includes:
  • according to the target intent, determining the preset information that matches the target intent among the plurality of preset information;
  • the information included in the target vocabulary is acquired as the key information.
  • the acquiring a voice signal and converting the voice signal into text data includes:
  • In a second aspect, an embodiment of the present disclosure provides an offline speech recognition device, including:
  • an acquisition conversion module for acquiring a voice signal and converting the voice signal into text data
  • an intent recognition module for identifying the target intent of the text data
  • a key information extraction module configured to extract key information associated with the target intent in the text data, the key information being matched with one of a plurality of preset information
  • a control instruction determination module configured to determine a control instruction corresponding to the voice signal according to the key information and the target intention.
  • the intent recognition module includes:
  • a vector conversion submodule for converting the text data into a digital vector through a pre-trained conversion model
  • a semantic information identification sub-module for identifying the semantic information corresponding to the digital vector
  • an intent matching submodule configured to determine the degree of matching between the semantic information and multiple preset intents
  • the intent determination sub-module is configured to use the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
  • the preset intent includes at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
  • the key information extraction module includes:
  • a preset information determination sub-module configured to determine the preset information corresponding to the target intent among the plurality of preset information according to the target intent
  • a marking sub-module configured to mark a plurality of words included in the text data, and determine the matching degree of each of the words and each of the preset information
  • a target vocabulary determination submodule used for taking the vocabulary with the highest matching degree with the preset information as the target vocabulary containing the key information
  • the key information acquisition sub-module is used for acquiring the information included in the target vocabulary as the key information.
  • the acquisition conversion module includes:
  • the acquisition sub-module is used to acquire the input voice signal
  • noise reduction sub-module configured to perform noise reduction processing on the voice signal to obtain a first signal
  • a text conversion submodule for converting the first signal into a first text through a pre-trained text conversion model
  • a correction submodule configured to correct abnormal data existing in the first text to obtain text data corresponding to the speech signal.
  • In a third aspect, embodiments of the present disclosure provide an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the offline speech recognition method according to any one of the first aspect.
  • In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the offline speech recognition method according to any one of the first aspect.
  • In the embodiments of the present disclosure, a voice signal is acquired and converted into text data; a target intent of the text data is identified; key information associated with the target intent is extracted from the text data; and a control instruction corresponding to the voice signal is determined according to the key information and the target intent.
  • By obtaining the target intent of the voice signal and the key information corresponding to that intent, and thereby determining the control instruction for the voice signal, the embodiments of the present disclosure can recognize the voice signal without relying on a background server.
  • In this way, offline devices that are not connected to a network can also perform speech recognition, which broadens the application range of speech recognition.
  • FIG. 1 is a flowchart of an offline speech recognition method provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a scenario of an offline speech recognition method provided by an embodiment of the present disclosure
  • FIG. 3 is another flowchart of an offline speech recognition method provided by an embodiment of the present disclosure.
  • FIG. 4 is a structural diagram of an offline speech recognition apparatus provided by an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide an offline speech recognition method.
  • the technical solution of this embodiment is applied to an electronic device, and it should be noted that the offline speech recognition in this embodiment refers to speech recognition without relying on network resources.
  • the electronic device can be offline or online.
  • the offline state means that the electronic device is not connected to external devices through wireless hotspots, mobile data networks or other means;
  • the online state means that the electronic device has established a communication connection with other devices through wireless hotspots, mobile data networks, or other means.
  • the offline speech recognition process does not depend on external data of the electronic device, and it can be understood that the speech recognition process in the embodiment of the present disclosure can be implemented regardless of whether the electronic device is in an offline state or an online state.
  • the offline speech recognition method includes the following steps:
  • Step 101 Acquire a voice signal and convert the voice signal into text data.
  • the voice signal in this embodiment refers to the voice signal input by the user to the electronic device.
  • In implementation, the input voice signal can be collected through a remote control with a sound collection function, a microphone, or a sound collection device built into the electronic device.
  • the voice signal is further converted into text.
  • the step 101 specifically includes:
  • noise reduction processing is first performed on the speech signal.
  • the purpose of the noise reduction processing is to eliminate noise, and the noise specifically includes external noise and internal noise.
  • the external noise refers to the noise from outside the electronic device, such as environmental noise, etc.
  • the internal noise refers to the music played by the electronic device itself, the noise generated by the application program running by itself, and the like.
  • External noise can be reduced by filtering, spectral subtraction, Wiener filtering, deep-learning-based denoising, and similar methods, while internal noise can be removed by echo cancellation based on the sound being played by the electronic device.
  • the first signal with relatively high quality can be obtained.
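The spectral-subtraction option mentioned above can be illustrated with a minimal sketch. All parameter choices (frame length, number of leading noise-only frames) are hypothetical and illustrative; this is not the patent's implementation.

```python
import numpy as np

def spectral_subtraction(signal, noise_frames=5, frame_len=256):
    """Denoise a 1-D signal by subtracting an estimated noise magnitude spectrum.

    The noise spectrum is estimated from the first `noise_frames` frames,
    which are assumed (for this sketch) to contain noise only.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    # Average magnitude of the leading frames serves as the noise estimate.
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    # Subtract the noise magnitude, floor at zero, and keep the original phase.
    mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)
    cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spectra)), n=frame_len, axis=1)
    return cleaned.reshape(-1)
```

A real system would add windowing and overlap-add; the sketch only shows the subtract-and-floor idea behind the method.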
  • the process of speech recognition mainly includes extracting the features of speech, and establishing a speech template required for speech recognition on this basis.
  • During recognition, the established speech templates are compared with the features of the input first signal, and according to a certain search and matching strategy, the speech template with the highest degree of matching with the first signal is found. Then, according to the definition of this template, the recognition result for the first signal can be given by looking up the table.
  • the training of the text conversion model is completed in advance.
  • In implementation, signal processing and knowledge mining can be performed on pre-collected speech and language databases to obtain the "acoustic model" and "language model" required by the speech recognition system for text conversion.
  • Model training yields a text conversion model that meets the needs of use, which is then installed in the electronic device.
  • the text conversion model is used to identify the user input signal.
  • the user input signal here may refer to the above-mentioned voice signal, or may be the above-mentioned first signal subjected to noise reduction processing.
  • The process of converting the speech signal into the first text can be understood as including two main processes: noise reduction processing and text recognition.
  • Noise reduction processing mainly covers endpoint detection (removing redundant silence and non-speech sounds), denoising, and feature extraction; text recognition mainly uses the trained "acoustic model" and "language model" to perform statistical pattern recognition (also called decoding) on the feature vectors of the user's speech, thereby obtaining the text information it contains.
  • In some embodiments, an adaptive feedback process may further follow text recognition; this feedback process is mainly used for self-learning of the user's speech, so as to make necessary corrections to the "acoustic model" and "language model" and further improve recognition accuracy.
  • The content of the obtained text can also be corrected, for example: correcting wrong homophones (e.g. 配副眼睛 to 配副眼镜); correcting words with similar pronunciation (e.g. 流浪织女 to 牛郎织女); correcting certain specific nouns according to a thesaurus (e.g. 伍迪艾伦 to 艾伦伍迪); correcting grammatical errors (e.g. 想象难以 to 难以想象); completing words or phrases (e.g. 如爱有天意 to 假如爱有天意); and correcting visually similar characters (e.g. 高梁 to 高粱).
  • This process can be implemented based on specific rules or by using a corresponding deep learning model; obviously, the underlying rules can be further expanded.
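A rule-based version of this correction step can be as simple as a confusion table applied to the recognized text. The entries below reuse examples from the text and are purely illustrative; a real system would use a far larger lexicon or a learned model.

```python
# Illustrative confusion table: known-wrong string -> corrected string.
CORRECTION_RULES = {
    "配副眼睛": "配副眼镜",  # homophone error
    "流浪织女": "牛郎织女",  # similar pronunciation
    "高梁": "高粱",          # visually similar character
}

def correct_text(text: str, rules: dict = CORRECTION_RULES) -> str:
    """Apply each substitution in the confusion table to the text."""
    for wrong, right in rules.items():
        text = text.replace(wrong, right)
    return text
```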
  • the corrected first text is used as the text corresponding to the input speech signal.
  • The above noise reduction process and the text correction step are not mandatory; either may be omitted as required to reduce the system load during speech recognition.
  • Step 102 Identify the target intent of the text data.
  • After the text data is obtained, the target intent corresponding to the text data is identified; this process can be understood as classifying the text data to determine the meaning it expresses and the specific purpose it intends to achieve.
  • the step 102 includes:
  • the preset intent with the highest degree of matching with the semantic information is used as the target intent corresponding to the text data.
  • the process of intent recognition may be implemented based on the Bert model.
  • The conversion model of the BERT architecture is a model pre-trained to generate word vectors: it converts natural-language text into digital vectors and then identifies the corresponding semantic information, which can increase the generalization ability of the word vector model and fully describe character-level, word-level, sentence-level, and even inter-sentence relational features.
  • The intent recognition process can also be implemented by regular-expression matching, a BiLSTM-based similarity calculation model, or the like, which is not further limited here.
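The regular-expression alternative mentioned here can be sketched in a few lines. The keyword patterns below are hypothetical stand-ins, not taken from the disclosure.

```python
import re

# Hypothetical keyword patterns, one per preset intent.
INTENT_PATTERNS = {
    "volume adjustment": re.compile(r"音量|volume"),
    "brightness adjustment": re.compile(r"亮度|brightness"),
    "shutdown control": re.compile(r"关机|shut\s?down"),
}

def match_intent(text: str):
    """Return the first preset intent whose pattern occurs in the text."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None
```

Such a matcher trades accuracy for a tiny footprint, which fits the offline, resource-constrained setting the embodiment describes.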
  • the matching degree between the semantic information and the plurality of preset intentions is determined.
  • the electronic device may be an electronic device such as an all-in-one conference machine, a smart screen, and a home device.
  • the preset intent includes at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
  • More specifically, in one embodiment, only the above five preset intents are provided; during speech recognition, the recognized semantic information is matched against these preset intents, and the preset intent with the highest matching degree is selected as the target intent corresponding to the text data, which helps reduce the amount of computation and improve recognition accuracy.
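Selecting the highest-scoring preset intent can be sketched as a softmax classification over a semantic feature vector. The weights `W` and bias `b` stand in for a trained model and are assumptions of this sketch.

```python
import numpy as np

PRESET_INTENTS = ["network connection control", "shutdown control",
                  "volume adjustment", "brightness adjustment",
                  "signal source adjustment"]

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_intent(h, W, b):
    """Return (intent, probabilities) for feature vector h under weights W, b."""
    probs = softmax(W @ h + b)
    return PRESET_INTENTS[int(np.argmax(probs))], probs
```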
  • Step 103 Extract key information associated with the target intent in the text data, where the key information matches one of multiple preset information.
  • the key information in the text data is extracted.
  • one or more preset information that matches each preset intent is set.
  • In implementation, the text data is searched to determine whether corresponding key information exists.
  • the text data obtained according to the voice signal is "adjust the volume to 60"
  • the target intention corresponding to the voice signal obtained through intention recognition is volume adjustment
  • the preset information corresponding to volume adjustment includes volume increase, volume decrease, mute, and adjustment to a specified volume.
  • the step 103 specifically includes:
  • according to the target intent, determining the preset information that matches the target intent among the plurality of preset information;
  • the information included in the target vocabulary is acquired as the key information.
  • the acquisition of key information may be achieved by filling slots.
  • preset information matching the target intent among the plurality of preset information is determined.
  • the preset information corresponding to volume adjustment is volume increase, volume reduction, mute, and adjusting to a specified volume
  • the preset information corresponding to brightness adjustment is brightness increase and brightness decrease; when the target intent is determined to be volume adjustment, the preset information matching that intent is the four items: volume increase, volume decrease, mute, and adjust to the specified volume.
  • the degree of matching between the vocabulary and the preset information is determined.
  • the matching degree of each of the four words "will", "volume", "adjust to", and "60" with each of the four preset information items "volume up", "volume down", "mute", and "adjust to the specified volume" is determined one by one.
  • the matching degree between "60" and "adjust to the specified volume" is the highest; therefore, the word "60" is taken as the target word, the information contained in "60" is further obtained as the specific volume value 60, and this information is used as the key information.
  • Similarly, the matching degree of each word with the preset information may be calculated by methods including, but not limited to, the softmax algorithm.
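A toy slot-filling pass for the "adjust to the specified volume" example might look like the following. It is regex-based rather than model-based, purely for illustration; the slot names mirror the preset information listed above.

```python
import re

# Preset information (slots) per intent, following the example in the text.
PRESET_INFO = {
    "volume adjustment": ["volume up", "volume down", "mute",
                          "adjust to the specified volume"],
}

def fill_slot(text: str, target_intent: str):
    """Return (slot, value) extracted from the text for the target intent."""
    if target_intent == "volume adjustment":
        number = re.search(r"\d+", text)
        if number:
            # A bare number fills the "specified volume" slot.
            return "adjust to the specified volume", int(number.group())
        if "静音" in text or "mute" in text:
            return "mute", None
    return None, None
```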
  • Step 104 Determine a control instruction corresponding to the voice signal according to the key information and the target intention.
  • the intention is volume adjustment
  • the key information is the volume value 60; therefore, the corresponding control instruction is obtained: adjust the volume to 60.
  • the electronic device may be further controlled to execute the control instruction to adjust the volume to 60.
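Combining the target intent with the key information into a control instruction can then be a simple dispatch. The command names below are hypothetical, not a real device API.

```python
def build_control_instruction(intent: str, slot: str, value=None):
    """Map (intent, slot, value) to a device control instruction."""
    if intent == "volume adjustment":
        if slot == "adjust to the specified volume":
            return {"command": "set_volume", "value": value}
        if slot == "mute":
            return {"command": "set_volume", "value": 0}
    raise ValueError(f"no control instruction for {intent!r}/{slot!r}")
```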
  • In this way, by obtaining the target intent of the voice signal and the key information corresponding to that intent, and thereby determining the control instruction for the voice signal, the embodiments of the present disclosure can recognize the voice signal without relying on a background server.
  • Thus, offline devices that are not connected to a network can also perform speech recognition, which broadens the application range of speech recognition.
  • The technical solution of this embodiment can be implemented without a network; compared with online speech recognition based on a background server, its response is faster, its cost is lower, and it is more convenient to use.
  • Embodiments of the present disclosure provide an offline speech recognition apparatus.
  • the offline speech recognition apparatus 400 includes:
  • an acquisition conversion module 401 for acquiring a voice signal and converting the voice signal into text data
  • Intention recognition module 402 used for recognizing the target intention of the text data
  • a key information extraction module 403, configured to extract key information associated with the target intent in the text data
  • the control instruction determining module 404 is configured to determine the control instruction corresponding to the voice signal according to the key information and the target intention.
  • the intent recognition module 402 includes:
  • a vector conversion submodule for converting the text data into a digital vector through a pre-trained conversion model
  • a semantic information identification sub-module for identifying the semantic information corresponding to the digital vector
  • an intent matching sub-module for determining the matching degree between the semantic information and multiple preset intents
  • the intent determination sub-module is configured to use the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
  • the preset intent includes at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
  • the key information extraction module 403 includes:
  • a preset information determination sub-module configured to determine the preset information corresponding to the target intent among the plurality of preset information according to the target intent
  • a marking submodule configured to mark a plurality of words included in the text data, and determine the matching degree of each of the words and each of the preset information
  • a target vocabulary determination submodule used for taking the vocabulary with the highest matching degree with the preset information as the target vocabulary containing the key information
  • the key information acquisition sub-module is used for acquiring the information included in the target vocabulary as the key information.
  • the acquisition and conversion module 401 includes:
  • the acquisition sub-module is used to acquire the input voice signal
  • noise reduction sub-module configured to perform noise reduction processing on the voice signal to obtain a first signal
  • a text conversion submodule for converting the first signal into a first text through a pre-trained text conversion model
  • a correction submodule configured to correct abnormal data existing in the first text to obtain text data corresponding to the speech signal.
  • the offline speech recognition apparatus in this embodiment can implement each step of the above-mentioned offline speech recognition method embodiment, and can achieve basically the same or similar technical effects, which will not be repeated here.
  • An embodiment of the present disclosure further provides a mobile terminal, including a processor, a memory, and a computer program stored in the memory and executable on the processor; when executed by the processor, the computer program implements each process of the above offline speech recognition method embodiments and can achieve the same technical effects, which will not be repeated here to avoid repetition.
  • Embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, each process of the above offline speech recognition method embodiments is implemented with the same technical effects, which will not be repeated here to avoid repetition.
  • The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • the disclosed apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical functional division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments of the present disclosure.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

An offline speech recognition method and apparatus (400), an electronic device, and a readable storage medium. The offline speech recognition method includes: acquiring a voice signal and converting the voice signal into text data (101); identifying a target intent of the text data (102); extracting key information associated with the target intent from the text data, the key information matching one of a plurality of preset information items (103); and determining a control instruction corresponding to the voice signal according to the key information and the target intent (104). By obtaining the target intent of the voice signal and the key information corresponding to the target intent, and thereby determining the control instruction for the voice signal, recognition of the voice signal can be achieved without relying on a background server; in this way, offline devices that are not connected to a network can also perform speech recognition, which broadens the application range of speech recognition.

Description

An offline speech recognition method and apparatus, electronic device, and readable storage medium
Technical Field
The present disclosure relates to the technical field of speech recognition, and in particular to an offline speech recognition method and apparatus, an electronic device, and a readable storage medium.
Background Art
Speech recognition refers to the process of analyzing an input speech signal to obtain the meaning expressed by the speech signal. In the related art, speech recognition relies on a network: an electronic device needs to establish a communication connection with a background server through the network, so that the speech recognition function is realized by the background server.
Summary of the Invention
In a first aspect, an embodiment of the present disclosure provides an offline speech recognition method, including the following steps:
acquiring a voice signal, and converting the voice signal into text data;
identifying a target intent of the text data;
extracting key information associated with the target intent from the text data, the key information matching one of a plurality of preset information items;
determining a control instruction corresponding to the voice signal according to the key information and the target intent.
Optionally, the identifying a target intent of the text data includes:
converting the text data into a digital vector through a pre-trained conversion model;
identifying semantic information corresponding to the digital vector;
determining the degree of matching between the semantic information and a plurality of preset intents;
taking the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
Optionally, the preset intents include at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
Optionally, the extracting key information associated with the target intent from the text data includes:
according to the target intent, determining, among the plurality of preset information items, the preset information that matches the target intent;
marking a plurality of words included in the text data, and determining the degree of matching of each word with each preset information item;
taking the word with the highest degree of matching with the preset information as the target word containing the key information;
acquiring the information included in the target word as the key information.
Optionally, the acquiring a voice signal and converting the voice signal into text data includes:
acquiring an input voice signal;
performing noise reduction processing on the voice signal to obtain a first signal;
converting the first signal into a first text through a pre-trained text conversion model;
correcting abnormal data in the first text to obtain the text data corresponding to the voice signal.
In a second aspect, an embodiment of the present disclosure provides an offline speech recognition apparatus, including:
an acquisition and conversion module, configured to acquire a voice signal and convert the voice signal into text data;
an intent recognition module, configured to identify a target intent of the text data;
a key information extraction module, configured to extract key information associated with the target intent from the text data, the key information matching one of a plurality of preset information items;
a control instruction determination module, configured to determine a control instruction corresponding to the voice signal according to the key information and the target intent.
Optionally, the intent recognition module includes:
a vector conversion submodule, configured to convert the text data into a digital vector through a pre-trained conversion model;
a semantic information identification submodule, configured to identify semantic information corresponding to the digital vector;
an intent matching submodule, configured to determine the degree of matching between the semantic information and a plurality of preset intents;
an intent determination submodule, configured to take the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
Optionally, the preset intents include at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
Optionally, the key information extraction module includes:
a preset information determination submodule, configured to determine, according to the target intent, the preset information that matches the target intent among the plurality of preset information items;
a marking submodule, configured to mark a plurality of words included in the text data and determine the degree of matching of each word with each preset information item;
a target word determination submodule, configured to take the word with the highest degree of matching with the preset information as the target word containing the key information;
a key information acquisition submodule, configured to acquire the information included in the target word as the key information.
Optionally, the acquisition and conversion module includes:
an acquisition submodule, configured to acquire an input voice signal;
a noise reduction submodule, configured to perform noise reduction processing on the voice signal to obtain a first signal;
a text conversion submodule, configured to convert the first signal into a first text through a pre-trained text conversion model;
a correction submodule, configured to correct abnormal data in the first text to obtain the text data corresponding to the voice signal.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the offline speech recognition method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the offline speech recognition method according to any one of the first aspect.
In the embodiments of the present disclosure, a voice signal is acquired and converted into text data; a target intent of the text data is identified; key information associated with the target intent is extracted from the text data; and a control instruction corresponding to the voice signal is determined according to the key information and the target intent. In this way, by obtaining the target intent of the voice signal and the key information corresponding to the target intent, and thereby determining the control instruction for the voice signal, the embodiments of the present disclosure can recognize the voice signal without relying on a background server, so that offline devices that are not connected to a network can also perform speech recognition, which broadens the application range of speech recognition.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of an offline speech recognition method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a scenario of an offline speech recognition method provided by an embodiment of the present disclosure;
FIG. 3 is another flowchart of an offline speech recognition method provided by an embodiment of the present disclosure;
FIG. 4 is a structural diagram of an offline speech recognition apparatus provided by an embodiment of the present disclosure.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Embodiments of the present disclosure provide an offline speech recognition method.
The technical solution of this embodiment is applied to an electronic device. It should be noted that offline speech recognition in this embodiment refers to speech recognition that does not rely on network resources. The electronic device may be in an offline state or an online state. The offline state means that the electronic device has no data connection with external devices through wireless hotspots, mobile data networks, or other means; the online state means that the electronic device has established a communication connection with other devices through wireless hotspots, mobile data networks, or other means.
In this embodiment, the offline speech recognition process does not depend on data external to the electronic device. It can be understood that the speech recognition process in the embodiments of the present disclosure can be implemented regardless of whether the electronic device is offline or online.
As shown in FIG. 1, in one embodiment, the offline speech recognition method includes the following steps:
Step 101: acquire a voice signal and convert the voice signal into text data.
As shown in FIG. 2, the voice signal in this embodiment refers to the voice signal input by the user to the electronic device. In implementation, the input voice signal can be collected through a remote control with a sound collection function, a microphone, or a sound collection device built into the electronic device.
After the voice signal is collected, it is further converted into text.
In one embodiment, step 101 specifically includes:
acquiring an input voice signal;
performing noise reduction processing on the voice signal to obtain a first signal;
converting the first signal into a first text through a pre-trained text conversion model;
correcting abnormal data in the first text to obtain the text data corresponding to the voice signal.
As shown in FIG. 3, after the input voice signal is acquired, noise reduction processing is first performed on the voice signal. The purpose of the noise reduction processing is to eliminate noise, which specifically includes external noise and internal noise. External noise refers to noise from outside the electronic device, such as environmental noise; internal noise refers to music played by the electronic device itself, noise generated by applications running on it, and the like. External noise can be reduced by filtering, spectral subtraction, Wiener filtering, deep-learning-based denoising, and similar methods, while internal noise can be removed by echo cancellation based on the sound played by the electronic device.
After the noise reduction processing, a first signal of relatively high quality can be obtained.
Next, the first signal is converted into a first text. In this embodiment, the speech recognition process mainly includes extracting speech features and, on this basis, establishing the speech templates required for speech recognition.
During recognition, the text conversion model used for speech recognition compares the established speech templates with the features of the input first signal and, according to a certain search and matching strategy, finds the speech template with the highest degree of matching with the first signal. Then, according to the definition of this template, the recognition result for the first signal can be given by looking up the table.
The training of the text conversion model is completed in advance. In implementation, signal processing and knowledge mining can be performed on pre-collected speech and language databases to obtain the "acoustic model" and "language model" required by the speech recognition system, so as to train the text conversion model, obtain a text conversion model that meets the needs of use, and then install it in the electronic device.
During application, the text conversion model is used to recognize the user input signal. It should be noted that the user input signal here may refer to the above voice signal or to the above first signal after noise reduction processing.
It can be understood that the process of converting the voice signal into the first text includes two main processes: noise reduction processing and text recognition.
Noise reduction processing mainly covers endpoint detection (removing redundant silence and non-speech sounds), denoising, and feature extraction; text recognition mainly uses the trained "acoustic model" and "language model" to perform statistical pattern recognition (also called decoding) on the feature vectors of the user's speech, thereby obtaining the text information it contains.
In some of these embodiments, an adaptive feedback process may further follow text recognition. This feedback process is mainly used for self-learning of the user's speech, so as to make necessary corrections to the "acoustic model" and "language model" and further improve recognition accuracy.
After the first text is obtained, its content can also be corrected, for example: correcting wrong homophones (e.g. 配副眼睛 to 配副眼镜); correcting words with similar pronunciation (e.g. 流浪织女 to 牛郎织女); correcting certain specific nouns according to a thesaurus (e.g. 伍迪艾伦 to 艾伦伍迪); correcting grammatical errors (e.g. 想象难以 to 难以想象); completing words or phrases (e.g. 如爱有天意 to 假如爱有天意); and correcting visually similar characters (e.g. 高梁 to 高粱). This process can be implemented based on specific rules or by using a corresponding deep learning model; obviously, the underlying rules can be further expanded.
In this embodiment, the corrected first text is used as the text corresponding to the input voice signal. In some other embodiments, the above noise reduction process and the text correction step are not mandatory; they may be omitted as required to reduce the system load during speech recognition.
Step 102: identify the target intent of the text data.
As shown in FIG. 3, after the text data is obtained, the target intent corresponding to the text data is identified. This process can be understood as classifying the text data to determine the meaning it expresses and the specific purpose it intends to achieve.
In some of these embodiments, step 102 includes:
converting the text data into a digital vector through a pre-trained conversion model;
identifying semantic information corresponding to the digital vector;
determining the degree of matching between the semantic information and a plurality of preset intents;
taking the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
In this embodiment, the intent recognition process can be implemented based on the BERT model. The conversion model of the BERT architecture is a model pre-trained to generate word vectors: it converts natural-language text into digital vectors and then identifies the corresponding semantic information, which can increase the generalization ability of the word vector model and fully describe character-level, word-level, sentence-level, and even inter-sentence relational features. Obviously, the intent recognition process can also be implemented by regular-expression matching, a BiLSTM-based similarity calculation model, or the like, which is not further limited here.
In some of these embodiments, intent recognition can be implemented through a softmax classifier. For example, a classification function y_i = softmax(W_i h_1 + b_i) can be set, where y_i is the probability that the intent is classified into the i-th class, W_i is a weight, h_1 is the data set, and b_i is a bias vector. For the softmax algorithm itself, reference may be made to the related art; it is not further limited or described here.
After the semantic information corresponding to the digital vector is obtained, the degree of matching between the semantic information and the plurality of preset intents is determined.
It should be understood that, since the technical solution of this embodiment is used to realize offline speech recognition and is limited by factors such as hardware performance, the available computing power is limited. Therefore, in this embodiment, a certain number of preset intents are set, and speech recognition and control functions are provided mainly for these preset intents.
As shown in FIG. 3, in one embodiment, the electronic device may be an all-in-one conference machine, a smart screen, a home device, or the like. The preset intents include at least one of network connection control, shutdown control, volume adjustment, brightness adjustment, and signal source adjustment.
More specifically, in one of the embodiments, only the above five preset intents are set. During speech recognition, the recognized semantic information is matched against the above preset intents, and the preset intent with the highest matching degree is selected as the target intent corresponding to the text data, which helps reduce the amount of computation and improve recognition accuracy.
Step 103: extract key information associated with the target intent from the text data, the key information matching one of a plurality of preset information items.
After the target intent is determined, the key information in the text data is extracted. In this embodiment, one or more matching preset information items are set for each preset intent; in implementation, the text data is searched to determine whether corresponding key information exists.
Exemplarily, in one embodiment, the text data obtained from the voice signal is "将音量调到60" ("set the volume to 60"), the target intent obtained through intent recognition is volume adjustment, and the preset information corresponding to volume adjustment includes four items: volume increase, volume decrease, mute, and adjustment to a specified volume. After the text data is obtained, it is checked whether key information matching the preset information exists in the text data. In this embodiment, "60" is recognized as matching the preset information "adjust to the specified volume", and therefore "60" is taken as the corresponding key information.
In some embodiments, step 103 specifically includes:
determining, according to the target intent, the preset information matching the target intent among the plurality of pieces of preset information;
labeling a plurality of tokens contained in the text data, and determining the degree of matching between each token and each piece of preset information;
taking the token with the highest degree of matching with the preset information as a target token containing the key information; and
obtaining the information contained in the target token as the key information.
In some embodiments, key information is obtained through slot filling. In this embodiment, after the target intent is determined, the preset information matching the target intent is identified among the plurality of pieces of preset information.
Illustratively, the preset information corresponding to volume adjustment is increase volume, decrease volume, mute, and set to a specified volume, while the preset information corresponding to brightness adjustment is increase brightness and decrease brightness. When the target intent is determined to be volume adjustment, the matching preset information comprises the four items: increase volume, decrease volume, mute, and set to a specified volume.
Next, the tokens contained in the text data are labeled. For example, for 将音量调到60, the labeled tokens may be 将, 音量, 调到, and 60; in this process, some or all of the tokens in the text data may be labeled.
After token labeling is completed, the degree of matching between each token and the preset information is determined. Illustratively, in this embodiment, the pairwise matching degree is determined between each of the four tokens 将, 音量, 调到, and 60 and each of the four pieces of preset information: increase volume, decrease volume, mute, and set to a specified volume.
In this embodiment, the matching degree between "60" and "set to a specified volume" is the highest, so the token "60" is taken as the target token, and the information it contains, namely the specific volume value 60, is further obtained as the key information.
Similarly to the process above, the degree of matching between each token and the preset information can be computed with methods including but not limited to the softmax algorithm described above.
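The slot-filling process described above can be sketched with a pattern-based version. The disclosure's slot filler is learned; this sketch substitutes hypothetical regular expressions per intent, which only approximates the token-by-token matching:

```python
import re

# Hypothetical slot patterns per preset intent; a learned slot-filling
# model would score tokens instead of using hand-written regexes.
SLOT_PATTERNS = {
    "volume_adjust": {
        "set_to_value": re.compile(r"调到\s*(\d+)"),
        "increase":     re.compile(r"(调大|增大|大声)"),
        "decrease":     re.compile(r"(调小|减小|小声)"),
        "mute":         re.compile(r"静音"),
    },
}

def extract_key_info(intent, text):
    """Return (matched preset info, captured value) for the first hit."""
    for info, pattern in SLOT_PATTERNS.get(intent, {}).items():
        m = pattern.search(text)
        if m:
            value = m.group(1) if m.groups() else None
            return info, value
    return None, None

info, value = extract_key_info("volume_adjust", "将音量调到60")
```

Scoping the patterns by intent mirrors the step of first narrowing the preset information to the target intent, which keeps the search space small on an offline device.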
Step 104: determine the control instruction corresponding to the speech signal according to the key information and the target intent.
After the intent and the key information are obtained, the corresponding control instruction is determined. For example, in this embodiment, the intent is volume adjustment and the key information is the volume value 60, so the corresponding control instruction is to set the volume to 60.
As shown in FIG. 2 and FIG. 3, after the control instruction is determined, the electronic device can further be controlled to execute the instruction, setting the volume to 60.
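Combining the intent and the key information into an executable instruction can be sketched as a small mapping. The command dictionary shape and intent names are assumptions for illustration, not a device API defined by the disclosure:

```python
def build_command(intent, key_info):
    """Map (target intent, key information) to a device control instruction."""
    if intent == "volume_adjust" and key_info is not None:
        return {"action": "set_volume", "value": int(key_info)}
    if intent == "power_off":
        return {"action": "shutdown"}
    raise ValueError(f"unsupported intent: {intent}")

cmd = build_command("volume_adjust", "60")   # the running example from the text
```

Raising on unknown intents makes the limited offline vocabulary explicit instead of silently ignoring unmatched utterances.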
In this way, the embodiments of the present disclosure obtain the target intent of a speech signal and the key information corresponding to that intent, and thereby determine the control instruction of the speech signal, so the speech signal can be recognized without relying on a back-end server. Offline devices without a network connection can thus also perform speech recognition, which broadens the range of applications of speech recognition.
In addition, the technical solution of this embodiment requires no network; compared with online speech recognition based on a back-end server, it responds faster, costs less, and is more convenient to use.
An embodiment of the present disclosure provides an offline speech recognition apparatus.
As shown in FIG. 4, in one embodiment, the offline speech recognition apparatus 400 includes:
an acquisition and conversion module 401, configured to acquire a speech signal and convert the speech signal into text data;
an intent recognition module 402, configured to recognize the target intent of the text data;
a key information extraction module 403, configured to extract key information associated with the target intent from the text data; and
a control instruction determination module 404, configured to determine the control instruction corresponding to the speech signal according to the key information and the target intent.
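The four modules above can be chained into a minimal pipeline. The class below is a structural sketch only: each method is a stub standing in for a trained model or extraction module (module 401's ASR is replaced by an identity function, and the intent and key-info logic is a toy heuristic):

```python
class OfflineVoiceRecognizer:
    """Minimal pipeline mirroring modules 401-404; stubs replace the models."""

    def to_text(self, signal):
        # Module 401 stub: real implementation runs denoising + ASR.
        return signal

    def recognize_intent(self, text):
        # Module 402 stub: toy keyword rule instead of a BERT classifier.
        return "volume_adjust" if "音量" in text else "unknown"

    def extract_key_info(self, text):
        # Module 403 stub: pull out any digits as the candidate value.
        digits = "".join(ch for ch in text if ch.isdigit())
        return digits or None

    def to_command(self, signal):
        # Module 404: combine intent and key information into an instruction.
        text = self.to_text(signal)
        intent = self.recognize_intent(text)
        return {"intent": intent, "key_info": self.extract_key_info(text)}

cmd = OfflineVoiceRecognizer().to_command("将音量调到60")
```

Keeping each stage behind its own method mirrors the module boundaries of apparatus 400, so any single stage can be swapped for a trained model without touching the others.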
In some embodiments, the intent recognition module 402 includes:
a vector conversion submodule, configured to convert the text data into a numeric vector through a pretrained conversion model;
a semantic information recognition submodule, configured to recognize the semantic information corresponding to the numeric vector;
an intent matching submodule, configured to determine the degree of matching between the semantic information and a plurality of preset intents; and
an intent determination submodule, configured to take the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
In some embodiments, the preset intents include at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal-source adjustment.
In some embodiments, the key information extraction module 403 includes:
a preset information determination submodule, configured to determine, according to the target intent, the preset information matching the target intent among the plurality of pieces of preset information;
a labeling submodule, configured to label a plurality of tokens contained in the text data and determine the degree of matching between each token and each piece of preset information;
a target token determination submodule, configured to take the token with the highest degree of matching with the preset information as a target token containing the key information; and
a key information acquisition submodule, configured to obtain the information contained in the target token as the key information.
In some embodiments, the acquisition and conversion module 401 includes:
an acquisition submodule, configured to acquire an input speech signal;
a noise reduction submodule, configured to perform noise reduction on the speech signal to obtain a first signal;
a text conversion submodule, configured to convert the first signal into a first text through a pretrained text conversion model; and
a correction submodule, configured to correct abnormal data in the first text to obtain the text data corresponding to the speech signal.
The offline speech recognition apparatus in this embodiment can implement each step of the above offline speech recognition method embodiments and achieve essentially the same or similar technical effects, which are not repeated here.
An embodiment of the present disclosure further provides a mobile terminal, including a processor, a memory, and a computer program stored on the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above offline speech recognition method embodiments and achieves the same technical effects; to avoid repetition, details are not repeated here.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the above offline speech recognition method embodiments and achieves the same technical effects; to avoid repetition, details are not repeated here. The computer-readable storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The embodiments of the present disclosure acquire a speech signal and convert the speech signal into text data; recognize the target intent of the text data; extract key information associated with the target intent from the text data; and determine the control instruction corresponding to the speech signal according to the key information and the target intent. In this way, by obtaining the target intent of the speech signal and the key information corresponding to that intent, and thereby determining the control instruction of the speech signal, the speech signal can be recognized without relying on a back-end server, so offline devices without a network connection can also perform speech recognition, which broadens the range of applications of speech recognition.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed herein shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

  1. An offline speech recognition method, comprising the following steps:
    acquiring a speech signal, and converting the speech signal into text data;
    recognizing a target intent of the text data;
    extracting key information associated with the target intent from the text data, the key information matching one of a plurality of pieces of preset information; and
    determining a control instruction corresponding to the speech signal according to the key information and the target intent.
  2. The method according to claim 1, wherein the recognizing a target intent of the text data comprises:
    converting the text data into a numeric vector through a pretrained conversion model;
    recognizing semantic information corresponding to the numeric vector;
    determining a degree of matching between the semantic information and a plurality of preset intents; and
    taking the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
  3. The method according to claim 2, wherein the preset intents comprise at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal-source adjustment.
  4. The method according to claim 2 or 3, wherein the extracting key information associated with the target intent from the text data comprises:
    determining, according to the target intent, the preset information matching the target intent among the plurality of pieces of preset information;
    labeling a plurality of tokens contained in the text data, and determining a degree of matching between each token and each piece of preset information;
    taking the token with the highest degree of matching with the preset information as a target token containing the key information; and
    obtaining the information contained in the target token as the key information.
  5. The method according to claim 1, wherein the acquiring a speech signal, and converting the speech signal into text data comprises:
    acquiring an input speech signal;
    performing noise reduction on the speech signal to obtain a first signal;
    converting the first signal into a first text through a pretrained text conversion model; and
    correcting abnormal data in the first text to obtain the text data corresponding to the speech signal.
  6. An offline speech recognition apparatus, comprising:
    an acquisition and conversion module, configured to acquire a speech signal and convert the speech signal into text data;
    an intent recognition module, configured to recognize a target intent of the text data;
    a key information extraction module, configured to extract key information associated with the target intent from the text data, the key information matching one of a plurality of pieces of preset information; and
    a control instruction determination module, configured to determine a control instruction corresponding to the speech signal according to the key information and the target intent.
  7. The apparatus according to claim 6, wherein the intent recognition module comprises:
    a vector conversion submodule, configured to convert the text data into a numeric vector through a pretrained conversion model;
    a semantic information recognition submodule, configured to recognize semantic information corresponding to the numeric vector;
    an intent matching submodule, configured to determine a degree of matching between the semantic information and a plurality of preset intents; and
    an intent determination submodule, configured to take the preset intent with the highest degree of matching with the semantic information as the target intent corresponding to the text data.
  8. The apparatus according to claim 7, wherein the preset intents comprise at least one of network connection control, power-off control, volume adjustment, brightness adjustment, and signal-source adjustment.
  9. The apparatus according to claim 7 or 8, wherein the key information extraction module comprises:
    a preset information determination submodule, configured to determine, according to the target intent, the preset information matching the target intent among the plurality of pieces of preset information;
    a labeling submodule, configured to label a plurality of tokens contained in the text data and determine a degree of matching between each token and each piece of preset information;
    a target token determination submodule, configured to take the token with the highest degree of matching with the preset information as a target token containing the key information; and
    a key information acquisition submodule, configured to obtain the information contained in the target token as the key information.
  10. The apparatus according to claim 6, wherein the acquisition and conversion module comprises:
    an acquisition submodule, configured to acquire an input speech signal;
    a noise reduction submodule, configured to perform noise reduction on the speech signal to obtain a first signal;
    a text conversion submodule, configured to convert the first signal into a first text through a pretrained text conversion model; and
    a correction submodule, configured to correct abnormal data in the first text to obtain the text data corresponding to the speech signal.
  11. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the offline speech recognition method according to any one of claims 1 to 5.
  12. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the offline speech recognition method according to any one of claims 1 to 5.
PCT/CN2020/139507 2020-12-25 2020-12-25 Offline speech recognition method and apparatus, electronic device, and readable storage medium WO2022134025A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080003684.4A CN115104151A (zh) 2020-12-25 2020-12-25 Offline speech recognition method and apparatus, electronic device, and readable storage medium
PCT/CN2020/139507 WO2022134025A1 (zh) 2020-12-25 2020-12-25 Offline speech recognition method and apparatus, electronic device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/139507 WO2022134025A1 (zh) 2020-12-25 2020-12-25 Offline speech recognition method and apparatus, electronic device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022134025A1 true WO2022134025A1 (zh) 2022-06-30

Family

ID=82157161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139507 WO2022134025A1 (zh) 2020-12-25 2020-12-25 Offline speech recognition method and apparatus, electronic device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN115104151A (zh)
WO (1) WO2022134025A1 (zh)

Also Published As

Publication number Publication date
CN115104151A (zh) 2022-09-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20966590

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.10.2023)