WO2020228270A1 - Speech processing method, device, computer equipment and storage medium - Google Patents


Info

Publication number
WO2020228270A1
Authority
WO
WIPO (PCT)
Prior art keywords
real-time environmental
sound signal
environmental sound
user
Prior art date
Application number
PCT/CN2019/116513
Other languages
English (en)
French (fr)
Inventor
王健宗
贾雪丽
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020228270A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 2015/223 Execution procedure of a spoken command

Definitions

  • This application relates to the field of speech processing, and in particular to a speech processing method, device, computer equipment and storage medium.
  • Some existing speech recognition systems rely on speech for activation. Such voice interaction systems typically work by recognizing keywords in the user's speech. For example, for a smart speaker with a voice interaction function, the wake-up keyword may be set to "Hello". When the user says "Hello" near the smart speaker, the speaker's voice recognition module detects the "Hello" voice while in keyword monitoring mode, then switches to work mode (that is, from keyword monitoring mode to voice recognition mode) to monitor the voice commands issued by the user (voice commands can instruct the smart speaker to play music, broadcast news, and so on).
  • a voice processing method, including:
  • buffering real-time environmental sound signals through an audio buffer;
  • detecting whether the real-time environmental sound signal contains a designated keyword;
  • if it is detected that the real-time environmental sound signal contains the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
  • converting the user spoken instruction into a machine logic instruction; and
  • sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
  • a voice processing device, including:
  • a buffer module, used to buffer real-time environmental sound signals through the audio buffer;
  • a detection module, used to detect whether the real-time environmental sound signal contains designated keywords;
  • a recognition module configured to, if it is detected that the real-time environmental sound signal contains the designated keyword, recognize the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
  • an instruction conversion module, used to convert the user spoken instruction into a machine logic instruction; and
  • an execution module, used to send the machine logic instruction to the execution device so that the execution device executes the machine logic instruction.
  • a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • buffering real-time environmental sound signals through an audio buffer;
  • detecting whether the real-time environmental sound signal contains a designated keyword;
  • if it is detected that the real-time environmental sound signal contains the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
  • converting the user spoken instruction into a machine logic instruction; and
  • sending the machine logic instruction to the execution device, so that the execution device executes the machine logic instruction.
  • one or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to execute the following steps:
  • buffering real-time environmental sound signals through an audio buffer;
  • detecting whether the real-time environmental sound signal contains a designated keyword;
  • if it is detected that the real-time environmental sound signal contains the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
  • converting the user spoken instruction into a machine logic instruction; and
  • sending the machine logic instruction to the execution device, so that the execution device executes the machine logic instruction.
  • FIG. 1 is a schematic diagram of an application environment of a voice processing method in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice processing method in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a voice processing method in an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice processing method in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a voice processing method in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a voice processing method in an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a voice processing method in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a voice processing device in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a voice processing device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the voice processing method provided in this embodiment can be applied in the application environment as shown in FIG. 1, where the client communicates with the server through the network.
  • Clients include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with an independent server or a server cluster composed of multiple servers.
  • a voice processing method is provided, and the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • the audio buffer may refer to a memory used to temporarily record real-time environmental sound signals.
  • the storage capacity of the audio buffer can be set to exceed the duration of the longest designated keyword or key phrase. For example, if the designated keyword is 10 seconds long, the audio buffer can be sized to store more than 10 seconds of real-time environmental sound signal.
  • the real-time environmental sound signal refers to the sound signal recorded in the current environment. Because the storage space of the audio buffer is limited, the buffer only holds the most recent period of real-time environmental sound signal (whose length is the upper limit of the buffer's storage space).
  • the voice wake-up processing module can be used to detect whether the real-time environmental sound signal contains specified keywords.
  • the voice wake-up processing module can include a corresponding voice recognition model.
  • the voice wake-up processing module can be based on existing keyword positioning technology, such as Microsoft Cortana WoV wake up processing unit.
  • the specified keywords can be set independently by the user or based on the preset in the system.
  • the voice wake-up processing module can detect whether the real-time environmental sound signal in the audio buffer contains designated keywords and determine the detection result. For example, if the designated keyword is set to "Hello, computer" and the real-time environmental sound signal contains "Hello, computer", the voice wake-up processing module can detect the "Hello, computer" contained in the signal and determine that the real-time environmental sound signal contains the specified keyword.
  • a matching degree threshold is set in advance, and the matching degree between the real-time environmental sound signal and the specified keyword is calculated. (The standard voice of the specified keyword can be generated first and acoustic features, such as sound energy and waveform, extracted from it; the same acoustic features are then extracted from the real-time environmental sound signal, and the matching degree between the two sets of features is calculated.) If the calculated matching degree is not less than the matching degree threshold, it is determined that the real-time environmental sound signal contains the specified keyword; if the calculated matching degree is less than the threshold, it is determined that it does not.
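The threshold comparison described above can be sketched as follows. This is a minimal illustration only: the frame-energy features, frame length, cosine similarity, and threshold value are assumptions for the sketch, not the acoustic features or matching computation the application actually uses.

```python
import numpy as np

def frame_energy(signal, frame_len=160):
    """Split a 1-D signal into fixed-length frames and return per-frame energy."""
    n = len(signal) // frame_len
    frames = np.asarray(signal, dtype=float)[:n * frame_len].reshape(n, frame_len)
    return np.sum(frames ** 2, axis=1)

def matching_degree(template_feat, observed_feat):
    """Cosine similarity between two feature vectors (truncated to equal length)."""
    m = min(len(template_feat), len(observed_feat))
    a, b = template_feat[:m], observed_feat[:m]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def contains_keyword(template_signal, live_signal, threshold=0.8):
    """True when the matching degree is not less than the preset threshold."""
    score = matching_degree(frame_energy(template_signal),
                            frame_energy(live_signal))
    return score >= threshold
```

In use, `template_signal` would be the standard voice generated for the designated keyword and `live_signal` a slice of the buffered real-time environmental sound signal.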
  • the matching degree is used to characterize the similarity between the standard speech generated by the specified keyword and the real-time environmental sound signal.
  • if it is detected that the real-time environmental sound signal contains the designated keyword, the real-time environmental sound signal is recognized through a voice recognition model to obtain a user spoken instruction.
  • the voice recognition model preset in the voice recognition module can be used to recognize real-time environmental sound signals and obtain the user's spoken instructions.
  • the voice recognition module may be a voice processing module independent of the voice wake-up processing module, for example, it may be a voice processing module based on ASR (Automatic Speech Recognition) technology.
  • the voice wake-up processing module can be embedded or connected with a trigger, and the trigger is connected with the voice recognition module. When the voice wake-up processing module detects that the real-time environmental sound signal contains specified keywords, the trigger is activated, and the trigger sends out a wake-up signal to wake up the voice recognition module.
  • after the voice recognition module wakes up, it switches from the sleep or standby state to the active state and recognizes the real-time environmental sound signal buffered in the audio buffer through its preset voice recognition model.
  • the voice recognition module can recognize the real-time environmental sound signal that follows the designated keyword in the audio buffer, and convert it into the user's spoken instruction. For example, the user speaks: "Hello computer, please turn on the light in the kitchen." Since "Hello computer" is the designated keyword, the voice recognition module recognizes the buffered signal that follows it and obtains the user's spoken instruction "Please turn on the kitchen light".
  • the speech recognition model can be built internally, or it can use external computing resources. If an external voice recognition model is used, the real-time environmental sound signal to be recognized can be sent to the voice recognition model through a dedicated interface, and the recognition result (ie, user spoken instruction) fed back by the voice recognition model can be obtained. If the speech recognition model is a self-built model, a large number of speech samples can be obtained (for example, open source data from a public network can be used), and then the speech samples can be input into a preset neural network model for training.
  • the neural network model here can be a statistical language model based on Markov algorithm, N-gram algorithm or recurrent neural network. After the training is completed, the trained model is tested using the test sample, and if the test passes, the trained model can be used as the speech recognition model of this embodiment.
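As a rough illustration of the self-built statistical language model mentioned above, here is a minimal bigram (N-gram with N = 2) model with add-one smoothing. The class name, the smoothing choice, and the whitespace tokenization are assumptions for this sketch, not the application's actual model or training procedure.

```python
from collections import defaultdict

class BigramModel:
    """A minimal bigram statistical language model with add-one smoothing."""

    def __init__(self):
        self.bigrams = defaultdict(int)   # counts of (previous, current) pairs
        self.unigrams = defaultdict(int)  # counts of the previous token
        self.vocab = set()

    def train(self, sentences):
        """Count bigrams over whitespace-tokenized training sentences."""
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for prev, curr in zip(tokens, tokens[1:]):
                self.bigrams[(prev, curr)] += 1
                self.unigrams[prev] += 1
                self.vocab.update((prev, curr))

    def prob(self, prev, curr):
        """Smoothed P(curr | prev); unseen pairs get nonzero mass."""
        v = len(self.vocab)
        return (self.bigrams[(prev, curr)] + 1) / (self.unigrams[prev] + v)

    def score(self, sentence):
        """Probability the model assigns to a whole sentence."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, curr in zip(tokens, tokens[1:]):
            p *= self.prob(prev, curr)
        return p
```

A model trained on in-order command sentences assigns higher probability to well-formed commands than to scrambled ones, which is the property a recognizer's language model exploits.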
  • if the voice wake-up processing module detects that the real-time environmental sound signal in the audio buffer does not contain the designated keyword, it continues to monitor changes in the real-time environmental sound signal in the audio buffer.
  • a natural language understanding module can be used to convert user spoken instructions into machine logic instructions.
  • the natural language understanding module can generate machine logic instructions based on user spoken instructions. Since the machine cannot directly recognize the user's spoken instruction, a natural language understanding module is required to extract the information in the spoken instruction and generate machine logic instructions the machine can recognize. For example, if the user's spoken instruction is "please turn on the kitchen light", the natural language understanding module can extract the key information "turn on", "kitchen", and "light", and generate the corresponding control instruction (that is, the machine logic instruction) to turn on the kitchen light.
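The key-information extraction step might be sketched as a rule-based slot filler like the one below. The vocabularies and the dictionary-shaped machine logic instruction are hypothetical stand-ins; a real natural language understanding module would likely use a trained model and a device-specific instruction format.

```python
# Hypothetical vocabularies for the sketch; not the application's actual lexicon.
ACTIONS = {"turn on": "ON", "turn off": "OFF"}
LOCATIONS = {"kitchen", "bedroom", "living room"}
DEVICES = {"light", "fan", "radio"}

def to_machine_instruction(spoken):
    """Extract action/location/device key information from a spoken instruction
    and build a machine-recognizable instruction (here, a plain dict)."""
    text = spoken.lower()
    action = next((code for phrase, code in ACTIONS.items() if phrase in text), None)
    location = next((loc for loc in LOCATIONS if loc in text), None)
    device = next((dev for dev in DEVICES if dev in text), None)
    if action is None or device is None:
        return None  # not enough key information could be extracted
    return {"action": action, "location": location, "device": device}
```

For the example in the text, "Please turn on the kitchen light" would yield an instruction carrying the action ON, the location kitchen, and the device light.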
  • the execution device can be a controlled device connected to a voice processing device, such as a household device, a smart car, etc.
  • the execution device can execute the corresponding operation according to the machine logic instruction.
  • when the kitchen lamp receives the turn-on instruction sent by the natural language understanding module, it responds to the instruction and completes the turn-on operation.
  • the execution device may also be a non-physical device, such as a music player or radio on a mobile phone or other device.
  • the real-time environmental sound signal is buffered through the audio buffer to obtain real-time sound information in the environment (that is, the above-mentioned real-time environmental sound signal). Whether the real-time environmental sound signal contains a designated keyword is then detected, so that keyword detection determines whether to wake up the voice processing device. If the designated keyword is detected, the real-time environmental sound signal is recognized through a voice recognition model to quickly wake up the device, while the buffered real-time environmental sound signal is processed to obtain the user's spoken instruction.
  • the user spoken instructions are converted into machine logic instructions, so as to convert the user spoken instructions into machine-recognizable instructions.
  • the machine logic instruction is sent to the execution device, so that the execution device executes the machine logic instruction to complete the operation required by the user's spoken instruction.
  • the buffering of real-time environmental sound signals through the audio buffer includes:
  • the collection module can be used to collect environmental sounds.
  • the sound collection module can be an audio capture device such as a microphone or a microphone array.
  • the sound collection module can record the sound in its environment (which may include the user's voice), and convert the sound in the environment into a real-time environmental sound signal.
  • the audio buffer can be configured to store real-time environmental sound signals provided by the sound collection module.
  • the real-time environmental sound signal may include user speech segments (or audio features extracted from these user speech segments) when the user speaks.
  • the audio buffer may be a circular buffer or a ring buffer.
  • the audio buffer stores real-time environmental sound signals in a circular buffering manner; that is, the oldest real-time environmental sound signals are overwritten by the newest ones.
  • in steps S101-S102, environmental sound is collected and the real-time environmental sound signal is generated, yielding the initial sound-signal data.
  • the real-time environmental sound signal is stored in the audio buffer in a circular buffering manner, so that the sound signal collected in real time is buffered in the audio buffer.
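The circular buffering behavior described above can be illustrated with a minimal ring buffer; the class and method names are invented for this sketch, and a real implementation would store fixed-size audio blocks rather than individual floats.

```python
class RingAudioBuffer:
    """Fixed-capacity circular buffer: oldest samples are overwritten first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0.0] * capacity
        self.write_pos = 0   # next slot to overwrite
        self.filled = 0      # how many valid samples are stored

    def write(self, samples):
        """Append samples, wrapping around and overwriting the oldest data."""
        for s in samples:
            self.data[self.write_pos] = s
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.filled = min(self.filled + 1, self.capacity)

    def read_all(self):
        """Return the buffered samples in chronological (oldest-first) order."""
        if self.filled < self.capacity:
            return self.data[:self.filled]
        return self.data[self.write_pos:] + self.data[:self.write_pos]
```

Writing more samples than the capacity simply drops the oldest ones, which is exactly the "newest overwrites oldest" property the embodiment relies on.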
  • before detecting whether the real-time environmental sound signal contains a designated keyword, the method further includes:
  • keyword setting information refers to information input by the user to set the specified keyword. For example, if the user intends to use "Hello Computer" as the designated keyword, the keyword setting information can be entered by voice (for example, by saying "Hello Computer" in the keyword setting program, where it is collected by the voice collection module), or by typing the text "Hello Computer" (for example, using a smartphone connected to the voice processing device on which a control application is installed, where the user enters the keyword setting information "Hello Computer" in the application).
  • the preset specification is used to determine whether the keyword setting information is suitable as the designated keyword of the voice processing device.
  • the preset specification may define some illegal characters.
  • the illegal character may be punctuation marks.
  • the keyword setting information includes punctuation marks, the keyword setting information does not conform to the preset specification.
  • the preset specification can also stipulate that illegal or uncivilized words and sentences cannot be used as designated keywords. For example, if the keyword setting information contains words such as "fuck" or "fascism", it likewise does not meet the preset specification.
  • when the keyword setting information is input by voice and the user's voice cannot be recognized normally (for example, the user makes a cry imitating an animal), it can also be determined that the keyword setting information does not meet the preset specification.
  • the user can be reminded that the currently input keyword information is not available, and the keyword setting information needs to be re-entered.
  • if the keyword setting information meets the preset specification, it is determined that the keyword setting information is the designated keyword.
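A minimal sketch of the preset-specification check might look like this; the specific punctuation set, the deny-list contents, and the treatment of empty input are illustrative assumptions, not the application's actual rules.

```python
import string

# Illustrative deny-list; the application only gives examples of banned words.
BANNED_WORDS = {"fascism"}
# Treat ASCII punctuation plus a few common full-width marks as illegal characters.
PUNCTUATION = set(string.punctuation + "，。！？、")

def validate_keyword_setting(info):
    """Return True if the keyword setting information meets the preset specification."""
    if not info or not info.strip():
        return False  # unrecognizable or empty input is rejected
    if any(ch in PUNCTUATION for ch in info):
        return False  # illegal characters: punctuation marks
    if any(word in info.lower() for word in BANNED_WORDS):
        return False  # illegal or uncivilized words
    return True
```

When validation fails, the device would remind the user that the input is unavailable and prompt for re-entry, as described above.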
  • the keyword setting information input by the user is obtained to obtain a keyword for waking up the device. It is determined whether the keyword setting information meets the preset specification, so as to ensure that the keywords set in the keyword setting information are available or applicable. If the keyword setting information meets the preset specification, it is determined that the keyword setting information is the designated keyword to complete the keyword setting.
  • recognizing the real-time environmental sound signal to obtain a user's spoken instruction includes:
  • S301: Generate a wake-up instruction when it is detected that the real-time environmental sound signal contains the keyword voice.
  • a trigger can be set to respond to the detection result of the keyword.
  • when the voice wake-up processing module detects that the real-time environmental sound signal contains the keyword, it can generate a wake-up signal (that is, the wake-up instruction) through a trigger embedded in or connected to the module, and send the wake-up signal to the voice recognition module.
  • upon waking, the voice recognition module can switch from a low-power idle state to a high-power recognition state; at this time, the voice wake-up processing module is idle.
  • the voice recognition module can monitor the end of the user sentence in the real-time environmental sound signal to determine the real-time environmental sound signal that needs to be processed.
  • the end point of the user sentence can be determined based on the preset duration range and the energy change of the real-time environmental sound signal.
  • the preset duration range can be defined as 3-10 seconds
  • the energy threshold is the average value of the background noise of the current environment.
  • the real-time environmental sound signal to be processed may include the initial segment buffered by the audio buffer (that is, the real-time environmental sound signal containing the designated keyword) and one or more additionally received segments of audio signal that follow it.
  • the additional receiving segment includes further speech from the user.
  • the designated keyword can be recognized by the voice wake-up processing module and the voice recognition module at the same time.
  • the voice recognition module can also set an end point to stop voice recognition. For example, if no voice activity is detected within a specified period of time, the voice recognition module switches from the high-power recognition state to the low-power idle state.
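The endpoint logic based on the preset duration range (3-10 seconds) and the background-noise energy average could be sketched as follows; the frame length and the number of consecutive quiet frames required are assumed values for illustration.

```python
def find_sentence_end(frame_energies, noise_floor, frame_ms=100,
                      min_ms=3000, max_ms=10000, silence_frames=5):
    """Return the index of the frame where the user sentence ends, or None.

    The end point is declared once `silence_frames` consecutive frames fall
    at or below the background-noise energy average (`noise_floor`), inside
    the preset 3-10 second duration window; past the upper bound we cut hard.
    """
    quiet = 0
    for i, energy in enumerate(frame_energies):
        t_ms = (i + 1) * frame_ms
        if t_ms > max_ms:
            return i  # hard cut at the upper bound of the duration range
        quiet = quiet + 1 if energy <= noise_floor else 0
        if t_ms >= min_ms and quiet >= silence_frames:
            return i
    return None  # no end point detected yet; keep monitoring
```

Everything buffered before the returned frame index would then be handed to the speech recognition model as the signal containing the user's spoken instruction.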
  • in steps S301-S303, when it is detected that the real-time environmental sound signal contains the keyword voice, a wake-up instruction is generated so that the user's spoken instruction can be responded to in time.
  • the end of the user sentence in the real-time environmental sound signal is monitored according to the wake-up instruction, to ensure that the acquired user spoken instruction is complete. If the end of the user sentence is detected, the real-time environmental sound signal before the end of the user sentence is recognized and converted into the user spoken instruction, yielding the spoken instruction that needs to be processed.
  • all the acquired sound signals in the environment may be detected first to determine whether each sound signal meets the preset sound source requirements.
  • all the sound signals in the acquired environment can be separated to obtain multiple independent sound signals; for example, ManyEars technology can be used for the separation.
  • the target sound source that meets the requirements of the preset sound source refers to a sound signal whose duration in the preset volume range is greater than the preset duration.
  • the preset volume range can be set according to requirements, with minimum and maximum values. A signal exceeding the maximum of the volume range is regarded as noise and excluded from the candidate target sound sources; a signal below the minimum can be considered not to come from the tracking object in the current environment. Understandably, the preset volume range and preset duration can be set according to the current environment.
  • the sound signal in the current environment is continuously acquired for detection at this time.
  • an identification mark may be added to the target sound source.
  • different identification marks may be added to each target sound source, for example, it may be marked as a first target sound source, a second target sound source, and so on.
  • the sound information belonging to the target sound source can be located by the sound source localization operation in the ManyEars technology to determine the specific real-time position of the target sound source.
  • the sound collection device may be a microphone array, and the sound source position of the target sound source can be calculated according to the slight difference in the timing of the collected sound signals.
  • the sound source location can include direction and distance.
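For a two-microphone pair within an array, the direction component of the sound source position can be estimated from the slight timing difference mentioned above. The far-field formula below is a standard textbook approximation, not necessarily the computation used by ManyEars or the application.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def direction_from_tdoa(delay_s, mic_spacing_m):
    """Estimate the arrival angle in degrees from the time difference of
    arrival between two microphones, using the far-field approximation
    sin(theta) = c * dt / d."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp numerical noise into asin's domain
    return math.degrees(math.asin(ratio))
```

A zero delay means the source lies on the perpendicular bisector of the pair (0 degrees); a delay equal to the spacing divided by the speed of sound means the source lies along the microphone axis (90 degrees). Distance estimation would additionally require more microphone pairs or near-field modeling.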
  • in steps S11-S13, all sound signals in the current environment are detected, and it is determined whether any of them is a target sound source that meets the preset sound source requirements, so as to establish the existence of the target sound source.
  • an identification mark is added to the target sound source to distinguish possible different target sound sources.
  • the target sound source is localized through a sound source localization operation, and the sound source position of the target sound source is obtained.
  • the sound source position is associated with the identification mark to determine the position corresponding to the target sound source (that is, the sound source position).
  • the recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction includes:
  • the real-time environmental sound signal may be optimized according to the calculated sound source position.
  • the tuning parameters include, but are not limited to, volume gain, specific noise characteristic parameters, and reverberation echo characteristic parameters.
  • the tuning parameters vary depending on the environment, and are also affected by the placement of the sound signal collection equipment.
  • the tuning parameters can be obtained by autonomous learning based on previously collected voice data (for example, an unsupervised learning algorithm can be used to process the collected voice data by itself).
  • the tuning parameters can be used to optimize the real-time environmental sound signal to generate an optimized sound signal that is more conducive to the recognition of the speech recognition model.
  • the optimized sound signal is processed by the speech recognition model to obtain the required user spoken instruction. Because the optimized signal is of higher quality, the resulting user spoken instruction is also more accurate. In some environments, optimizing the sound signal can effectively remove the environmental noise and reverberation in the original real-time environmental sound signal, greatly improving recognition accuracy and reducing the number of times the user must repeat a spoken instruction.
  • the adjustment parameters matching the sound source position are obtained to further optimize the real-time environmental sound signal.
  • the real-time environmental sound signal is processed according to the adjustment parameters to generate an optimized sound signal to obtain a sound signal more suitable for processing by a speech recognition model.
  • the speech recognition model is used to process the optimized sound signal to obtain the user's spoken instruction to recognize the user's spoken instruction.
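A minimal sketch of applying tuning parameters, here just a volume gain plus spectral noise subtraction against a previously learned noise profile. The parameter set and the FFT-based method are assumptions for illustration; the application also names reverberation-echo parameters, which this sketch omits.

```python
import numpy as np

def optimize_sound_signal(signal, volume_gain=1.0, noise_profile=None):
    """Apply simple tuning parameters to a real-time environmental sound signal:
    a volume gain, and optional magnitude-spectrum noise subtraction against a
    learned noise profile (one magnitude per rfft bin)."""
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    if noise_profile is not None:
        magnitude = np.abs(spectrum)
        cleaned = np.maximum(magnitude - noise_profile, 0.0)
        # keep the original phase, replace only the magnitude
        spectrum = cleaned * np.exp(1j * np.angle(spectrum))
    return volume_gain * np.fft.irfft(spectrum, n=len(signal))
```

With `noise_profile=None` the function reduces to a pure gain stage; in practice the profile would come from the autonomous learning over previously collected voice data described above.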
  • a voice processing device is provided, and the voice processing device corresponds to the voice processing method in the foregoing embodiment one-to-one.
  • the voice processing device includes a buffer module 10, a detection module 20, an identification module 30, an instruction conversion module 40 and an execution module 50.
  • the detailed description of each functional module is as follows:
  • the buffer module 10 is used to buffer real-time environmental sound signals through the audio buffer
  • the detection module 20 is configured to detect whether the real-time environmental sound signal contains designated keywords
  • the recognition module 30 is configured to, if it is detected that the real-time environmental sound signal contains the designated keyword, recognize the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
  • the instruction conversion module 40 is used to convert the user spoken instructions into machine logic instructions
  • the execution module 50 is configured to send the machine logic instruction to the execution device, so that the execution device executes the machine logic instruction.
  • the buffer module 10 includes:
  • the collecting unit 101 is configured to collect environmental sound and generate the real-time environmental sound signal
  • the storage unit 102 is configured to store the real-time environmental sound signal in the audio buffer in a circular buffering manner.
  • the voice processing device further includes a setting module, and the setting module includes:
  • the acquisition setting information unit is used to acquire the keyword setting information input by the user
  • the standard judgment unit is used to judge whether the keyword setting information meets the preset standard
  • the keyword determining unit is configured to determine that the keyword setting information is the designated keyword if the keyword setting information meets the preset specification.
  • the identification module 30 includes:
  • the wake-up unit is configured to generate a wake-up instruction when it is detected that the real-time environmental sound signal contains a keyword voice;
  • the sentence end point detection unit is configured to monitor the user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
  • the spoken instruction conversion unit is configured to, if the end of the user sentence in the real-time environmental sound signal is detected, recognize the real-time environmental sound signal before the end of the user sentence and convert it into the user spoken instruction.
  • the voice processing device further includes a positioning module, and the positioning module includes:
  • the target sound source judging unit is used to detect all sound signals in the current environment and determine whether there is a target sound source that meets the preset sound source requirements in all the sound signals;
  • the adding identification unit is used to add an identification mark to the target sound source when there is a target sound source that meets the requirements of the preset sound source;
  • the sound source location determining unit is configured to locate the target sound source through a sound source localization operation to obtain the sound source position of the target sound source, and the sound source position is associated with the identification identifier.
  • the identification module 30 includes:
  • An acquiring parameter unit configured to acquire a tuning parameter matching the position of the sound source
  • a sound optimization unit configured to process the real-time environmental sound signal according to the adjustment parameter to generate an optimized sound signal
  • the voice recognition unit is configured to use the voice recognition model to process the optimized sound signal to obtain the user spoken instruction.
  • Each module in the above-mentioned voice processing device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The foregoing modules may be embedded in hardware form in, or be independent of, the processor of the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
  • In one embodiment, a computer device is provided.
  • The computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium.
  • The database of the computer device is used to store the data involved in the above voice processing method.
  • The network interface of the computer device is used to communicate with an external terminal through a network connection.
  • The computer-readable instructions, when executed by the processor, implement a voice processing method.
  • The readable storage media provided in this embodiment include a non-volatile readable storage medium and a volatile readable storage medium.
  • In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When the processor executes the computer-readable instructions, the following steps are implemented:
  • if the real-time environmental sound signal is detected to contain the designated keyword, the real-time environmental sound signal is recognized through a voice recognition model to obtain a user spoken instruction;
  • the machine logic instruction is sent to the execution device, so that the execution device executes the machine logic instruction.
  • In one embodiment, a computer-readable storage medium is provided, including a non-volatile readable storage medium and a volatile readable storage medium.
  • The readable storage medium stores computer-readable instructions which, when executed by a processor, implement the following steps:
  • if the real-time environmental sound signal is detected to contain the designated keyword, the real-time environmental sound signal is recognized through a voice recognition model to obtain a user spoken instruction;
  • the machine logic instruction is sent to the execution device, so that the execution device executes the machine logic instruction.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.


Abstract

A voice processing method, apparatus, computer device, and storage medium. The method includes: buffering a real-time environmental sound signal through an audio buffer (S10); detecting whether the real-time environmental sound signal contains a designated keyword (S20); if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction (S30); converting the user spoken instruction into a machine logic instruction (S40); and sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction (S50). The voice processing method overcomes the asynchrony between wake-up and speech recognition in the prior art, recognizes the user's voice instructions in real time, and improves the user experience.

Description

Voice processing method, apparatus, computer device, and storage medium
This application is based on, and claims priority to, Chinese invention application No. 201910390372.2, filed on May 10, 2019 and entitled "Voice processing method, apparatus, computer device and storage medium".
Technical Field
This application relates to the field of voice processing, and in particular to a voice processing method, apparatus, computer device, and storage medium.
Background
Some existing speech recognition systems are activated by voice. Such voice interaction systems typically rely on recognizing keywords in the user's speech. For example, a smart speaker with a voice interaction function may have the wake-up keyword "Hello". When a user says "Hello" near the smart speaker, the speaker's speech recognition module detects the "Hello" utterance in keyword monitoring mode, then switches its working mode (from keyword monitoring mode to speech recognition mode) and listens for the user's voice instructions (which may, for example, command the smart speaker to play music or broadcast news).
However, in the existing speech recognition process there is a time gap between keyword detection and voice instruction recognition (switching working modes takes time). As a result, when a user speaks the wake-up keyword and a voice instruction continuously, the voice instruction cannot be recognized correctly (because the speech recognition mode has not yet been enabled). Although, during this gap, a short chime can be played or some visual feedback produced to notify the user that the device has finished loading and is ready for the next voice instruction, the pause created by the gap interrupts the natural flow of speech and negatively affects the quality of the user experience.
Summary
Based on this, it is necessary to provide, for the above technical problem, a voice processing method, apparatus, computer device, and storage medium, to overcome the poor user experience caused by the asynchrony between wake-up and speech recognition in the prior art.
A voice processing method, including:
buffering a real-time environmental sound signal through an audio buffer;
detecting whether the real-time environmental sound signal contains a designated keyword;
if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
converting the user spoken instruction into a machine logic instruction;
sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
A voice processing apparatus, including:
a buffering module, configured to buffer a real-time environmental sound signal through an audio buffer;
a detection module, configured to detect whether the real-time environmental sound signal contains a designated keyword;
a recognition module, configured to recognize the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction if the real-time environmental sound signal is detected to contain the designated keyword;
an instruction conversion module, configured to convert the user spoken instruction into a machine logic instruction;
an execution module, configured to send the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
buffering a real-time environmental sound signal through an audio buffer;
detecting whether the real-time environmental sound signal contains a designated keyword;
if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
converting the user spoken instruction into a machine logic instruction;
sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
buffering a real-time environmental sound signal through an audio buffer;
detecting whether the real-time environmental sound signal contains a designated keyword;
if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
converting the user spoken instruction into a machine logic instruction;
sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
Details of one or more embodiments of this application are set forth in the drawings and description below; other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the voice processing method in an embodiment of this application;
FIG. 2 is a schematic flowchart of the voice processing method in an embodiment of this application;
FIG. 3 is a schematic flowchart of the voice processing method in an embodiment of this application;
FIG. 4 is a schematic flowchart of the voice processing method in an embodiment of this application;
FIG. 5 is a schematic flowchart of the voice processing method in an embodiment of this application;
FIG. 6 is a schematic flowchart of the voice processing method in an embodiment of this application;
FIG. 7 is a schematic flowchart of the voice processing method in an embodiment of this application;
FIG. 8 is a schematic structural diagram of the voice processing apparatus in an embodiment of this application;
FIG. 9 is a schematic structural diagram of the voice processing apparatus in an embodiment of this application;
FIG. 10 is a schematic diagram of the computer device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The voice processing method provided in this embodiment can be applied in the application environment shown in FIG. 1, in which a client communicates with a server through a network. The client includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a voice processing method is provided. Taking its application on the server in FIG. 1 as an example, the method includes the following steps:
S10: buffer a real-time environmental sound signal through an audio buffer.
In this embodiment, the audio buffer can be a memory used to temporarily record the real-time environmental sound signal. Its capacity can be set to exceed the duration of the longest designated keyword or key phrase. For example, if the designated keyword lasts 10 seconds, the audio buffer can be sized to store more than 10 seconds of the real-time environmental sound signal. The real-time environmental sound signal refers to the sound signal recorded in the current environment. Because the buffer capacity is limited, only the most recent segment of the real-time environmental sound signal (whose length is bounded by the buffer capacity) is kept in the buffer.
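The fixed-capacity, overwrite-oldest behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the class name, capacity, and sample values are hypothetical:

```python
from collections import deque

class AudioRingBuffer:
    """Fixed-capacity audio buffer: the oldest samples are overwritten by newer ones."""

    def __init__(self, capacity_samples):
        # A deque with maxlen drops the oldest items automatically on overflow.
        self.buf = deque(maxlen=capacity_samples)

    def write(self, samples):
        self.buf.extend(samples)

    def read_all(self):
        # Returns at most `capacity_samples` of the most recent samples, oldest first.
        return list(self.buf)

# Sized to exceed the longest wake phrase: e.g. 12 s at 16 kHz would be
# 12 * 16000 samples; a tiny capacity of 5 keeps the example readable.
rb = AudioRingBuffer(capacity_samples=5)
rb.write([1, 2, 3])
rb.write([4, 5, 6, 7])    # samples 1 and 2 are overwritten
print(rb.read_all())      # → [3, 4, 5, 6, 7]
```

Only the most recent window of sound survives, which is exactly what lets the recognizer later reach back and read the keyword that was spoken before wake-up completed.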
S20: detect whether the real-time environmental sound signal contains a designated keyword.
A voice wake-up processing module can be used to detect whether the real-time environmental sound signal contains the designated keyword. The voice wake-up processing module can contain a corresponding speech recognition model, and can be based on existing keyword spotting technology, such as the WoV (Wake on Voice) wake-up processing unit of Microsoft Cortana. The designated keyword can be set by the user or preset in the system.
The voice wake-up processing module can detect whether the real-time environmental sound signal in the audio buffer contains the designated keyword and determine the detection result. For example, if the designated keyword is set to "Hello, computer" and the real-time environmental sound signal contains "Hello, computer", the module detects it and determines that the signal contains the designated keyword. When judging whether the real-time environmental sound signal contains the designated keyword, a matching degree threshold is preset and the matching degree between the real-time environmental sound signal and the designated keyword is calculated (a standard utterance of the designated keyword can first be generated, acoustic features such as sound energy and waveform extracted from the standard utterance, the same acoustic features extracted from the real-time environmental sound signal, and the matching degree between the two sets of features computed). If the calculated matching degree is not less than the threshold, the real-time environmental sound signal is judged to contain the designated keyword; if it is less than the threshold, the signal is judged not to contain it. Here, the matching degree characterizes the similarity between the standard utterance generated from the designated keyword and the real-time environmental sound signal.
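The threshold comparison above can be sketched as follows, assuming the acoustic features have already been reduced to fixed-length vectors. The cosine-similarity scoring, the 0.9 threshold, and all names are illustrative assumptions, not the patent's method:

```python
import math

def matching_degree(feat_a, feat_b):
    """Cosine similarity between two acoustic-feature vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm = math.sqrt(sum(a * a for a in feat_a)) * math.sqrt(sum(b * b for b in feat_b))
    return dot / norm if norm else 0.0

def contains_keyword(signal_feat, keyword_feat, threshold=0.9):
    """Judge that the signal contains the keyword iff matching degree >= threshold."""
    return matching_degree(signal_feat, keyword_feat) >= threshold

keyword_feat = [0.9, 0.1, 0.4]    # features of the standard keyword utterance
close = [0.88, 0.12, 0.41]        # live signal resembling the keyword
far = [0.1, 0.9, 0.0]             # unrelated ambient sound
print(contains_keyword(close, keyword_feat))  # → True
print(contains_keyword(far, keyword_feat))    # → False
```

A production keyword spotter would score frame sequences with a trained model rather than a single vector comparison, but the accept/reject decision against a preset threshold has the same shape.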
S30: if the real-time environmental sound signal is detected to contain the designated keyword, recognize the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction.
A speech recognition model preset in a speech recognition module can recognize the real-time environmental sound signal and obtain the user's spoken instruction. The speech recognition module can be a voice processing module independent of the voice wake-up processing module, for example a module based on ASR (Automatic Speech Recognition) technology. The voice wake-up processing module can embed or connect to a trigger, which is connected to the speech recognition module. When the voice wake-up processing module detects that the real-time environmental sound signal contains the designated keyword, the trigger is activated and sends a wake-up signal that wakes the speech recognition module. After waking, the speech recognition module switches from a sleep or standby state to an active state and recognizes the real-time environmental sound signal buffered in the audio buffer through its preset speech recognition model. At this point, the speech recognition module can recognize the buffered real-time environmental sound signal following the designated keyword and convert it into the user spoken instruction. For example, the user says: "Hello computer, please turn on the kitchen light." Since "Hello computer" is the designated keyword, the speech recognition module recognizes the buffered signal after the keyword and obtains the user spoken instruction "please turn on the kitchen light".
The speech recognition model can be built in-house or use external computing resources. If an external speech recognition model is used, the real-time environmental sound signal to be recognized can be sent to it through a dedicated interface, and the recognition result fed back by the model (i.e., the user spoken instruction) retrieved. If the model is self-built, a large number of speech samples can be obtained (for example, open-source data from public networks) and input into a preset neural network model for training. The neural network model here can be a statistical language model based on the Markov algorithm, the N-gram algorithm, or a recurrent neural network. After training, the model is tested with test samples; if the test passes, the trained model can be used as the speech recognition model of this embodiment.
Note that if the voice wake-up processing module detects that the real-time environmental sound signal in the audio buffer does not contain the designated keyword, it continues to monitor changes in the buffered signal.
S40: convert the user spoken instruction into a machine logic instruction.
In this embodiment, a natural language understanding module can convert the user spoken instruction into a machine logic instruction. Since a machine cannot interpret the user spoken instruction directly, the natural language understanding module extracts the information in the instruction and generates a machine logic instruction that the machine can recognize. For example, from the user spoken instruction "please turn on the kitchen light", the natural language understanding module can extract the key information "turn on", "kitchen", and "light", and generate the corresponding control instruction for turning on the kitchen light (i.e., the machine logic instruction).
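The key-information extraction step can be illustrated with a toy rule-based parser. A real natural language understanding module would be far richer; every vocabulary entry and name below is a made-up example, not the patent's design:

```python
ACTIONS = {"turn on": "ON", "turn off": "OFF"}   # hypothetical action lexicon
LOCATIONS = {"kitchen", "bedroom"}
DEVICES = {"light", "fan"}

def to_machine_instruction(spoken):
    """Extract (action, location, device) key information and emit a command dict."""
    text = spoken.lower()
    action = next((code for phrase, code in ACTIONS.items() if phrase in text), None)
    location = next((loc for loc in LOCATIONS if loc in text), None)
    device = next((dev for dev in DEVICES if dev in text), None)
    if not (action and device):
        raise ValueError("could not understand instruction: " + spoken)
    return {"device": device, "location": location, "command": action}

print(to_machine_instruction("Please turn on the kitchen light"))
# → {'device': 'light', 'location': 'kitchen', 'command': 'ON'}
```

The resulting dict is the kind of machine-recognizable structure an execution device can act on directly, in contrast to the free-form spoken sentence.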
S50: send the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
The execution device can be a controlled device connected to the voice processing apparatus, such as a household appliance or a smart car. When the execution device receives the machine logic instruction sent by the voice processing apparatus, it can perform the corresponding operation according to the instruction. For example, when the kitchen light receives the turn-on instruction sent by the natural language understanding module, it responds to the instruction and completes the turn-on operation. In some cases, the execution device can also be non-physical, such as a music player or radio application on a mobile phone or other device.
In steps S10-S50, the real-time environmental sound signal is buffered through the audio buffer to capture sound information in the environment in real time (i.e., the above real-time environmental sound signal). Whether the real-time environmental sound signal contains the designated keyword is detected, using keyword detection to decide whether to wake the voice processing apparatus. If the signal is detected to contain the designated keyword, it is recognized through the voice recognition model to obtain the user spoken instruction, so that the device wakes quickly while the buffered real-time environmental sound signal is processed. The user spoken instruction is converted into a machine logic instruction, i.e., an instruction the machine can recognize. The machine logic instruction is sent to the execution device, so that the execution device executes it and completes the operation requested by the user spoken instruction.
Optionally, as shown in FIG. 3, buffering the real-time environmental sound signal through the audio buffer includes:
S101: collecting environmental sound and generating the real-time environmental sound signal;
S102: storing the real-time environmental sound signal in the audio buffer in a circular buffering manner.
In this embodiment, a sound collection module can be used to collect environmental sound. The sound collection module can be an audio capture device such as a microphone or a microphone array. It can record the sound in its environment (which can include the user's speech) and convert that sound into the real-time environmental sound signal.
The audio buffer can be configured to store the real-time environmental sound signal provided by the sound collection module. The real-time environmental sound signal can include user speech segments from when the user talks (or audio features extracted from those segments).
Specifically, the audio buffer can be a circular buffer or ring buffer. It stores the real-time environmental sound signal in a circular manner, i.e., the oldest real-time environmental sound signal is overwritten by newer signal.
In steps S101-S102, environmental sound is collected to generate the real-time environmental sound signal, yielding the initial sound data, and the signal is stored in the audio buffer in a circular manner so that the sound collected in real time is cached in the buffer.
Optionally, as shown in FIG. 4, before detecting whether the real-time environmental sound signal contains the designated keyword, the method further includes:
S21: obtaining keyword setting information input by the user;
S22: judging whether the keyword setting information conforms to a preset specification;
S23: if the keyword setting information conforms to the preset specification, determining the keyword setting information as the designated keyword.
In this embodiment, the keyword setting information input by the user can be obtained in multiple forms, either by voice input or by text input. Keyword setting information refers to the information the user inputs to set the designated keyword. For example, a user who intends to use "Hello computer" as the designated keyword can enter the keyword setting information by voice (saying "Hello computer" in the keyword setting program, where the sound collection module picks it up), or by typing the text "Hello computer" (for example, using a smartphone connected to the voice processing apparatus, on which an application controlling the apparatus is installed, and entering "Hello computer" in that application).
The preset specification is used to determine whether the keyword setting information is suitable as the designated keyword of the voice processing apparatus. For example, the preset specification can define illegal characters, such as punctuation marks; keyword setting information containing punctuation does not conform to the specification.
The preset specification can also stipulate that illegal or uncivil words cannot be used as the designated keyword. For example, keyword setting information containing words such as "fuck" or "fascist" does not conform to the preset specification.
In other cases, such as when the keyword setting information is input by voice and the user's utterance cannot be recognized normally (for example, the user imitates an animal call), the keyword setting information input by the user can also be judged not to conform to the preset specification.
When the keyword setting information does not conform to the preset specification, the user can be reminded that the currently input keyword information is unavailable and needs to be re-entered.
If the keyword setting information conforms to the preset specification, it is determined as the designated keyword.
In steps S21-S23, the keyword setting information input by the user is obtained to establish the keyword used to wake the device; whether it conforms to the preset specification is judged to ensure the keyword it sets is usable and suitable; and if it conforms, the keyword setting information is determined as the designated keyword, completing the keyword setup.
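The S22 specification check could be sketched as a pair of rules, one for illegal characters and one for a banned-word list. The concrete rule set and names below are assumptions for illustration only:

```python
import string

BANNED_WORDS = {"fuck", "fascist"}   # example deny-list, per the text above

def conforms_to_spec(keyword_setting):
    """Reject keyword settings containing punctuation or banned words; accept otherwise."""
    if not keyword_setting.strip():
        return False
    if any(ch in string.punctuation for ch in keyword_setting):
        return False
    words = keyword_setting.lower().split()
    if any(w in BANNED_WORDS for w in words):
        return False
    return True

print(conforms_to_spec("hello computer"))    # → True
print(conforms_to_spec("hello, computer"))   # → False (contains punctuation)
```

When the check fails, the setting program would prompt the user to re-enter the keyword, as described above.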
Optionally, as shown in FIG. 5, if the real-time environmental sound signal is detected to contain keyword speech, recognizing the real-time environmental sound signal to obtain the user spoken instruction includes:
S301: when the real-time environmental sound signal is detected to contain keyword speech, generating a wake-up instruction;
S302: monitoring the user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
S303: if the user sentence end point in the real-time environmental sound signal is detected, recognizing the real-time environmental sound signal before the end point and converting the real-time environmental sound signal before the end point into the user spoken instruction.
In this embodiment, a trigger can be set to respond to the keyword detection result. For example, when the voice wake-up processing module detects that the real-time environmental sound signal contains keyword speech, a trigger embedded in or connected to the module can generate a wake-up signal (i.e., the wake-up instruction) and send it to the speech recognition module. On receiving the wake-up signal, the speech recognition module can switch from a low-power idle state to a high-power recognition state, while the voice wake-up processing module becomes idle.
In the high-power recognition state, the speech recognition module can monitor the user sentence end point in the real-time environmental sound signal to determine which signal needs processing. The end point can be determined from a preset duration range and the energy variation of the signal. For example, the preset duration range can be defined as 3-10 seconds and the energy threshold as the average background noise of the current environment; when the detected real-time environmental sound signal falls below the energy threshold, the user is considered to have finished speaking (or paused), i.e., the user sentence end point in the signal is detected (this end point may not be the actual end of the user's speech).
In some embodiments, the real-time environmental sound signal to be processed can include the initial segment buffered by the audio buffer (i.e., the signal containing the designated keyword) and the audio of one or more additional received segments after it, where the additional segments contain further speech from the user. In other embodiments, the designated keyword can be recognized by both the voice wake-up processing module and the speech recognition module.
The speech recognition module can also set an end point for stopping speech recognition. For example, if no voice activity is detected within a specified duration, the module switches from the high-power recognition state back to the low-power idle state.
In steps S301-S303, the wake-up instruction is generated when keyword speech is detected in the real-time environmental sound signal, so that the user's spoken instruction is responded to promptly. The user sentence end point in the signal is monitored according to the wake-up instruction to ensure the obtained user spoken instruction is complete. If the end point is detected, the signal before it is recognized and converted into the user spoken instruction, yielding the user spoken instruction that needs to be processed.
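The energy-based end-pointing described above can be read as: a frame whose energy stays below the background-noise threshold for long enough marks the sentence end. A minimal sketch; the frame energies, threshold, and the three-frame run length are illustrative assumptions:

```python
def find_sentence_endpoint(frames, noise_floor, min_quiet_frames=3):
    """Return the index of the first frame of the first run of `min_quiet_frames`
    consecutive frames whose energy is below `noise_floor`, or None if no such run."""
    quiet_run = 0
    for i, energy in enumerate(frames):
        if energy < noise_floor:
            quiet_run += 1
            if quiet_run >= min_quiet_frames:
                return i - min_quiet_frames + 1   # first quiet frame of the run
        else:
            quiet_run = 0                          # speech resumed, reset the run
    return None

# Per-frame energies: speech, then sustained quiet below the noise floor.
frames = [0.8, 0.9, 0.7, 0.05, 0.04, 0.03, 0.02]
print(find_sentence_endpoint(frames, noise_floor=0.1))  # → 3
```

Requiring a short run of quiet frames, rather than a single one, is what distinguishes a sentence end from a mid-sentence pause; the preset 3-10 second window mentioned above would bound how long the module keeps listening for such a run.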
Optionally, as shown in FIG. 6, before buffering the real-time environmental sound signal through the audio buffer, the method includes:
S11: detecting all sound signals in the current environment, and judging whether a target sound source meeting preset sound source requirements exists among all the sound signals;
S12: when a target sound source meeting the preset sound source requirements exists, adding an identification mark to the target sound source;
S13: locating the target sound source through a sound source localization operation to obtain the sound source position of the target sound source, the sound source position being associated with the identification mark.
In this embodiment, before the audio buffer caches the real-time environmental sound signal, all sound signals obtained from the environment can first be detected to judge whether each meets the preset sound source requirements. Here, all the obtained sound signals can be separated into multiple independent sound signals, for example using the ManyEars technique.
Specifically, a target sound source meeting the preset sound source requirements is a sound signal whose duration within a preset volume range exceeds a preset duration. The preset volume range can be set as needed, with a minimum and a maximum. A signal exceeding the maximum of the range is regarded as noise and excluded from the set of target sound sources. A signal below the minimum can be regarded as not coming from the tracked object in the current environment. Understandably, the preset volume range and the preset duration can be set according to the current environment.
Further, when no target sound source meeting the preset requirements exists, sound signals from the current environment continue to be collected and detected.
When a target sound source meeting the preset requirements is determined to exist, an identification mark can be added to it. When multiple qualifying target sound sources exist in the sound signal, a different mark can be added to each, e.g., a first target sound source, a second target sound source, and so on.
Preferably, the sound source localization operation of the ManyEars technique can locate the sound information belonging to the target sound source and determine the specific real-time position of the source. Here, the sound collection device can be a microphone array, and the sound source position can be calculated from the small timing differences among the collected sound signals. The sound source position can include direction and distance.
In steps S11-S13, all sound signals in the current environment are detected and judged to establish the existence of a target sound source. When a qualifying target sound source exists, an identification mark is added to it to distinguish possibly coexisting target sound sources. The target sound source is located through the sound source localization operation to obtain its sound source position, which is associated with the identification mark, determining the position corresponding to the target sound source (i.e., the sound source position).
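The "small timing differences" mentioned above are time differences of arrival (TDOA). For a pair of microphones under a far-field assumption, the delay between them fixes the source's bearing relative to the array axis. A sketch, not the ManyEars algorithm; the speed of sound and mic spacing are illustrative values:

```python
import math

def bearing_from_tdoa(delay_s, mic_spacing_m, speed_of_sound=343.0):
    """Far-field bearing (radians from the two-mic array axis) from an arrival-time
    difference `delay_s` between the microphones."""
    # Path-length difference is c * delay; clamp the ratio to the physical range [-1, 1].
    ratio = max(-1.0, min(1.0, speed_of_sound * delay_s / mic_spacing_m))
    return math.acos(ratio)

# A source exactly broadside to the array arrives at both mics simultaneously:
print(round(math.degrees(bearing_from_tdoa(0.0, 0.2)), 1))          # → 90.0
# A source on the array axis gives the maximum delay, spacing / c:
print(round(math.degrees(bearing_from_tdoa(0.2 / 343.0, 0.2)), 1))  # → 0.0
```

With more than two microphones, several pairwise bearings can be intersected to recover both direction and distance, which is how an array yields the full sound source position described above.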
Optionally, recognizing the real-time environmental sound signal through the voice recognition model to obtain the user spoken instruction includes:
S304: obtaining tuning parameters matching the sound source position;
S305: processing the real-time environmental sound signal according to the tuning parameters to generate an optimized sound signal;
S306: processing the optimized sound signal using the voice recognition model to obtain the user spoken instruction.
In this embodiment, to improve the recognition rate of the real-time environmental signal, the signal can be optimized according to the calculated sound source position. Here, the tuning parameters include, but are not limited to, volume gain, specific noise characteristic parameters, and reverberation/echo characteristic parameters. The tuning parameters vary with the environment and are also affected by the placement of the sound collection device. In some cases, the tuning parameters can be obtained by autonomous learning from previously collected speech data (for example, an unsupervised learning algorithm can be used to process the collected speech data by itself).
After the tuning parameters matching the sound source position are obtained, they can be used to optimize the real-time environmental sound signal and generate an optimized sound signal that is easier for the speech recognition model to recognize.
Finally, the speech recognition model processes the optimized sound signal to obtain the required user spoken instruction. Because the optimized sound signal is of higher quality, the obtained user spoken instruction is also more accurate. In certain environments, the optimized signal can effectively remove the ambient noise and reverberation in the original real-time environmental signal, greatly improving the recognition accuracy of user spoken instructions and reducing the number of times the user must repeat them.
In steps S304-S306, tuning parameters matching the sound source position are obtained to further optimize the real-time environmental signal; the signal is processed according to the tuning parameters to generate an optimized sound signal better suited to the speech recognition model; and the speech recognition model processes the optimized signal to recognize the user's spoken instruction.
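A toy version of the S305 optimization step, applying a position-dependent volume gain and then a crude noise-floor gate. Both parameter values and the function name are illustrative assumptions; real tuning would also address the noise and reverberation characteristics mentioned above:

```python
def optimize_signal(samples, gain, noise_floor):
    """Scale samples by `gain` (e.g. to compensate for a distant source),
    then zero any sample still below `noise_floor` as residual background noise."""
    out = []
    for s in samples:
        amplified = s * gain
        out.append(amplified if abs(amplified) >= noise_floor else 0.0)
    return out

# Tuning parameters looked up for a distant source: boost volume 4x, then
# gate anything still below the environment's measured noise floor.
samples = [2, 50, -40, 1]
print(optimize_signal(samples, gain=4.0, noise_floor=50))
# → [0.0, 200.0, -160.0, 0.0]
```

The speech-bearing samples survive amplified while low-level background is suppressed, which is the sense in which the optimized signal is "more amenable" to the recognition model.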
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation of the embodiments of this application.
In an embodiment, a voice processing apparatus is provided, corresponding one-to-one to the voice processing method in the above embodiments. As shown in FIG. 8, the voice processing apparatus includes a buffering module 10, a detection module 20, a recognition module 30, an instruction conversion module 40, and an execution module 50. The functional modules are described in detail as follows:
the buffering module 10 is configured to buffer a real-time environmental sound signal through an audio buffer;
the detection module 20 is configured to detect whether the real-time environmental sound signal contains a designated keyword;
the recognition module 30 is configured to recognize the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction if the real-time environmental sound signal is detected to contain the designated keyword;
the instruction conversion module 40 is configured to convert the user spoken instruction into a machine logic instruction;
the execution module 50 is configured to send the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
Optionally, as shown in FIG. 9, the buffering module 10 includes:
a collection unit 101, configured to collect environmental sound and generate the real-time environmental sound signal;
a storage unit 102, configured to store the real-time environmental sound signal in the audio buffer in a circular buffering manner.
Optionally, the voice processing apparatus further includes a setting module, which includes:
a setting information obtaining unit, configured to obtain keyword setting information input by the user;
a specification judging unit, configured to judge whether the keyword setting information conforms to a preset specification;
a keyword determining unit, configured to determine the keyword setting information as the designated keyword if the keyword setting information conforms to the preset specification.
Optionally, the recognition module 30 includes:
a wake-up unit, configured to generate a wake-up instruction when the real-time environmental sound signal is detected to contain keyword speech;
a sentence end point detection unit, configured to monitor the user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
a spoken instruction conversion unit, configured to, if the user sentence end point in the real-time environmental sound signal is detected, recognize the real-time environmental sound signal before the end point and convert it into the user spoken instruction.
Optionally, the voice processing apparatus further includes a positioning module, which includes:
a target sound source judging unit, configured to detect all sound signals in the current environment and judge whether a target sound source meeting preset sound source requirements exists among all the sound signals;
an identification adding unit, configured to add an identification mark to the target sound source when a target sound source meeting the preset sound source requirements exists;
a sound source position determining unit, configured to locate the target sound source through a sound source localization operation to obtain its sound source position, which is associated with the identification mark.
Optionally, the recognition module 30 includes:
a parameter obtaining unit, configured to obtain tuning parameters matching the sound source position;
a sound optimization unit, configured to process the real-time environmental sound signal according to the tuning parameters to generate an optimized sound signal;
a speech recognition unit, configured to process the optimized sound signal using the voice recognition model to obtain the user spoken instruction.
For specific limitations on the voice processing apparatus, refer to the above limitations on the voice processing method, which are not repeated here. Each module in the above voice processing apparatus can be implemented in whole or in part by software, hardware, or a combination thereof. The above modules can be embedded in hardware form in, or be independent of, the processor of the computer device, or can be stored in software form in the memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device can be a server, and its internal structure diagram can be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store the data involved in the above voice processing method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a voice processing method. The readable storage media provided in this embodiment include a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:
buffering a real-time environmental sound signal through an audio buffer;
detecting whether the real-time environmental sound signal contains a designated keyword;
if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
converting the user spoken instruction into a machine logic instruction;
sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
In one embodiment, a computer-readable storage medium is provided; the readable storage media provided in this embodiment include a non-volatile readable storage medium and a volatile readable storage medium. Computer-readable instructions are stored on the readable storage medium, and when executed by a processor, they implement the following steps:
buffering a real-time environmental sound signal through an audio buffer;
detecting whether the real-time environmental sound signal contains a designated keyword;
if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
converting the user spoken instruction into a machine logic instruction;
sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium and, when executed, can include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, i.e., the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only used to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments, or equivalently replace some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all be included within the protection scope of this application.

Claims (20)

  1. A voice processing method, comprising:
    buffering a real-time environmental sound signal through an audio buffer;
    detecting whether the real-time environmental sound signal contains a designated keyword;
    if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
    converting the user spoken instruction into a machine logic instruction;
    sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
  2. The voice processing method according to claim 1, wherein buffering the real-time environmental sound signal through the audio buffer comprises:
    collecting environmental sound and generating the real-time environmental sound signal;
    storing the real-time environmental sound signal in the audio buffer in a circular buffering manner.
  3. The voice processing method according to claim 1, wherein before detecting whether the real-time environmental sound signal contains the designated keyword, the method further comprises:
    obtaining keyword setting information input by a user;
    judging whether the keyword setting information conforms to a preset specification;
    if the keyword setting information conforms to the preset specification, determining the keyword setting information as the designated keyword.
  4. The voice processing method according to claim 1, wherein, if the real-time environmental sound signal is detected to contain keyword speech, recognizing the real-time environmental sound signal to obtain the user spoken instruction comprises:
    when the real-time environmental sound signal is detected to contain keyword speech, generating a wake-up instruction;
    monitoring a user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
    if the user sentence end point in the real-time environmental sound signal is detected, recognizing the real-time environmental sound signal before the user sentence end point and converting the real-time environmental sound signal before the user sentence end point into the user spoken instruction.
  5. The voice processing method according to claim 1, wherein before buffering the real-time environmental sound signal through the audio buffer, the method comprises:
    detecting all sound signals in a current environment, and judging whether a target sound source meeting preset sound source requirements exists among all the sound signals;
    when a target sound source meeting the preset sound source requirements exists, adding an identification mark to the target sound source;
    locating the target sound source through a sound source localization operation to obtain a sound source position of the target sound source, the sound source position being associated with the identification mark.
  6. The voice processing method according to claim 5, wherein recognizing the real-time environmental sound signal through the voice recognition model to obtain the user spoken instruction comprises:
    obtaining tuning parameters matching the sound source position;
    processing the real-time environmental sound signal according to the tuning parameters to generate an optimized sound signal;
    processing the optimized sound signal using the voice recognition model to obtain the user spoken instruction.
  7. A voice processing apparatus, comprising:
    a buffering module, configured to buffer a real-time environmental sound signal through an audio buffer;
    a detection module, configured to detect whether the real-time environmental sound signal contains a designated keyword;
    a recognition module, configured to recognize the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction if the real-time environmental sound signal is detected to contain the designated keyword;
    an instruction conversion module, configured to convert the user spoken instruction into a machine logic instruction;
    an execution module, configured to send the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
  8. The voice processing apparatus according to claim 7, wherein the buffering module comprises:
    a collection unit, configured to collect environmental sound and generate the real-time environmental sound signal;
    a storage unit, configured to store the real-time environmental sound signal in the audio buffer in a circular buffering manner.
  9. The voice processing apparatus according to claim 7, further comprising a setting module, the setting module comprising:
    a setting information obtaining unit, configured to obtain keyword setting information input by a user;
    a specification judging unit, configured to judge whether the keyword setting information conforms to a preset specification;
    a keyword determining unit, configured to determine the keyword setting information as the designated keyword if the keyword setting information conforms to the preset specification.
  10. The voice processing apparatus according to claim 7, wherein the recognition module comprises:
    a wake-up unit, configured to generate a wake-up instruction when the real-time environmental sound signal is detected to contain keyword speech;
    a sentence end point detection unit, configured to monitor a user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
    a spoken instruction conversion unit, configured to, if the user sentence end point in the real-time environmental sound signal is detected, recognize the real-time environmental sound signal before the user sentence end point and convert it into the user spoken instruction.
  11. The voice processing apparatus according to claim 7, further comprising a positioning module, the positioning module comprising:
    a target sound source judging unit, configured to detect all sound signals in a current environment and judge whether a target sound source meeting preset sound source requirements exists among all the sound signals;
    an identification adding unit, configured to add an identification mark to the target sound source when a target sound source meeting the preset sound source requirements exists;
    a sound source position determining unit, configured to locate the target sound source through a sound source localization operation to obtain a sound source position of the target sound source, the sound source position being associated with the identification mark.
  12. The voice processing apparatus according to claim 11, wherein the recognition module comprises:
    a parameter obtaining unit, configured to obtain tuning parameters matching the sound source position;
    a sound optimization unit, configured to process the real-time environmental sound signal according to the tuning parameters to generate an optimized sound signal;
    a speech recognition unit, configured to process the optimized sound signal using the voice recognition model to obtain the user spoken instruction.
  13. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    buffering a real-time environmental sound signal through an audio buffer;
    detecting whether the real-time environmental sound signal contains a designated keyword;
    if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
    converting the user spoken instruction into a machine logic instruction;
    sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
  14. The computer device according to claim 13, wherein buffering the real-time environmental sound signal through the audio buffer comprises:
    collecting environmental sound and generating the real-time environmental sound signal;
    storing the real-time environmental sound signal in the audio buffer in a circular buffering manner.
  15. The computer device according to claim 13, wherein, before detecting whether the real-time environmental sound signal contains the designated keyword, the processor further implements the following steps when executing the computer-readable instructions:
    obtaining keyword setting information input by a user;
    judging whether the keyword setting information conforms to a preset specification;
    if the keyword setting information conforms to the preset specification, determining the keyword setting information as the designated keyword.
  16. The computer device according to claim 13, wherein, if the real-time environmental sound signal is detected to contain keyword speech, recognizing the real-time environmental sound signal to obtain the user spoken instruction comprises:
    when the real-time environmental sound signal is detected to contain keyword speech, generating a wake-up instruction;
    monitoring a user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
    if the user sentence end point in the real-time environmental sound signal is detected, recognizing the real-time environmental sound signal before the user sentence end point and converting the real-time environmental sound signal before the user sentence end point into the user spoken instruction.
  17. One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    buffering a real-time environmental sound signal through an audio buffer;
    detecting whether the real-time environmental sound signal contains a designated keyword;
    if the real-time environmental sound signal is detected to contain the designated keyword, recognizing the real-time environmental sound signal through a voice recognition model to obtain a user spoken instruction;
    converting the user spoken instruction into a machine logic instruction;
    sending the machine logic instruction to an execution device, so that the execution device executes the machine logic instruction.
  18. The readable storage media according to claim 17, wherein buffering the real-time environmental sound signal through the audio buffer comprises:
    collecting environmental sound and generating the real-time environmental sound signal;
    storing the real-time environmental sound signal in the audio buffer in a circular buffering manner.
  19. The readable storage media according to claim 17, wherein, before detecting whether the real-time environmental sound signal contains the designated keyword, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    obtaining keyword setting information input by a user;
    judging whether the keyword setting information conforms to a preset specification;
    if the keyword setting information conforms to the preset specification, determining the keyword setting information as the designated keyword.
  20. The readable storage media according to claim 17, wherein, if the real-time environmental sound signal is detected to contain keyword speech, recognizing the real-time environmental sound signal to obtain the user spoken instruction comprises:
    when the real-time environmental sound signal is detected to contain keyword speech, generating a wake-up instruction;
    monitoring a user sentence end point in the real-time environmental sound signal according to the wake-up instruction;
    if the user sentence end point in the real-time environmental sound signal is detected, recognizing the real-time environmental sound signal before the user sentence end point and converting the real-time environmental sound signal before the user sentence end point into the user spoken instruction.
PCT/CN2019/116513 2019-05-10 2019-11-08 Voice processing method, apparatus, computer device and storage medium WO2020228270A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910390372.2 2019-05-10
CN201910390372.2A CN110232916A (zh) 2019-05-10 2019-09-13 Voice processing method, apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2020228270A1 true WO2020228270A1 (zh) 2020-11-19

Family

ID=67860467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116513 WO2020228270A1 (zh) 2019-05-10 2019-11-08 语音处理方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN110232916A (zh)
WO (1) WO2020228270A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232916A (zh) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 语音处理方法、装置、计算机设备及存储介质
CN111739515B (zh) * 2019-09-18 2023-08-04 北京京东尚科信息技术有限公司 语音识别方法、设备、电子设备和服务器、相关系统
CN111208736B (zh) * 2019-12-17 2023-10-27 中移(杭州)信息技术有限公司 智能音箱控制方法、装置、电子设备及存储介质
CN110970028B (zh) * 2019-12-26 2022-07-22 杭州中科先进技术研究院有限公司 一种规范语音识别设备的语音识别指令与操作指令的方法
CN111681655A (zh) * 2020-05-21 2020-09-18 北京声智科技有限公司 语音控制方法、装置、电子设备及存储介质
CN112153397B (zh) * 2020-09-16 2023-03-14 北京达佳互联信息技术有限公司 视频处理方法、装置、服务器及存储介质
CN112435670A (zh) * 2020-11-11 2021-03-02 青岛歌尔智能传感器有限公司 语音识别方法、语音识别设备和计算机可读存储介质
CN112201246B (zh) * 2020-11-19 2023-11-28 深圳市欧瑞博科技股份有限公司 基于语音的智能控制方法、装置、电子设备及存储介质
CN112416776B (zh) * 2020-11-24 2022-12-13 天津五八到家货运服务有限公司 运行环境的选择方法、装置、测试设备及存储介质
CN112420044A (zh) * 2020-12-03 2021-02-26 深圳市欧瑞博科技股份有限公司 语音识别方法、语音识别装置及电子设备
CN112581978A (zh) * 2020-12-11 2021-03-30 平安科技(深圳)有限公司 声音事件检测与定位方法、装置、设备及可读存储介质
CN112765335B (zh) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 语音呼梯系统
WO2023283965A1 (zh) * 2021-07-16 2023-01-19 华为技术有限公司 用于语音代听和生成语音识别模型的方法、装置、电子设备和介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103943105A (zh) * 2014-04-18 2014-07-23 安徽科大讯飞信息科技股份有限公司 一种语音交互方法及系统
CN104538030A (zh) * 2014-12-11 2015-04-22 科大讯飞股份有限公司 一种可以通过语音控制家电的控制系统与方法
CN105654943A (zh) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 一种语音唤醒方法、装置及系统
CN108831483A (zh) * 2018-09-07 2018-11-16 马鞍山问鼎网络科技有限公司 一种人工智能语音识别系统
CN109584896A (zh) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 一种语音芯片及电子设备
CN110232916A (zh) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 语音处理方法、装置、计算机设备及存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5815956B2 (ja) * 2011-02-10 2015-11-17 キヤノン株式会社 音声処理装置及びプログラム
US9972315B2 (en) * 2015-01-14 2018-05-15 Honda Motor Co., Ltd. Speech processing device, speech processing method, and speech processing system
BR112017021673B1 (pt) * 2015-04-10 2023-02-14 Honor Device Co., Ltd Método de controle de voz, meio não-transitório legível por computador e terminal
CN107705785A (zh) * 2017-08-01 2018-02-16 百度在线网络技术(北京)有限公司 智能音箱的声源定位方法、智能音箱及计算机可读介质
CN107808670B (zh) * 2017-10-25 2021-05-14 百度在线网络技术(北京)有限公司 语音数据处理方法、装置、设备及存储介质
CN109754814B (zh) * 2017-11-08 2023-07-28 阿里巴巴集团控股有限公司 一种声音处理方法、交互设备
CN108682414A (zh) * 2018-04-20 2018-10-19 深圳小祺智能科技有限公司 语音控制方法、语音系统、设备和存储介质
CN109147779A (zh) * 2018-08-14 2019-01-04 苏州思必驰信息科技有限公司 语音数据处理方法和装置
CN108962262B (zh) * 2018-08-14 2021-10-08 思必驰科技股份有限公司 语音数据处理方法和装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103943105A (zh) * 2014-04-18 2014-07-23 安徽科大讯飞信息科技股份有限公司 一种语音交互方法及系统
CN104538030A (zh) * 2014-12-11 2015-04-22 科大讯飞股份有限公司 一种可以通过语音控制家电的控制系统与方法
CN105654943A (zh) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 一种语音唤醒方法、装置及系统
CN108831483A (zh) * 2018-09-07 2018-11-16 马鞍山问鼎网络科技有限公司 一种人工智能语音识别系统
CN109584896A (zh) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 一种语音芯片及电子设备
CN110232916A (zh) * 2019-05-10 2019-09-13 平安科技(深圳)有限公司 语音处理方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN110232916A (zh) 2019-09-13

Similar Documents

Publication Publication Date Title
WO2020228270A1 (zh) Voice processing method, apparatus, computer device and storage medium
US20240038236A1 (en) Activation trigger processing
CN108520743B (zh) 智能设备的语音控制方法、智能设备及计算机可读介质
US10410635B2 (en) Dual mode speech recognition
CN111566730B (zh) 低功率设备中的语音命令处理
CN111223497B (zh) 一种终端的就近唤醒方法、装置、计算设备及存储介质
KR20180084392A (ko) 전자 장치 및 그의 동작 방법
US11978478B2 (en) Direction based end-pointing for speech recognition
CN110223687B (zh) 指令执行方法、装置、存储介质及电子设备
WO2020048431A1 (zh) 一种语音处理方法、电子设备和显示设备
US20200279568A1 (en) Speaker verification
US11393490B2 (en) Method, apparatus, device and computer-readable storage medium for voice interaction
US20230298575A1 (en) Freeze Words
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
KR20230113368A (ko) 검출들의 시퀀스에 기반한 핫프레이즈 트리거링
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
US20230113883A1 (en) Digital Signal Processor-Based Continued Conversation
TW202029017A (zh) 音訊裝置以及語音控制方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19928940

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19928940

Country of ref document: EP

Kind code of ref document: A1