WO2020135067A1 - Voice interaction method and device, robot, and computer-readable storage medium - Google Patents

Voice interaction method and device, robot, and computer-readable storage medium

Info

Publication number
WO2020135067A1
WO2020135067A1 (PCT/CN2019/124844)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
robot
output
speech
Prior art date
Application number
PCT/CN2019/124844
Other languages
English (en)
French (fr)
Inventor
崔锦
谈华斌
吉东旭
胡斌
林东
乔光辉
司丽娟
Original Assignee
同方威视技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 同方威视技术股份有限公司
Publication of WO2020135067A1 publication Critical patent/WO2020135067A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/22 Interactive procedures; Man-machine interfaces (speaker identification or verification)
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, by checking connectivity

Definitions

  • the present disclosure relates to the technical field of robots, and in particular, to a voice interaction method, device, robot, and computer-readable storage medium.
  • voice interaction is a more natural interaction method between humans and robots.
  • Voice interaction is simple and fast, frees the user's hands and eyes, and brings a better experience to users in many scenarios.
  • the robot's back end can also collect user data through the voice interaction process, laying the foundation for subsequent value-added services. Providing users with better voice interaction services is therefore an important goal in the field of voice robots.
  • a technical problem solved by the present disclosure is how to make the robot provide the user with a voice interaction service corresponding to the user's language.
  • a voice interaction method including: processing a voice input by a user using a deep learning neural network to recognize a language of the voice; and providing a robot voice interaction service corresponding to the language for the user.
  • the method further includes: training the deep learning neural network using corpora marked with different languages, so that the deep learning neural network can recognize the language of the speech input by the user.
  • the method further includes: determining the robot's current network connection status; if the current network connection status is online, sending the voice input by the user to the cloud server for voice recognition, semantic understanding, and speech synthesis, and receiving the first output speech fed back from the cloud; at the same time, sending the voice input by the user to the local knowledge base for voice recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; if the first output speech is received before the second output speech is obtained, playing the first output speech to the user; and if the second output speech is obtained before the first output speech is received, playing the second output speech to the user.
  • the method further includes: if the current network status is offline, sending the voice input by the user to a local knowledge base for voice recognition, semantic understanding, and voice synthesis to obtain the locally fed-back second output speech, and playing the second output speech to the user.
  • during voice recognition, the speech input by the user is segmented into sentences according to the time intervals between the words in the speech and the speech energy.
  • during voice recognition, a hidden Markov model is used to segment the speech input by the user into sentences.
  • the method further includes: in the sleep state of the robot, using a camera to recognize human faces in real time; and in response to the camera recognizing a face, awakening the robot from the sleep state to the working state.
  • a voice interactive robot including a front-end processing chip configured to: use a deep learning neural network to process the voice input by a user and recognize the language of the voice; and provide the user with a robot voice interaction service corresponding to the language.
  • the deep learning neural network can recognize the language of the speech input by the user after being trained with corpora labeled with different languages.
  • it also includes a host computer configured to: determine the current network connection status of the robot; if the current network connection status is online, send the voice input by the user to the cloud server for voice recognition, semantic understanding, and speech synthesis, and receive the first output speech fed back from the cloud; at the same time, send the voice input by the user to the local knowledge base for voice recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; if the first output speech is received before the second output speech is obtained, play the first output speech to the user; and if the second output speech is obtained before the first output speech is received, play the second output speech to the user.
  • the host computer is further configured to: if the current network status is offline, send the voice input by the user to a local knowledge base for voice recognition, semantic understanding, and voice synthesis to obtain the locally fed-back second output speech; and play the second output speech to the user.
  • the host computer is further configured to segment the speech input by the user into sentences according to the time intervals between the words in the speech and the speech energy.
  • the host computer is further configured to: use the Hidden Markov Model to segment the speech input by the user.
  • it further includes a camera configured to: recognize human faces in real time while the robot is in the sleep state; and wake up the robot from the sleep state to the working state when a face is recognized.
  • a microphone array is further included, and the multiple microphones of the microphone array are located on the same circumference on the same horizontal plane, and the circumferential distances between adjacent microphones are equal.
  • the microphone array is covered with a silicone sleeve, and the silicone sleeve is fixedly connected to the housing of the voice robot.
  • it further includes a loudspeaker and multiple cavities, wherein the loudspeaker and the microphone array are disposed in different cavities of the voice interactive robot.
  • it also includes soundproof cotton wrapped around the microphone array.
  • a voice interaction device including: a memory; and a processor coupled to the memory, the processor configured to execute the foregoing voice interaction method based on instructions stored in the memory.
  • a computer-readable storage medium wherein the computer-readable storage medium stores computer instructions, and the instructions implement the aforementioned voice interaction method when executed by a processor.
  • the present disclosure enables the robot to provide users with a voice interaction service corresponding to the user's language, and improves the user experience of the robot voice interaction service.
  • FIG. 1 shows a schematic flowchart of a voice interaction method according to some embodiments of the present disclosure.
  • FIG. 2 shows a schematic flowchart of a voice interaction method according to other embodiments of the present disclosure.
  • FIG. 3 shows a schematic flowchart of a voice interaction method according to still other embodiments of the present disclosure.
  • FIG. 4 shows a schematic structural diagram of a voice interactive robot according to some embodiments of the present disclosure.
  • FIG. 5 shows a schematic structural diagram of a front-end processing chip.
  • FIG. 6 shows a schematic structural diagram of a voice interaction device according to some embodiments of the present disclosure.
  • the voice interaction method of the present disclosure will first be described with reference to FIG. 1 to explain how the present disclosure enables the robot to provide users with voice interaction services corresponding to the user's language.
  • FIG. 1 shows a schematic flowchart of a voice interaction method according to some embodiments of the present disclosure. As shown in FIG. 1, this embodiment includes steps S102 to S106.
  • step S102 the deep learning neural network is trained using corpora marked with different languages, so that the deep learning neural network can recognize the language of the speech input by the user.
  • a large amount of English corpus and Chinese corpus can be used to train the deep learning neural network, so that the deep learning neural network can accurately recognize the speech input by the user as Chinese speech or English speech.
  • step S104 the deep learning neural network is used to process the speech input by the user, and the language type of the user's speech is recognized.
  • the speech input by the user can be recognized as Chinese speech or English speech.
  • step S106 the user is provided with a robot voice interaction service corresponding to the language.
  • if the voice input by the user is recognized as Chinese speech, voice algorithms such as voice recognition, semantic understanding, and speech synthesis in a Chinese voice interaction platform (for example, a platform from a speech vendor such as iFLYTEK) are invoked to provide the user with Chinese voice interaction services; if the voice input by the user is recognized as English speech, the corresponding voice algorithms in a voice interaction platform such as the Amazon Alexa voice service platform are invoked to provide the user with English voice interaction services.
  • the voice interaction method provided in the above embodiment enables the voice robot to provide voice interaction services corresponding to the user's language in complex language scenarios, allows the voice robot to conduct Chinese and English voice interaction with the user accurately, and improves the user experience of the robot voice interaction service.
  • for the invoked voice interaction platform, the voice recognition function can be improved.
  • when an existing voice interaction platform converts voice into text, it is necessary to segment the text into sentences to facilitate semantic understanding.
  • to make sentence segmentation more accurate, the following three technical measures can be used.
  • first, the speech can be segmented according to the time intervals between words and the speech energy: a sentence break is made when the time interval between words exceeds a preset duration, or when the speech energy is detected to be less than a preset energy value (close to zero).
  • second, the speech can be segmented according to the user's intent. For example, a keyword library can be used to extract the keywords in the utterance "Where is the Shanghai University City bookstore".
  • the extracted keyword "Shanghai" carries a place intent attribute, "at" carries a location intent attribute, and "where" carries a query intent attribute.
  • from the intent attributes of the keywords, the intent of the utterance can be determined to be a position query.
  • then, a pre-trained neural network corresponding to the position-query intent is used to segment the speech.
  • third, a deep learning network can segment the speech directly: training a CNN-HMM hidden Markov model on a massive corpus gives the trained CNN-HMM model the ability to segment speech directly.
  • if the above three measures are combined, the speech input by the user can be segmented more accurately, thereby further improving the user experience of the robot voice interaction service.
  • FIG. 2 shows a schematic flowchart of a voice interaction method according to other embodiments of the present disclosure. As shown in FIG. 2, this embodiment includes steps S201 to S205.
  • step S201 the current network connection status of the robot is determined.
  • for example, a ping message can be sent to a specific network port. If a reply message is received, the current network connection status is online, and step S202 is performed; if no reply message is received, the current network status is offline, and step S204 is performed.
  • in step S202, the voice input by the user is sent to the cloud server for voice recognition, semantic understanding, and speech synthesis, and the first output speech fed back by the cloud is received; at the same time, the voice input by the user is sent to the local knowledge base for voice recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech.
  • step S203 the output voice obtained first among the first output voice and the second output voice is played to the user.
  • if the first output voice is received before the second output voice is obtained, the first output voice is played to the user; if the second output voice is obtained before the first output voice is received, the second output voice is played to the user.
  • in step S204, the voice input by the user is sent to a local knowledge base for voice recognition, semantic understanding, and voice synthesis to obtain the locally fed-back second output voice.
  • step S205 the second output voice is played to the user.
  • the above-mentioned embodiments provide a combined offline/online way to recognize and understand voice.
  • this embodiment enables the voice robot to provide voice interaction services when network conditions are poor, and to provide more accurate voice interaction services when network conditions are good, which further improves the user experience of the robot voice interaction service.
  • FIG. 3 shows a schematic flowchart of a voice interaction method according to still other embodiments of the present disclosure. As shown in FIG. 3, this embodiment includes steps S302 to S304.
  • step S302 in the sleep state of the robot, the camera is used to recognize the face in real time.
  • if a human face is recognized, step S304 is performed; if a human face is not recognized, the process returns to step S302.
  • step S304 the robot is awakened from the sleep state to the working state.
  • the system's wake-up program is called to wake the robot from a sleep state to a working state.
  • the face wake-up function of the voice robot is realized through the cooperation of the camera and the system wake-up program.
  • the user does not need to speak a wake word to wake up the voice robot, and only needs to approach the camera to perform voice interaction with the voice robot.
  • the following describes some embodiments of the voice interactive robot of the present disclosure with reference to FIG. 4 to illustrate the hardware architecture of the voice interactive robot of the present disclosure.
  • FIG. 4 shows a schematic structural diagram of a voice interactive robot according to some embodiments of the present disclosure.
  • the voice interactive robot 40 in this embodiment includes a microphone 401, a front-end processing chip 402, a host computer 403, and a loudspeaker 404.
  • the front-end processing chip 402 is configured to: use the deep learning neural network to process the voice input by the user, recognize the language of the voice; and provide the user with a robot voice interaction service corresponding to the language.
  • the deep learning neural network can recognize the language of the voice input by the user after being trained with corpora marked with different languages.
  • FIG. 5 shows a schematic structural diagram of a front-end processing chip.
  • the software program in the front-end processing chip can be developed on the Android platform, and specifically includes a voice recognition module 5021, a voice proxy module 5022, a voice listening module 5023, a voice service module 5024, a control service module 5025, and a connection service module 5026.
  • the functions of each module are as follows:
  • the voice recognition module 5021 is used to perform language recognition on the received voice, and call different voice interactive services according to the recognized language category;
  • the voice agent module 5022 is used to send a voice service message to the voice service module 5024;
  • the voice listening module 5023 is used to obtain the feedback voice service message from the voice service module 5024;
  • the voice service module 5024 is used to implement basic voice algorithm services, including voice recognition, semantic understanding, and speech synthesis;
  • the control service module 5025 is used to realize the control services of the voice robot according to voice instructions;
  • the connection service module 5026 is used to implement connection control services between the voice robot and external devices according to voice instructions.
  • a set of interfaces can be used between the various working modules in the front-end processing chip to realize language recognition and voice service invocation functions.
  • the entire voice robot needs only one set of software and hardware to provide multilingual voice interaction services.
  • the host computer 403 is configured to: determine the current network connection status of the robot; if the current network connection status is online, send the voice input by the user to the cloud server for voice recognition, semantic understanding, and speech synthesis, and receive the first output speech fed back from the cloud; at the same time, send the voice input by the user to the local knowledge base for voice recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; if the first output speech is received before the second output speech is obtained, play the first output speech to the user; and if the second output speech is obtained before the first output speech is received, play the second output speech to the user.
  • the host computer 403 is further configured to: if the current network status is offline, send the voice input by the user to the local knowledge base for voice recognition, semantic understanding, and voice synthesis to obtain the locally fed-back second output speech; and play the second output speech to the user.
  • the local knowledge base may also be located in the host computer 403.
  • the voice interactive robot 40 further includes a camera 405 configured to: recognize a human face in real time when the robot is in a sleep state; and wake up the robot from a sleep state to a working state when a human face is recognized.
  • the host computer 403 can not only communicate with the front-end processing chip 402 through a serial port, but also connect with the microphone 401, the loudspeaker 404, the camera 405, and a detection sensor 406. The host computer 403 has the following voice application functions:
  • after acquiring the voice input by the user from the microphone 401, the host computer 403 forwards the voice to the front-end processing chip 402 for voice processing; if the voice input by the user is a motion control instruction, the host computer 403 can perform motion control on the voice robot 40; if the voice input by the user is a state control instruction, the host computer 403 can adjust the working state of the voice robot 40, such as voice service initialization, wake-up/sleep, starting/stopping voice interaction, starting/stopping recording, and so on;
  • the voice interactive robot 40 further includes a terminal device 407.
  • the terminal device 407 may be, for example, a tablet computer, and may communicate with the front-end processing chip 402 through the TCP/IP protocol, so that the user can invoke the voice interaction service of the voice robot through the terminal device.
  • the microphone 401 is a microphone array.
  • the design standard of the microphone 401 is as follows.
  • the multiple microphones of the microphone array are located on the same circumference on the same horizontal plane, and the circumferential distance between adjacent microphones is equal, so that the direct sound of human voice has an equal chance of reaching each microphone in the microphone array.
  • the microphone array is covered with a silicone sleeve, and the silicone sleeve is fixedly connected with the shell of the voice robot.
  • a silicone sleeve is used to dampen and seal the microphone, and glue is used to fix the silicone sleeve to the robot housing.
  • this keeps the microphone away from interference and vibration (including loudspeaker vibration, vibration from rotating structures, etc.).
  • the loudspeaker and the microphone are set in different cavities of the voice robot. For example, the microphone is placed in the head of the voice robot and wrapped in soundproof cotton, while the loudspeaker is placed in the belly of the voice robot so that the loudspeaker's sound cannot leak through the structure into the microphone.
  • the microphone arrangement gives the voice robot a certain noise resistance and a long-range, multi-directional voice recognition capability, making it suitable for interacting with people at a distance in noisy workplaces such as airport interiors, exhibition venue entrances, and entry/exit ports.
  • the microphone 401 and the front-end processing chip 402 may undergo device authentication for the relevant voice interaction platforms.
  • the voice robot 40 also includes a router 408.
  • the router 408 can be connected to a cellular network and a WiFi network, and can automatically switch between the cellular network and the WiFi network to ensure network stability and low latency.
  • the host computer 403 can be connected to cloud servers of different voice interaction platforms through a router 408.
  • the knowledge base includes a general knowledge base and a professional knowledge base.
  • the professional knowledge base is divided into a question and answer database and a skill database.
  • the general knowledge base covers fields that users use daily, such as greetings, everyday queries, and encyclopedic knowledge;
  • the professional knowledge base includes industry domain knowledge and Q&A, such as safety and security inspection.
  • the question-and-answer library handles question-and-answer business requirements, while the skill library handles multi-round interactions, which may involve business logic, data, and other processing needs.
  • when the voice robot handles safety and security inspection services, it can assist security personnel by explaining legal terms, publicizing policies and regulations, guiding consulting services, organizing and analyzing on-site user-robot interactions, and studying safety and security regulations.
  • the voice robot 40 can achieve the performance indicators listed in the table in the description (for example, a recognition rate above 92% at 1 meter) when the environmental noise is below 60 dB.
  • FIG. 6 shows a schematic structural diagram of a voice interaction device according to some embodiments of the present disclosure.
  • the voice interaction device 60 of this embodiment includes a memory 610 and a processor 620 coupled to the memory 610.
  • the processor 620 is configured to execute the voice interaction method of any of the foregoing embodiments based on instructions stored in the memory 610.
  • the memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, and so on.
  • the system memory stores, for example, an operating system, application programs, a boot loader (Boot Loader), and other programs.
  • the voice interaction device 60 may further include an input-output interface 630, a network interface 640, a storage interface 650, and the like.
  • the interfaces 630, 640, 650 and the memory 610 and the processor 620 may be connected via a bus 660, for example.
  • the input and output interface 630 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen.
  • the network interface 640 provides connection interfaces for various networked devices.
  • the storage interface 650 provides a connection interface for external storage devices such as SD cards and U disks.
  • the present disclosure also includes a computer-readable storage medium on which computer instructions are stored, which when executed by a processor implements the voice interaction method in any of the foregoing embodiments.
  • the embodiments of the present disclosure may be provided as methods, systems, or computer program products. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions.
  • these computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Manipulator (AREA)

Abstract

A voice interaction method and device, a robot, and a computer-readable storage medium, relating to the technical field of robots. The voice interaction method includes: processing speech input by a user with a deep learning neural network to identify the language of the speech (S104); and providing the user with a robot voice interaction service corresponding to the language (S106). This enables the robot to provide the user with a voice interaction service corresponding to the user's language, improving the user experience of the robot voice interaction service.

Description

Voice interaction method and device, robot, and computer-readable storage medium
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is based on and claims priority to CN application No. 201811581336.6, filed on December 24, 2018, the disclosure of which is incorporated into this application by reference in its entirety.
TECHNICAL FIELD
The present disclosure relates to the technical field of robots, and in particular to a voice interaction method and device, a robot, and a computer-readable storage medium.
BACKGROUND
The evolution of human-machine interaction has moved ever closer to instinctive human expression. Among human-machine interaction methods, voice interaction is a more natural means of interaction between humans and robots: it is simple and fast, frees the user's hands and eyes, and brings a better experience to users in many scenarios.
In addition, through the voice interaction process, the robot's back end can also collect user data, laying a foundation for subsequent value-added services. Providing users with better voice interaction services is therefore an important goal in the field of voice robots.
SUMMARY
One technical problem solved by the present disclosure is how to enable a robot to provide a user with a voice interaction service corresponding to the user's language.
According to one aspect of the embodiments of the present disclosure, a voice interaction method is provided, including: processing speech input by a user with a deep learning neural network to identify the language of the speech; and providing the user with a robot voice interaction service corresponding to the language.
In some embodiments, the method further includes: training the deep learning neural network with corpora labeled with different languages, so that the deep learning neural network can identify the language of the speech input by the user.
In some embodiments, the method further includes: determining the robot's current network connection state; if the current network connection state is online, sending the speech input by the user to a cloud server for speech recognition, semantic understanding, and speech synthesis, and receiving a first output speech fed back from the cloud; at the same time, sending the speech input by the user to a local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain a second output speech fed back locally; if the first output speech is received before the second output speech is obtained, playing the first output speech to the user; and if the second output speech is obtained before the first output speech is received, playing the second output speech to the user.
In some embodiments, the method further includes: if the current network state is offline, sending the speech input by the user to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; and playing the second output speech to the user.
In some embodiments, during speech recognition, the speech input by the user is segmented into sentences according to the time intervals between the words in the speech and the speech energy.
In some embodiments, during speech recognition, a hidden Markov model is used to segment the speech input by the user into sentences.
In some embodiments, the method further includes: in the sleep state of the robot, using a camera to recognize human faces in real time; and in response to the camera recognizing a face, waking the robot from the sleep state to the working state.
According to another aspect of the embodiments of the present disclosure, a voice interactive robot is provided, including a front-end processing chip configured to: process speech input by a user with a deep learning neural network to identify the language of the speech; and provide the user with a robot voice interaction service corresponding to the language.
In some embodiments, the deep learning neural network, having been trained with corpora labeled with different languages, can identify the language of the speech input by the user.
In some embodiments, the robot further includes a host computer configured to: determine the robot's current network connection state; if the current network connection state is online, send the speech input by the user to a cloud server for speech recognition, semantic understanding, and speech synthesis, and receive a first output speech fed back from the cloud; at the same time, send the speech input by the user to a local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain a second output speech fed back locally; if the first output speech is received before the second output speech is obtained, play the first output speech to the user; and if the second output speech is obtained before the first output speech is received, play the second output speech to the user.
In some embodiments, the host computer is further configured to: if the current network state is offline, send the speech input by the user to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; and play the second output speech to the user.
In some embodiments, the host computer is further configured to segment the speech input by the user into sentences according to the time intervals between the words in the speech and the speech energy.
In some embodiments, the host computer is further configured to use a hidden Markov model to segment the speech input by the user into sentences.
In some embodiments, the robot further includes a camera configured to: recognize human faces in real time while the robot is in the sleep state; and, when a face is recognized, wake the robot from the sleep state to the working state.
In some embodiments, the robot further includes a microphone array; the multiple mic heads of the microphone array are located on the same circumference in the same horizontal plane, and the circumferential distances between adjacent mic heads are equal.
In some embodiments, the microphone array is covered with a silicone sleeve, and the silicone sleeve is fixedly connected to the housing of the voice robot.
In some embodiments, the robot further includes a loudspeaker and multiple cavities, wherein the loudspeaker and the microphone array are arranged in different cavities of the voice interactive robot.
In some embodiments, the robot further includes soundproof cotton wrapped around the microphone array.
According to yet another aspect of the embodiments of the present disclosure, a voice interaction device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the aforementioned voice interaction method based on instructions stored in the memory.
According to still another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores computer instructions that, when executed by a processor, implement the aforementioned voice interaction method.
The present disclosure enables a robot to provide a user with a voice interaction service corresponding to the user's language, improving the user experience of the robot voice interaction service.
Other features and advantages of the present disclosure will become clear from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a schematic flowchart of a voice interaction method according to some embodiments of the present disclosure.
FIG. 2 shows a schematic flowchart of a voice interaction method according to other embodiments of the present disclosure.
FIG. 3 shows a schematic flowchart of a voice interaction method according to still other embodiments of the present disclosure.
FIG. 4 shows a schematic structural diagram of a voice interactive robot according to some embodiments of the present disclosure.
FIG. 5 shows a schematic structural diagram of the front-end processing chip.
FIG. 6 shows a schematic structural diagram of a voice interaction device according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present disclosure or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
Some embodiments of the voice interaction method of the present disclosure are first described with reference to FIG. 1, to explain how the present disclosure enables a robot to provide a user with a voice interaction service corresponding to the user's language.
FIG. 1 shows a schematic flowchart of a voice interaction method according to some embodiments of the present disclosure. As shown in FIG. 1, this embodiment includes steps S102 to S106.
In step S102, the deep learning neural network is trained with corpora labeled with different languages, so that the deep learning neural network can identify the language of the speech input by the user.
For example, a large amount of English corpus and Chinese corpus can be used to train the deep learning neural network, so that it can accurately identify the speech input by the user as Chinese speech or English speech.
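The patent does not specify a network architecture for this training step. As a rough, non-authoritative illustration, the sketch below substitutes a small MLP classifier over averaged MFCC features for the deep learning neural network; the corpus layout (corpus/zh/*.wav, corpus/en/*.wav) and file names are assumptions.

```python
# Minimal sketch of training a Chinese/English language classifier.
# The patent's deep learning network is unspecified; a small MLP over
# averaged MFCC features stands in here. Corpus paths are assumptions.
import glob

import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def mfcc_vector(path, sr=16000, n_mfcc=20):
    """Load a clip and average its MFCCs over time into one feature vector."""
    audio, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

X, y = [], []
for label, pattern in [("zh", "corpus/zh/*.wav"), ("en", "corpus/en/*.wav")]:
    for path in glob.glob(pattern):
        X.append(mfcc_vector(path))
        y.append(label)

model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
model.fit(np.array(X), y)

# At run time, the robot extracts the same features from the user's utterance
# and routes the request to the platform matching the predicted language.
print(model.predict([mfcc_vector("utterance.wav")]))  # e.g. ['zh']
```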
In step S104, the deep learning neural network is used to process the speech input by the user, and the language of the user's speech is identified.
For example, by processing the speech input by the user with the deep learning neural network, the speech can be identified as Chinese speech or English speech.
In step S106, the user is provided with a robot voice interaction service corresponding to the language.
If the speech input by the user is identified as Chinese speech, speech algorithms such as speech recognition, semantic understanding, and speech synthesis in a Chinese voice interaction platform (for example, a platform from a speech vendor such as iFLYTEK) are invoked to provide the user with Chinese voice interaction services; if the speech input by the user is identified as English speech, the corresponding algorithms in a voice interaction platform such as the Amazon Alexa voice service platform are invoked to provide the user with English voice interaction services.
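To make the routing step concrete, the following minimal sketch dispatches a recognized language to a platform-specific pipeline. The platform client objects and their recognize/understand/synthesize methods are placeholders, not the APIs of any real vendor SDK.

```python
# Minimal sketch of per-language platform routing. The client objects are
# placeholders with assumed recognize/understand/synthesize methods; they do
# not reflect any real vendor SDK.
def interact(audio, language, platforms):
    """Run one interaction turn on the platform matching the language."""
    platform = platforms[language]          # e.g. {"zh": ..., "en": ...}
    text = platform.recognize(audio)        # speech recognition
    reply = platform.understand(text)       # semantic understanding
    return platform.synthesize(reply)       # speech synthesis -> output speech
```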
The voice interaction method provided by the above embodiments enables the voice robot to provide voice interaction services corresponding to the user's language in complex language scenarios, allows the voice robot to conduct Chinese and English voice interaction with the user accurately, and improves the user experience of the robot voice interaction service.
The speech recognition function of the invoked voice interaction platform can be improved. When an existing voice interaction platform converts speech into text, the text must be segmented into sentences to facilitate semantic understanding. To make sentence segmentation more accurate, the following three technical measures can be used.
(1) Segment the speech according to the time intervals between the words in the speech and the speech energy.
For example, a sentence break is made when the time interval between words is detected to exceed a preset duration, or when the speech energy is detected to be less than a preset energy value (close to zero).
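A minimal sketch of this first measure is given below, assuming 16 kHz mono samples in a NumPy array. The frame size, energy floor, and minimum pause length are illustrative thresholds, not values from the patent.

```python
# Minimal sketch of pause/energy-based sentence segmentation over raw samples.
import numpy as np

def segment_by_energy(samples, sr=16000, frame_ms=20,
                      energy_floor=1e-4, min_pause_ms=300):
    """Return (begin_sample, end_sample) sentence spans split at long pauses."""
    frame = int(sr * frame_ms / 1000)               # samples per frame
    n_frames = len(samples) // frame
    energy = np.array([np.mean(samples[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    silent = energy < energy_floor                  # near-zero energy frames

    min_pause = min_pause_ms // frame_ms            # pause length in frames
    segments, start, pause = [], None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i                           # sentence begins
            pause = 0
        elif start is not None:
            pause += 1
            if pause >= min_pause:                  # long pause: cut here
                segments.append((start * frame, (i - pause + 1) * frame))
                start, pause = None, 0
    if start is not None:                           # trailing sentence
        segments.append((start * frame, n_frames * frame))
    return segments
```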
(2) Segment the speech according to the user's intent.
For example, a keyword library can be used to extract the keywords in the utterance 上海大学城书店在哪里 ("Where is the Shanghai University City bookstore"). The extracted keyword 上海 ("Shanghai") carries a place intent attribute, 在 ("at") carries a location intent attribute, and 哪里 ("where") carries a query intent attribute. From the intent attributes of the keywords, the intent of the utterance can be determined to be a position query. Then, a pre-trained neural network corresponding to the position-query intent is used to segment the speech.
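A minimal sketch of the keyword/intent lookup follows. The keyword library contents and the attribute names are illustrative assumptions, not the patent's actual library.

```python
# Minimal sketch of keyword-based intent detection for an utterance.
# Library entries and attribute names are illustrative assumptions.
KEYWORD_LIBRARY = {
    "上海": "place",      # 'Shanghai' -> place attribute
    "在": "location",     # 'at'       -> location attribute
    "哪里": "query",      # 'where'    -> query attribute
}

def utterance_intent(text):
    attrs = {attr for kw, attr in KEYWORD_LIBRARY.items() if kw in text}
    # A place + location + query combination maps to a position-query intent,
    # which then selects the matching pre-trained segmentation network.
    return "query_position" if {"place", "location", "query"} <= attrs else "unknown"

print(utterance_intent("上海大学城书店在哪里"))  # query_position
```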
(3) Segment the speech directly with a deep learning neural network.
For example, training a CNN-HMM hidden Markov model on a massive corpus gives the trained CNN-HMM model the ability to segment speech directly.
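As a loose stand-in for the CNN-HMM idea, the sketch below fits a plain two-state Gaussian HMM (speech versus pause) on log frame energies with hmmlearn and reads sentence breaks off the decoded state sequence; this simplification is an assumption, not the patent's model.

```python
# Minimal sketch standing in for CNN-HMM segmentation: a two-state Gaussian
# HMM (speech vs. pause) decoded over log frame energies. This simplifies
# the patent's CNN-HMM considerably and is an assumption.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def hmm_sentence_breaks(frame_energy):
    """Return frame indices where a pause run ends and speech resumes."""
    X = np.log(np.asarray(frame_energy) + 1e-10).reshape(-1, 1)
    model = GaussianHMM(n_components=2, n_iter=50)
    model.fit(X)                                    # unsupervised fit
    states = model.predict(X)
    pause = int(np.argmin(model.means_))            # lower-mean state = pause
    return [i for i in range(1, len(states))
            if states[i] != pause and states[i - 1] == pause]
```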
Those skilled in the art should understand that combining the above three technical measures allows the speech input by the user to be segmented more accurately, further improving the user experience of the robot voice interaction service.
The inventors found through research that speech recognition and semantic understanding require a speech knowledge base. A local offline knowledge base contains relatively little corpus: if only the local offline knowledge base is used, no network is needed, but the recognition performance and semantic analysis capability are relatively weak and the speech recognition accuracy is low. If only a cloud online knowledge base is used, the corpus is larger and the recognition performance and semantic analysis capability are relatively strong, but the system depends too heavily on the network environment; the network conditions in the scenarios where robots are used are relatively complex, and a working voice robot can hardly cope with unstable network signals and high network latency. In view of this, the inventors provide a combined offline/online way of recognizing and understanding speech.
Other embodiments of the voice interaction method of the present disclosure are described below with reference to FIG. 2, to explain how the present disclosure enables the robot to provide users with voice interaction services in a combined offline/online manner.
FIG. 2 shows a schematic flowchart of a voice interaction method according to other embodiments of the present disclosure. As shown in FIG. 2, this embodiment includes steps S201 to S205.
In step S201, the robot's current network connection state is determined.
For example, a ping message can be sent to a specific network port. If a reply message is received, the current network connection state is online, and step S202 is performed; if no reply message is received, the current network state is offline, and step S204 is performed.
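A minimal sketch of the online/offline check follows, using TCP reachability of a well-known host and port as a stand-in for the ping message described above; the host and port are assumptions.

```python
# Minimal sketch of the online/offline check. TCP reachability of a
# well-known host/port stands in for the text's "ping to a specific port";
# host and port are assumptions.
import socket

def is_online(host="8.8.8.8", port=53, timeout=2.0):
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Step S202 (online path) or step S204 (offline path) would branch on this:
# path = "cloud+local" if is_online() else "local-only"
```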
In step S202, the speech input by the user is sent to the cloud server for speech recognition, semantic understanding, and speech synthesis, and the first output speech fed back by the cloud is received; at the same time, the speech input by the user is sent to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech.
In step S203, whichever of the first output speech and the second output speech is obtained first is played to the user.
If the first output speech is received before the second output speech is obtained, the first output speech is played to the user; if the second output speech is obtained before the first output speech is received, the second output speech is played to the user.
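A minimal sketch of racing the two pipelines is given below. The cloud_pipeline, local_pipeline, and play callables are placeholders for the recognition/understanding/synthesis chains and the robot's audio output; they are assumptions, not the patent's interfaces.

```python
# Minimal sketch of racing the cloud pipeline against the local pipeline and
# playing whichever output speech is ready first. The pipeline and play
# callables are placeholder assumptions.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def respond(user_audio, cloud_pipeline, local_pipeline, play):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(cloud_pipeline, user_audio),   # first output speech
            pool.submit(local_pipeline, user_audio),   # second output speech
        ]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        play(done.pop().result())   # play whichever finished first
```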
In step S204, the speech input by the user is sent to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech.
In step S205, the second output speech is played to the user.
The above embodiments provide a combined offline/online way of recognizing and understanding speech. This embodiment enables the voice robot to provide voice interaction services when network conditions are poor, and to provide more accurate voice interaction services when network conditions are good, further improving the user experience of the robot voice interaction service.
Still other embodiments of the voice interaction method of the present disclosure are described below with reference to FIG. 3, to explain how the present disclosure implements the robot's face wake-up function.
FIG. 3 shows a schematic flowchart of a voice interaction method according to still other embodiments of the present disclosure. As shown in FIG. 3, this embodiment includes steps S302 to S304.
In step S302, in the sleep state of the robot, a camera is used to recognize human faces in real time.
If a face is recognized, step S304 is performed; if no face is recognized, the process returns to step S302.
In step S304, the robot is awakened from the sleep state to the working state.
For example, when the camera recognizes a face, the system's wake-up program is called to wake the robot from the sleep state to the working state.
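A minimal sketch of the camera-based wake-up loop follows, using OpenCV's bundled Haar cascade for face detection; wake_robot() is a stub standing in for the system wake-up program.

```python
# Minimal sketch of camera-based wake-up with OpenCV's bundled Haar cascade.
# wake_robot() is a stub standing in for the system wake-up program.
import cv2

def wake_robot():
    print("waking robot to working state")   # placeholder for the real call

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
camera = cv2.VideoCapture(0)

while True:                                  # loop runs during the sleep state
    ok, frame = camera.read()
    if not ok:
        continue
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:                       # a face was recognized
        wake_robot()
        break

camera.release()
```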
In the above embodiments, the face wake-up function of the voice robot is realized through the cooperation of the camera and the system wake-up program: the user does not need to speak a wake word to wake the voice robot, and only needs to approach the camera to start voice interaction with it.
Some embodiments of the voice interactive robot of the present disclosure are described below with reference to FIG. 4, to illustrate the hardware architecture of the voice interactive robot of the present disclosure.
FIG. 4 shows a schematic structural diagram of a voice interactive robot according to some embodiments of the present disclosure. As shown in FIG. 4, the voice interactive robot 40 in this embodiment includes a microphone 401, a front-end processing chip 402, a host computer 403, and a loudspeaker 404. The front-end processing chip 402 is configured to: process speech input by a user with a deep learning neural network to identify the language of the speech; and provide the user with a robot voice interaction service corresponding to the language. The deep learning neural network, having been trained with corpora labeled with different languages, can identify the language of the speech input by the user.
An implementation of the front-end processing chip 402 is described below with reference to FIG. 5. FIG. 5 shows a schematic structural diagram of the front-end processing chip. As shown in FIG. 5, the software program in the front-end processing chip can be developed on the Android platform, and specifically includes a voice recognition module 5021, a voice proxy module 5022, a voice listening module 5023, a voice service module 5024, a control service module 5025, and a connection service module 5026. The functions of the modules are as follows:
the voice recognition module 5021 is used to perform language identification on the received speech and to invoke different voice interaction services according to the identified language category;
the voice proxy module 5022 is used to send voice service messages to the voice service module 5024;
the voice listening module 5023 is used to obtain the fed-back voice service messages from the voice service module 5024;
the voice service module 5024 is used to implement basic voice algorithm services, including speech recognition, semantic understanding, and speech synthesis;
the control service module 5025 is used to realize the control services of the voice robot according to voice instructions;
the connection service module 5026 is used to implement the connection control services between the voice robot and external devices according to voice instructions.
As can be seen from FIG. 5, a single set of interfaces between the working modules in the front-end processing chip suffices to implement the language identification function and the voice service invocation function, so the entire voice robot needs only one set of software and hardware to provide multilingual voice interaction services.
In some embodiments, the host computer 403 is configured to: determine the robot's current network connection state; if the current network connection state is online, send the speech input by the user to a cloud server for speech recognition, semantic understanding, and speech synthesis, and receive the first output speech fed back from the cloud; at the same time, send the speech input by the user to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; if the first output speech is received before the second output speech is obtained, play the first output speech to the user; and if the second output speech is obtained before the first output speech is received, play the second output speech to the user.
In some embodiments, the host computer 403 is further configured to: if the current network state is offline, send the speech input by the user to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; and play the second output speech to the user. Those skilled in the art should understand that the local knowledge base may also be located in the host computer 403.
In some embodiments, the voice interactive robot 40 further includes a camera 405 configured to: recognize human faces in real time while the robot is in the sleep state; and, when a face is recognized, wake the robot from the sleep state to the working state.
The host computer 403 can not only communicate with the front-end processing chip 402 through a serial port, but also connect to the microphone 401, the loudspeaker 404, the camera 405, and a detection sensor 406. The host computer 403 has the following voice application functions:
(1) after acquiring the speech input by the user from the microphone 401, it forwards the speech to the front-end processing chip 402 for voice processing; if the speech input by the user is a motion control instruction, the host computer 403 can perform motion control on the voice robot 40; if the speech input by the user is a state control instruction, the host computer 403 can adjust the working state of the voice robot 40, such as voice service initialization, wake-up/sleep, starting/stopping voice interaction, starting/stopping recording, and so on;
(2) it acquires images from the camera 405, recognizes the images with a face recognition algorithm, and wakes the voice robot when a face is recognized;
(3) it receives detection results from the detection sensor 406 and broadcasts the detection results through the loudspeaker 404.
In some embodiments, the voice interactive robot 40 further includes a terminal device 407. The terminal device 407 may be, for example, a tablet computer, and may communicate with the front-end processing chip 402 through the TCP/IP protocol, so that the user can invoke the voice interaction service of the voice robot through the terminal device.
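A minimal sketch of such a terminal-side call follows. The address, port, and newline-delimited text protocol are assumptions; the patent only states that TCP/IP is used.

```python
# Minimal sketch of a terminal device invoking the robot's voice service over
# TCP/IP. The address, port, and line-based message format are assumptions.
import socket

def send_command(text, host="192.168.1.10", port=9000):
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall((text + "\n").encode("utf-8"))
        return sock.recv(4096).decode("utf-8")   # robot's reply, if any
```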
In some embodiments, the microphone 401 is a microphone array. The design criteria for the microphone 401 are as follows.
(1) The multiple mic heads of the microphone array are located on the same circumference in the same horizontal plane, and the circumferential distances between adjacent mic heads are equal, so that the direct sound of a human voice has an equal chance of reaching each mic head in the array.
(2) The normal direction of the microphone board of the array coincides with the front-facing direction of the voice robot.
(3) There is no obstruction outside the microphone or between the microphone and the sound source, and the path from the sound source to the microphone is as short and as wide as possible.
(4) The microphone array is covered with a silicone sleeve, and the silicone sleeve is fixedly connected to the housing of the voice robot.
For example, the silicone sleeve damps and seals the microphone, and glue fixes the silicone sleeve to the robot housing. This keeps the microphone away from interference and vibration (including loudspeaker vibration, vibration from rotating structures, etc.).
(5) The loudspeaker and the microphone are placed in different cavities of the voice robot. For example, the microphone is placed in the head of the voice robot and wrapped in soundproof cotton, while the loudspeaker is placed in the belly of the voice robot so that the loudspeaker's sound cannot leak through the structure into the microphone.
In the above embodiments, the microphone arrangement gives the voice robot a certain noise resistance and a long-range, multi-directional speech recognition capability, making it suitable for interacting with people at a distance in noisy workplaces such as airport interiors, exhibition venue entrances, and entry/exit ports.
In some embodiments, the microphone 401 and the front-end processing chip 402 may undergo device authentication for the relevant voice interaction platforms.
In some embodiments, the voice robot 40 further includes a router 408. The router 408 can connect to both a cellular network and a WiFi network, and can automatically switch between the cellular network and the WiFi network to ensure network stability and low latency. The host computer 403 can connect to the cloud servers of different voice interaction platforms through the router 408.
The inventors believe that, once combined offline/online speech recognition and understanding is in place, a more complete speech knowledge base can be built to improve the quality of the voice interaction service. The knowledge base includes a general knowledge base and a professional knowledge base, and the professional knowledge base is divided into a question-and-answer library and a skill library. The general knowledge base covers fields that users use daily, such as greetings, everyday queries, and encyclopedic knowledge; the professional knowledge base includes industry domain knowledge and Q&A, such as safety and security inspection. The question-and-answer library handles question-and-answer business requirements, while the skill library handles multi-round interactions, which may involve business logic, data, and other processing needs. When the voice robot handles safety and security inspection services, it can assist security personnel by explaining legal terms, publicizing policies and regulations, guiding consulting services, organizing and analyzing on-site user-robot interactions, and studying safety and security regulations.
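A minimal sketch of routing a query through this layered knowledge base follows: professional and general Q&A entries are consulted first, and unmatched queries fall through to a multi-round skill handler. The entries and the skill handler interface are illustrative assumptions.

```python
# Minimal sketch of the layered knowledge-base lookup: Q&A libraries first,
# then the skill library for multi-round business logic. Contents and the
# skill interface (matches/step) are illustrative assumptions.
GENERAL_QA = {"你好": "您好，很高兴为您服务。"}          # greetings, daily queries
PROFESSIONAL_QA = {"液体能带上飞机吗": "超过100ml的液体不能随身携带。"}

def handle_query(text, skills):
    for qa in (PROFESSIONAL_QA, GENERAL_QA):     # question-and-answer lookup
        if text in qa:
            return qa[text]
    for skill in skills:                         # skill library: multi-round
        if skill.matches(text):                  # business logic, data, etc.
            return skill.step(text)
    return "抱歉，我还不了解这个问题。"
```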
Experiments show that the voice robot 40 can achieve the following indicators when the environmental noise is below 60 dB:

Item: Indicator
Recognition rate at 1 m: >92%
Recognition rate at 3 m: >90%
Recognition rate at 5 m: >85%
Sound pickup angle: 360 degrees
Wake-up success rate: 85%, with response within 1 s
Answer accuracy: >80% for questions within the knowledge base
External network rate: uplink speed >3 Mbps, latency <200 ms
FIG. 6 shows a schematic structural diagram of a voice interaction device according to some embodiments of the present disclosure. As shown in FIG. 6, the voice interaction device 60 of this embodiment includes a memory 610 and a processor 620 coupled to the memory 610; the processor 620 is configured to execute the voice interaction method of any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, and so on. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
The voice interaction device 60 may further include an input/output interface 630, a network interface 640, a storage interface 650, and so on. These interfaces 630, 640, and 650, the memory 610, and the processor 620 may be connected via a bus 660, for example. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
The present disclosure also includes a computer-readable storage medium on which computer instructions are stored; when executed by a processor, the instructions implement the voice interaction method of any of the foregoing embodiments.
Those skilled in the art should understand that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (20)

  1. A voice interaction method, comprising:
    processing speech input by a user with a deep learning neural network to identify the language of the speech; and
    providing the user with a robot voice interaction service corresponding to the language.
  2. The voice interaction method according to claim 1, further comprising:
    training the deep learning neural network with corpora labeled with different languages, so that the deep learning neural network can identify the language of the speech input by the user.
  3. The voice interaction method according to claim 1, further comprising:
    determining the robot's current network connection state;
    if the current network connection state is online, sending the speech input by the user to a cloud server for speech recognition, semantic understanding, and speech synthesis, and receiving a first output speech fed back from the cloud; at the same time, sending the speech input by the user to a local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain a second output speech fed back locally; and
    if the first output speech is received before the second output speech is obtained, playing the first output speech to the user; if the second output speech is obtained before the first output speech is received, playing the second output speech to the user.
  4. The voice interaction method according to claim 3, further comprising:
    if the current network state is offline, sending the speech input by the user to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; and
    playing the second output speech to the user.
  5. The voice interaction method according to claim 3 or 4, wherein, during speech recognition, the speech input by the user is segmented into sentences according to the time intervals between the words in the speech and the speech energy.
  6. The voice interaction method according to claim 3 or 4, wherein, during speech recognition, a hidden Markov model is used to segment the speech input by the user into sentences.
  7. The voice interaction method according to claim 1, further comprising:
    in the sleep state of the robot, using a camera to recognize human faces in real time; and
    in response to the camera recognizing a face, waking the robot from the sleep state to the working state.
  8. A voice interactive robot, comprising a front-end processing chip configured to:
    process speech input by a user with a deep learning neural network to identify the language of the speech; and
    provide the user with a robot voice interaction service corresponding to the language.
  9. The voice interactive robot according to claim 8, wherein the deep learning neural network, having been trained with corpora labeled with different languages, can identify the language of the speech input by the user.
  10. The voice interactive robot according to claim 8, further comprising a host computer configured to:
    determine the robot's current network connection state;
    if the current network connection state is online, send the speech input by the user to a cloud server for speech recognition, semantic understanding, and speech synthesis, and receive a first output speech fed back from the cloud; at the same time, send the speech input by the user to a local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain a second output speech fed back locally; and
    if the first output speech is received before the second output speech is obtained, play the first output speech to the user; if the second output speech is obtained before the first output speech is received, play the second output speech to the user.
  11. The voice interactive robot according to claim 10, wherein the host computer is further configured to:
    if the current network state is offline, send the speech input by the user to the local knowledge base for speech recognition, semantic understanding, and speech synthesis to obtain the locally fed-back second output speech; and
    play the second output speech to the user.
  12. The voice interactive robot according to claim 10 or 11, wherein the host computer is further configured to segment the speech input by the user into sentences according to the time intervals between the words in the speech and the speech energy.
  13. The voice interactive robot according to claim 10 or 11, wherein the host computer is further configured to use a hidden Markov model to segment the speech input by the user into sentences.
  14. The voice interactive robot according to claim 8, further comprising a camera configured to:
    recognize human faces in real time while the robot is in the sleep state; and
    when a face is recognized, wake the robot from the sleep state to the working state.
  15. The voice interactive robot according to claim 8, further comprising a microphone array, wherein the multiple mic heads of the microphone array are located on the same circumference in the same horizontal plane, and the circumferential distances between adjacent mic heads are equal.
  16. The voice interactive robot according to claim 15, wherein the microphone array is covered with a silicone sleeve, and the silicone sleeve is fixedly connected to the housing of the voice robot.
  17. The voice interactive robot according to claim 15, further comprising a loudspeaker and multiple cavities, wherein the loudspeaker and the microphone array are arranged in different cavities of the voice interactive robot.
  18. The voice interactive robot according to claim 15, further comprising soundproof cotton wrapped around the microphone array.
  19. A voice interaction device, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to execute the voice interaction method according to any one of claims 1 to 7 based on instructions stored in the memory.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions that, when executed by a processor, implement the voice interaction method according to any one of claims 1 to 7.
PCT/CN2019/124844 2018-12-24 2019-12-12 Voice interaction method and device, robot, and computer-readable storage medium WO2020135067A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811581336.6 2018-12-24
CN201811581336.6A CN111429924A (zh) Voice interaction method and device, robot, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2020135067A1 true WO2020135067A1 (zh) 2020-07-02

Family

ID=71128374

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124844 WO2020135067A1 (zh) Voice interaction method and device, robot, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111429924A (zh)
WO (1) WO2020135067A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986671A (zh) * 2020-08-28 2020-11-24 北京海益同展信息科技有限公司 Service robot and voice power-on/off method and device therefor
CN112562670A (zh) * 2020-12-03 2021-03-26 深圳市欧瑞博科技股份有限公司 Intelligent speech recognition method, intelligent speech recognition device, and intelligent equipment
CN113505874A (zh) * 2021-06-07 2021-10-15 广发银行股份有限公司 Multi-model intelligent robot system and construction method

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786044A (zh) * 2020-12-30 2021-05-11 乐聚(深圳)机器人技术有限公司 Voice control method and device, main controller, robot, and storage medium
CN117727303A (zh) * 2024-02-08 2024-03-19 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103400577A (zh) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model establishment method and device for multilingual speech recognition
CN107424607A (zh) * 2017-07-04 2017-12-01 珠海格力电器股份有限公司 Voice control mode switching method and device, and equipment having the device
US20180136615A1 (en) * 2016-11-15 2018-05-17 Roborus Co., Ltd. Concierge robot system, concierge service method, and concierge robot
CN108098767A (zh) * 2016-11-25 2018-06-01 北京智能管家科技有限公司 Robot wake-up method and device
CN108717853A (zh) * 2018-05-09 2018-10-30 深圳艾比仿生机器人科技有限公司 Human-machine voice interaction method, device, and storage medium
CN109065041A (zh) * 2018-08-09 2018-12-21 上海常仁信息科技有限公司 Robot-based voice interaction system and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2804113A3 (en) * 2013-05-13 2014-12-24 Facebook, Inc. Hybrid, offline/online speech translation system
CN104965426A (zh) * 2015-06-24 2015-10-07 百度在线网络技术(北京)有限公司 Artificial-intelligence-based intelligent robot control system, method, and device
CN105336324B (zh) * 2015-11-17 2018-04-03 百度在线网络技术(北京)有限公司 Language identification method and device
CN106558313A (zh) * 2016-11-16 2017-04-05 北京云知声信息技术有限公司 Speech recognition method and device
CN106847274B (zh) * 2016-12-26 2020-11-17 北京光年无限科技有限公司 Human-machine interaction method and device for an intelligent robot
CN108335693B (zh) * 2017-01-17 2022-02-25 腾讯科技(深圳)有限公司 Language identification method and language identification equipment
CN108172225A (zh) * 2017-12-27 2018-06-15 浪潮金融信息技术有限公司 Voice interaction method and robot, computer-readable storage medium, and terminal

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103400577A (zh) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model establishment method and device for multilingual speech recognition
US20180136615A1 (en) * 2016-11-15 2018-05-17 Roborus Co., Ltd. Concierge robot system, concierge service method, and concierge robot
CN108098767A (zh) * 2016-11-25 2018-06-01 北京智能管家科技有限公司 Robot wake-up method and device
CN107424607A (zh) * 2017-07-04 2017-12-01 珠海格力电器股份有限公司 Voice control mode switching method and device, and equipment having the device
CN108717853A (zh) * 2018-05-09 2018-10-30 深圳艾比仿生机器人科技有限公司 Human-machine voice interaction method, device, and storage medium
CN109065041A (zh) * 2018-08-09 2018-12-21 上海常仁信息科技有限公司 Robot-based voice interaction system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986671A (zh) * 2020-08-28 2020-11-24 北京海益同展信息科技有限公司 Service robot and voice power-on/off method and device therefor
CN111986671B (zh) * 2020-08-28 2024-04-05 京东科技信息技术有限公司 Service robot and voice power-on/off method and device therefor
CN112562670A (zh) * 2020-12-03 2021-03-26 深圳市欧瑞博科技股份有限公司 Intelligent speech recognition method, intelligent speech recognition device, and intelligent equipment
CN113505874A (zh) * 2021-06-07 2021-10-15 广发银行股份有限公司 Multi-model intelligent robot system and construction method

Also Published As

Publication number Publication date
CN111429924A (zh) 2020-07-17

Similar Documents

Publication Publication Date Title
WO2020135067A1 (zh) Voice interaction method and device, robot, and computer-readable storage medium
JP7248751B2 (ja) Hotword recognition speech synthesis
US9805718B2 (en) Clarifying natural language input using targeted questions
CN108133707B Content sharing method and system
US10860289B2 (en) Flexible voice-based information retrieval system for virtual assistant
CN102903362A Integrated local and cloud-based speech recognition
WO2017084185A1 Intelligent terminal control method and system based on semantic analysis, and intelligent terminal
KR20200095719A Electronic device and control method therefor
JP7516571B2 Automatic hotword threshold tuning
EP3776171A1 (en) Non-disruptive nui command
KR102419374B1 Electronic device for processing user utterances and control method therefor
Yamamoto et al. Voice interaction system with 3D-CG virtual agent for stand-alone smartphones
Lewis et al. Implementing HARMS-based indistinguishability in ubiquitous robot organizations
KR20210042520A Electronic device and control method therefor
Inupakutika et al. Integration of NLP and Speech-to-text Applications with Chatbots
CN114064943A Conference management method and device, storage medium, and electronic equipment
Lin et al. Building a speech recognition system with privacy identification information based on Google Voice for social robots
CN106980640A Interaction method, device, and computer-readable storage medium for photos
Gentile et al. Privacy-oriented architecture for building automatic voice interaction systems in smart environments in disaster recovery scenarios
US20230297321A1 (en) Handling of noise and interruption during online meetings
Rajakumar et al. IoT based voice assistant using Raspberry Pi and natural language processing
KR102302029B1 Artificial-intelligence-based composite input recognition system
WO2022078189A1 Control method and device supporting dynamic intent, and storage medium
KR20060091329A Interactive system and method for controlling an interactive system
Panek et al. Challenges in adopting speech control for assistive robots

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19903397

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19903397

Country of ref document: EP

Kind code of ref document: A1