WO2021135604A1 - Voice control method, apparatus, server, terminal device, and storage medium - Google Patents

Voice control method, apparatus, server, terminal device, and storage medium Download PDF

Info

Publication number
WO2021135604A1
WO2021135604A1 (PCT/CN2020/125215)
Authority
WO
WIPO (PCT)
Prior art keywords
terminal
semantic
instruction
server
voice
Prior art date
Application number
PCT/CN2020/125215
Other languages
English (en)
French (fr)
Inventor
何雄辉
杨威
周剑辉
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to US17/789,873 (published as US20230053765A1)
Priority to EP20910466.0A (published as EP4064713A4)
Publication of WO2021135604A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/4104 Peripherals receiving signals from specially adapted client devices
    • H04N21/4131 Peripherals receiving signals from specially adapted client devices home appliance, e.g. lighting, air conditioning system, metering devices
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42204 User interfaces specially adapted for controlling a client device through a remote control device; Remote control devices therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/436 Interfacing a local distribution network, e.g. communicating with another STB or one or more peripheral devices inside the home
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60 Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/65 Transmission of management data between client and server
    • H04N21/654 Transmission by server directed to the client
    • H04N21/6547 Transmission by server directed to the client comprising parameters, e.g. for client setup
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone

Definitions

  • This application belongs to the field of terminal technology, and in particular relates to voice control methods, devices, servers, terminal equipment, and storage media.
  • A voice assistant is an intelligent application that can be mounted on smart terminal devices such as mobile phones, TVs, tablets, computers, or speakers; it receives audio signals from the user, performs voice recognition, and makes judgments or responses. The dialogue process in which the voice assistant is awakened, recognizes speech, and responds requires cloud support from a voice database, and a dialogue manager (DM) can run as a cloud service responsible for maintaining and updating the flow and state of the dialogue.
  • The input is an utterance together with its related context; after the dialogue is understood, the system response is output.
  • When multiple devices engage in a joint dialogue, dialogue management performs multiple stages of repeated processing of the user's task instructions, which prolongs the system's response time and increases the dialogue delay.
  • The embodiments of this application provide a voice control method, apparatus, server, terminal device, and storage medium, which can solve the problem that, when multiple devices are engaged in a joint dialogue, dialogue management repeats multiple processing stages for the user's task instructions, prolonging system response time and increasing dialogue delay.
  • an embodiment of the present application provides a voice control method, including:
  • receiving a voice command recognition result sent by a first terminal; performing semantic processing on the voice command recognition result to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction; sending the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal; and receiving an execution command fed back after the second terminal recognizes the second semantic instruction, and sending, according to the execution command, the business logic corresponding to the second semantic instruction to the second terminal.
  • In this method the server is the execution subject. The server receives the voice command recognition result sent by the first terminal and semantically processes it to obtain the operation information to be performed, which it sends to the first terminal. The first terminal executes the first semantic instruction in the operation information and sends the second semantic instruction in the operation information to the second terminal. After the second terminal recognizes the second semantic instruction, the server can directly receive the execution command fed back by the second terminal, call the business logic corresponding to the second semantic instruction according to the execution command, and send that business logic to the second terminal. This eliminates repeated processing of the second semantic instruction, shortens the dialogue delay, and improves the response time of the dialogue system.
  • the performing semantic processing on the voice command recognition result to obtain operation information includes:
  • performing semantic recognition on the voice command recognition result to obtain the target intent and the target sub-intent of the voice command recognition result; pre-verifying the target sub-intent according to the target intent to obtain the response logic of the target intent and the trial run result of the target sub-intent; taking the response logic as the first semantic instruction of the operation information; and taking the target sub-intent and the trial run result as the second semantic instruction of the operation information.
  • The voice command recognition result, that is, the text information corresponding to the voice command input by the user, undergoes semantic recognition to obtain its target intent and target sub-intent. By pre-verifying the target sub-intent according to the target intent, the response logic of the target intent and the trial run result of the pre-verification are obtained. The response logic is sent to the first terminal as the first semantic instruction, and at the same time the target sub-intent and the trial run result are sent to the first terminal as the second semantic instruction. By executing the first semantic instruction, the first terminal sends the second semantic instruction to the second terminal, providing an information basis for the dialogue system and improving its response speed; a sketch of such a semantic representation is given below.
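  • For illustration only, the following is a minimal Python sketch of how such operation information might be represented; the field names (skillId, intentId, slots, and so on) are hypothetical and are not prescribed by this application:

        # Hypothetical semantic representation of the operation information fed back
        # by the server for "Use the TV to play the movie Nezha".
        operation_info = {
            "first_semantic_instruction": {           # response logic for the first terminal
                "skillId": "device.switch",           # hypothetical skill identifier
                "intentId": "switch_to_device",       # target intent: "switch"
                "slots": [{"name": "targetDevice", "type": "string", "value": "TV"}],
            },
            "second_semantic_instruction": {          # forwarded to the second terminal
                "utterance": "play the movie Nezha",  # words of the target sub-intent
                "trial_run_result": {                 # pre-verification result
                    "skillId": "video.play",
                    "intentId": "play_movie",
                    "slots": [{"name": "title", "type": "string", "value": "Nezha"}],
                },
            },
        }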
  • the sending the first semantic instruction and the second semantic instruction to the first terminal includes:
  • the first semantic instruction and the second semantic instruction are sent to the first terminal in the form of semantic representation.
  • the sending service logic corresponding to the second semantic instruction to the second terminal according to the execution command includes:
  • the trial run result is parsed according to the execution command; the business logic is called according to the parsed trial run result, and the business logic is sent to the second terminal in the form of a semantic representation.
  • After receiving the execution command sent by the second terminal, the server can execute it directly: the trial run result is parsed and the corresponding business logic is called directly according to the parsing result, without repeating the process of semantically processing the target sub-intent and selecting a corresponding execution mode. This shortens the response time of the dialogue system.
  • an embodiment of the present application provides a voice control method, including:
  • receiving a voice command input by a user and performing voice recognition on the voice command to obtain a voice command recognition result; sending the voice command recognition result to a server; receiving operation information fed back after the server performs semantic processing on the voice command recognition result, where the operation information includes a first semantic instruction and a second semantic instruction; and executing the first semantic instruction and sending the second semantic instruction to a second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and to receive the business logic corresponding to the second semantic instruction fed back by the server.
  • In this method the first terminal is the execution subject. After the first terminal performs voice recognition on the voice command input by the user, it sends the resulting voice command recognition result to the server and receives the operation information fed back after the server semantically processes that result. It then executes the first semantic instruction in the operation information and sends the second semantic instruction to the second terminal, so that the second terminal directly calls the execution interface of the server according to the second semantic instruction, sends an execution command to the server, and receives the business logic corresponding to the second semantic instruction fed back by the server. This provides an information basis for the dialogue system to further respond to the second semantic instruction, saves its processing flow, and can thus shorten the response time of the dialogue system.
  • the receiving operation information fed back after the server performs semantic processing on the voice command recognition result includes:
  • the first semantic instruction is the response logic fed back by the server for the target intent in the voice command recognition result;
  • the second semantic instruction is the target sub-intent in the voice command recognition result together with the trial run result fed back by the server for that target sub-intent.
  • executing the first semantic instruction and sending the second semantic instruction to the second terminal includes:
  • the response logic fed back by the server is executed, and the target sub-intent and the trial run result fed back by the server are sent to the second terminal.
  • The trial run result of the target sub-intent is transmitted to the second terminal as intermediate data, providing a data basis for the second terminal. By executing the response logic fed back by the server and sending the target sub-intent together with the trial run result to the second terminal, the second terminal can directly call the execution interface of the server according to the trial run result, without uploading the target sub-intent to the server for semantic processing and execution judgment; this saves the data processing flow and shortens the response time of the dialogue system.
  • an embodiment of the present application provides a voice control method, including:
  • receiving a second semantic instruction sent when a first terminal executes a first semantic instruction, where the first semantic instruction and the second semantic instruction are the operation information fed back by the server according to a voice command recognition result after the first terminal sends that result to the server; recognizing the second semantic instruction to obtain a recognition result of the second semantic instruction; sending an execution command to the server according to the recognition result; and receiving the business logic corresponding to the second semantic instruction fed back by the server according to the execution command, and executing the business logic.
  • In this method the second terminal is the execution subject: it recognizes the received second semantic instruction and directly calls the execution interface of the server according to the recognition result, instructing the server to feed back the business logic corresponding to the second semantic instruction. There is no need for the server to perform semantic processing on the second semantic instruction again, which saves the data processing flow, improves the response speed of the second terminal, and shortens the time delay of the dialogue system.
  • the operation information includes the response logic fed back by the server for the target intent in the voice command recognition result, and the trial run result fed back by the server for the target sub-intent in the voice command recognition result;
  • the receiving of the second semantic instruction sent when the first terminal executes the first semantic instruction includes: receiving the target sub-intent and the trial run result sent when the first terminal executes the response logic.
  • the second semantic instruction includes a trial run result obtained by the server pre-verifying the target sub-intent in the voice instruction recognition result;
  • the recognizing of the second semantic instruction to obtain the recognition result of the second semantic instruction includes: recognizing the second semantic instruction to obtain the trial run result of the target sub-intent.
  • the sending an execution command to the server according to the recognition result includes:
  • an execution command corresponding to the trial run result is sent to the server.
  • the trial run result includes a skill identifier, an intent identifier, and a slot list, where each slot includes a slot name, a slot type, and a slot value.
  • The server, the first terminal, and the second terminal can be connected to one another in a networked state, with data transmission between them realized through a data transmission protocol; alternatively, the three devices can each be connected to a cloud-side service, with data interaction realized through that service.
  • The server, the first terminal, and the second terminal may be connected to one another through wireless WiFi or a cellular network to form the device circle of the dialogue system, confirming each other's addresses and interfaces and realizing mutual control through voice commands.
  • the server sends the first semantic instruction in the operation information to the first terminal, and sends the second semantic instruction directly to the second terminal.
  • an embodiment of the present application provides a voice control device, including:
  • the first receiving module is configured to receive the voice command recognition result sent by the first terminal
  • a semantic processing module configured to perform semantic processing on the voice instruction recognition result to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction;
  • the first sending module is configured to send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to the second terminal;
  • the command execution module is configured to receive the execution command fed back after the second terminal recognizes the second semantic instruction, and send the business logic corresponding to the second semantic instruction to the second terminal according to the execution command .
  • the semantic processing module includes:
  • the semantic recognition sub-module is used to recognize the voice command recognition result and obtain the target intent and target sub-intent of the voice command recognition result;
  • the task execution sub-module is used to pre-verify the target sub-intent according to the target intent to obtain the response logic of the target intent and the trial run result of the target sub-intent, to use the response logic as the first semantic instruction of the operation information, and to use the target sub-intent and the trial run result as the second semantic instruction of the operation information.
  • the first sending module is further configured to send the first semantic instruction and the second semantic instruction to the first terminal in a form of semantic representation.
  • the first sending module includes:
  • the first sub-module is used to parse the trial run result according to the execution command;
  • the second sub-module is used to call the business logic according to the parsed trial run result, and to send the business logic to the second terminal in the form of a semantic representation.
  • an embodiment of the present application provides a voice control device, including:
  • the voice recognition module is used to receive the voice instruction input by the user and perform voice recognition on the voice instruction to obtain the voice instruction recognition result;
  • the second sending module is configured to send the voice command recognition result to the server
  • the second receiving module is configured to receive operation information fed back after the server performs semantic processing on the voice command recognition result, where the operation information includes a first semantic instruction and a second semantic instruction;
  • the instruction execution module is used to execute the first semantic instruction and send the second semantic instruction to a second terminal; the second semantic instruction is used to instruct the second terminal to send an execution command to the server and receive the The service logic corresponding to the second semantic instruction fed back by the server.
  • the second receiving module is further configured to receive the response logic fed back by the server for the target intent in the voice command recognition result, and to receive the trial run result fed back by the server for the target sub-intent in the voice command recognition result.
  • the first semantic instruction is the response logic fed back by the server for the target intent in the voice command recognition result;
  • the second semantic instruction is the target sub-intent in the voice command recognition result together with the trial run result fed back by the server for that target sub-intent.
  • an embodiment of the present application provides a voice control device, including:
  • the third receiving module is configured to receive the second semantic instruction sent when the first terminal executes the first semantic instruction; the first semantic instruction and the second semantic instruction are the operation information fed back by the server according to the voice command recognition result after the first terminal sends that result to the server;
  • An instruction recognition module configured to recognize the second semantic instruction, and obtain a recognition result of the second semantic instruction
  • the third sending module is configured to send an execution command to the server according to the recognition result
  • the business execution module is configured to receive the business logic corresponding to the second semantic instruction fed back by the server according to the execution command, and execute the business logic.
  • the operation information includes the response logic fed back by the server for the target intent in the voice command recognition result, and the trial run result fed back by the server for the target sub-intent in the voice command recognition result; the third receiving module is also configured to receive the target sub-intent and the trial run result sent when the first terminal executes the response logic.
  • the second semantic instruction includes the trial run result obtained by the server pre-verifying the target sub-intent in the voice command recognition result; the instruction recognition module is also used to recognize the second semantic instruction to obtain the trial run result of the target sub-intent.
  • the third sending module is further configured to send, according to the recognition result, an execution command corresponding to the trial run result to the server.
  • an embodiment of the present application provides a server that includes a memory, a processor, a natural language understanding module, and a dialogue management module.
  • the memory is used to store a computer program, and the computer program includes instructions.
  • When the instructions are executed, the server is caused to execute the voice control method.
  • an embodiment of the present application provides a terminal device.
  • the terminal device includes a memory, a processor, and a voice assistant.
  • the memory is used to store a computer program.
  • the computer program includes instructions; when the instructions are executed, the terminal device is caused to execute the voice control method.
  • an embodiment of the present application provides a terminal device.
  • the terminal device includes a memory and a processor.
  • the memory is used to store a computer program.
  • the computer program includes instructions. When executed, the terminal device is caused to execute the voice control method.
  • an embodiment of the present application provides a computer storage medium; the computer-readable storage medium stores a computer program, and the computer program includes instructions that, when run on a terminal device, cause the terminal device to perform the voice control method.
  • an embodiment of the present application provides a computer program product containing instructions that, when the computer program product runs on a terminal device, cause the terminal device to execute the voice control method described in any one of the above aspects.
  • The beneficial effects of the embodiments of the present application are as follows. The voice control method provided by this application receives the voice command recognition result sent by the first terminal, performs semantic processing on it to obtain the operation information to be executed, and sends the operation information to the first terminal. The first terminal executes the first semantic instruction in the operation information and sends the second semantic instruction in the operation information to the second terminal. After the second terminal recognizes the second semantic instruction, the server can directly receive the execution command fed back by the second terminal, call the business logic corresponding to the second semantic instruction according to the execution command, and send the business logic to the second terminal through the execution interface. Because the second terminal can feed back the execution command directly according to the task information contained in the second semantic instruction, there is no need to perform semantic processing again on the second semantic instruction, which saves the processing flow of the second semantic instruction, shortens the dialogue delay, and improves the response time of the dialogue system.
  • FIG. 1 is a schematic diagram of a system architecture for multi-device interconnected voice control provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a system architecture for multi-device interconnected voice control provided by another embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a voice control method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice control method provided by another embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a voice control method provided by another embodiment of the present application.
  • FIG. 6 is a schematic diagram of device interaction of a voice control method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an application scenario of a voice control method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an application scenario of a voice control method provided by another embodiment of the present application.
  • FIG. 9 is a schematic diagram of an application scenario of a voice control method provided by another embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a voice control device provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a voice control device provided by another embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a voice control device provided by another embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a terminal device provided by another embodiment of the present application.
  • the term “if” can be construed as “when”, “once”, “in response to determining”, or “in response to detecting”.
  • Depending on the context, the phrase “if determined” or “if [the described condition or event] is detected” can be interpreted as “once determined”, “in response to determining”, “once [the described condition or event] is detected”, or “in response to detecting [the described condition or event]”.
  • The voice control method provided in the present application can be applied to full-scenario conversations in which multiple devices conduct cross-device joint dialogues through mutual voice control, for example interacting by voice with a mobile phone and having the mobile phone control a TV to execute the corresponding business logic.
  • a full-scenario conversation scenario formed by voice mutual control between multiple devices requires that each device in the scene has a networking function, and each device can communicate with each other in a wired or wireless manner through mutual confirmation of addresses and interfaces.
  • Alternatively, each device can be connected to a cloud-side service, and communication is realized through the cloud-side service.
  • Wireless methods include the Internet, WiFi networks, or mobile networks; mobile networks can include existing 2G (such as the Global System for Mobile Communications (GSM)), 3G (such as the Universal Mobile Telecommunications System (UMTS)), 4G (such as FDD LTE and TDD LTE), 4.5G, 5G, and so on.
  • Data transmission between devices is realized through transmission protocols such as HTTP; a minimal sketch of such a transmission is given below.
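  • As a sketch only: one way a first terminal might upload a recognition result over HTTP, assuming a hypothetical /asr/result endpoint on the dialogue management server (this application does not prescribe any particular API):

        # Minimal sketch: posting a voice command recognition result to the server
        # over HTTP; the endpoint and payload shape are assumptions for illustration.
        import json
        import urllib.request

        def send_recognition_result(server_url: str, text: str) -> dict:
            payload = json.dumps({"utterance": text}).encode("utf-8")
            req = urllib.request.Request(
                server_url + "/asr/result",            # hypothetical endpoint
                data=payload,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            with urllib.request.urlopen(req) as resp:  # server replies with operation info
                return json.load(resp)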
  • the various devices described can be mobile phones, TVs, tablets, speakers, computers, etc., and the devices can have functions such as networking and voice assistants.
  • A dialogue manager is required as a cloud service to maintain and update the flow and state of the dialogue: it takes as input the words corresponding to the voice command (the utterance) together with the relevant context and, through dialogue skill understanding, outputs the system response.
  • The dialogue manager obtains the task corresponding to the voice command according to the semantics of the input, clarifies the information the task requires, and then either docks with the business platform to complete the task, requests further input of more voice command information, or obtains from the business platform the business logic of the corresponding task, finally returning the execution result to the user.
  • DMs with different functions can be connected to different business platforms, which can be the preset business platform of the system, or a third-party platform.
  • For example, the semantics of listening to songs or e-books can be docked with platforms such as NetEase Cloud Music or Himalaya, and the semantics of watching videos can be docked with third-party platforms such as iQiyi or Bilibili.
  • FIG. 1 is a schematic diagram of a system architecture for multi-device interconnected voice control provided by an embodiment of the present application.
  • The first terminal 11 is provided with a voice assistant and can receive the audio signal input by the user through a microphone. The first terminal 11 performs voice recognition (ASR) on the received audio signal to obtain the corresponding text information and transmits it to the server 12. The server 12 may be a dialogue management server: it performs semantic recognition on the received text information through natural language understanding (NLU), obtaining the target intent and target sub-intent. Based on the semantic representation output after semantic recognition, the server performs business docking, obtains the business logic corresponding to the semantic representation, and finally returns the execution result to the first terminal 11. After receiving the execution result, the first terminal 11 sends it to the second terminal 13; alternatively, the server 12 sends the execution result directly to the second terminal 13, and the second terminal 13 recognizes the received execution result.
  • the first terminal 11 can be a mobile phone;
  • the server 12 can be a conversation management cloud service or a local physical server;
  • the second terminal 13 can be a TV. Through voice interaction with the mobile phone and dialogue management by the server, control of the TV through the mobile phone is realized. For example, the user says to the mobile phone, "Use the TV to play the movie Nezha"; the mobile phone displays "Switching to the TV for you" (having pre-verified, during the interaction with the dialogue management server, that the TV supports playback); and finally the TV displays "Play movie Nezha" and actually starts playing.
  • A multi-device interconnected voice control system can include multiple devices, and the voice control implemented can cover any type of cross-device voice command, for example instructions for controlling TV playback, for adjusting the temperature of an air conditioner, or for controlling the cooking mode of cooking tools across devices.
  • Dialogue management is responsible for controlling the flow and state of the dialogue; once the input is understood, the system response is output.
  • FIG. 2 is a schematic diagram of a system architecture for multi-device interconnected voice control provided by another embodiment of the present application. Currently, in a full-scenario conversation using mutual voice control, the first terminal 11 receives a voice command input by the user, such as "Use the TV to play the movie Nezha"; the first terminal 11 performs voice recognition on the voice command to obtain the voice command recognition result, that is, the text information corresponding to the voice command; the first terminal 11 sends the voice command recognition result to the server 12, and the server processes the recognition result in multiple parallel stages.
  • Take as an example the case where the first terminal is a mobile phone, the second terminal is a TV, and the server is a dialogue management server. The multiple parallel processing stages include skill discovery, trial run, selection, execution, and conversation continuation based on the mobile phone context, together with skill discovery, trial run, and selection based on a simulated TV context.
  • The dialogue management server performs semantic recognition on the voice command recognition result in combination with the mobile phone context, finds the multiple skills corresponding to the semantics, conducts a trial run of each skill, aggregates the trial run results, and filters out failed trial runs. The successful trial run results are ranked according to sorting rules or a ranking model (such as LambdaMART or the ranking models commonly used by search engines), and the top-ranked trial run result is selected as the single ideal skill, which is then executed based on its trial run result. Finally the conversation is continued and the execution result is returned to the client (that is, the mobile phone), as sketched below.
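  • A minimal sketch of this aggregate-filter-rank step, where rank_score stands in for a sorting rule or ranking model such as LambdaMART (the field names are illustrative):

        # Aggregate trial runs, drop failures, and pick the top-ranked success.
        def select_skill(trial_runs, rank_score):
            successes = [t for t in trial_runs if t["ok"]]   # filter out failed trial runs
            if not successes:
                return None                                  # no skill supports the task
            return max(successes, key=rank_score)            # the single "ideal" skill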
  • the dialogue management server performs semantic recognition based on the context of the mobile phone and determines that it is the "switching" skill.
  • It then needs to pre-verify whether the TV supports "play the movie Nezha": within the dialogue management server, the processes of skill discovery, trial run, and selection are performed on "play the movie Nezha" based on the simulated TV context. If a skill can be selected, the TV side supports the task; otherwise the TV does not support it, and the corresponding semantic processing result must be returned, or further confirmation sought from the user.
  • Because the dialogue management server performs the process of "skill discovery, trial run, and selection" on "play the movie Nezha" twice during semantic processing, once for the mobile phone and once for the TV, the voice interaction of the dialogue system incurs a long delay: the response time of the system is prolonged, the running load of the dialogue management server is increased, and the user experience suffers.
  • The voice control method provided by this application exploits the information interaction between the controlling devices in a full-scenario multi-device collaborative dialogue: once the switching device is identified, the pre-verified trial run result for the target device is used as intermediate data.
  • When the first terminal receives the voice command input by the user, it performs voice recognition on the command and sends the recognized voice command recognition result to the server. After receiving the recognition result, the server processes it in multiple stages, mainly comprising task recognition, task execution, and result response.
  • The operation information obtained by processing the voice command recognition result is fed back to the first terminal as the result reply.
  • The operation information includes the response logic based on the first terminal's context and the trial run result based on the simulated second terminal's context. Either the trial run result and the response logic are sent to the first terminal together, or the response logic is sent to the first terminal while the trial run result is sent directly to the second terminal.
  • When the first terminal receives both the response logic and the trial run result fed back by the server, it calls the second terminal and sends it the trial run result.
  • The second terminal directly calls the execution interface of the server according to the trial run result.
  • The second terminal sends an execution command to the server; the server docks with the service platform according to the execution command, calls the corresponding business logic, and feeds the business logic back to the second terminal, which executes it.
  • Meanwhile, the first terminal can respond to the user that the command is being switched or is being executed.
  • Alternatively, the server calls the second terminal and sends the trial run result to it directly.
  • The second terminal recognizes the trial run result and directly calls the server's execution interface to send an execution command; the server docks with the business platform according to the command, calls the corresponding business logic, and feeds it back to the second terminal, which executes it. This saves the server's repeated dialogue processing, thereby improving the response speed of the target device, shortening the response time of the dialogue system, and reducing the delay of human-machine voice interaction.
  • FIG. 3 is a schematic flowchart of a voice control method provided by an embodiment of the present application. In this embodiment the server in FIG. 1 is the execution subject; the server may be a dialogue management cloud service or a local physical server, which is not specifically limited here. The specific implementation of this method includes the following steps:
  • Step S301 Receive a voice command recognition result sent by the first terminal.
  • The server receives the voice command recognition result sent by the first terminal.
  • The voice command recognition result is obtained as follows: after the first terminal receives the voice command input by the user, it performs voice recognition on the command's audio information to obtain the text information of the voice command, and this text information serves as the voice command recognition result.
  • The first terminal can be a terminal device equipped with a voice assistant, such as a mobile phone, computer, tablet, TV, or speaker, and the user's audio information is received through the microphone of the first terminal; for example, the user says to the voice assistant of the mobile phone, "Play the movie Nezha on TV".
  • After the first terminal recognizes the voice command, it obtains the text information corresponding to the voice command and transmits it to the server via wireless WiFi or a cellular mobile network; the server then performs semantic recognition and processing.
  • The voice command may be a task-type voice control command, and the voice command recognition result may include a target intent and a target sub-intent.
  • For example, in "Use the TV to play the movie Nezha" or "Use the speakers to play a Beatles song", "use the TV" or "use the speakers" corresponds to the target intent, while "play the movie Nezha" or "play a Beatles song" can correspondingly be identified as the target sub-intent.
  • the first terminal and the server can realize networked communication through mutual confirmation of addresses and interfaces; they can also realize mutual communication through gateways or routes.
  • the information transmission between the server and the first terminal complies with the data transmission protocol, such as the HTTP protocol.
  • Step S302 Perform semantic processing on the voice command recognition result to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction.
  • the server serves as a dialogue management system in the voice interaction process, and can perform semantic recognition on the voice command recognition result through natural language understanding, and obtain a machine-recognizable semantic representation.
  • Through the semantic representation, the target intent and target sub-intent in the voice command recognition result are obtained, and after multiple parallel processing stages, the operation information responding to the voice command recognition result is obtained for the first terminal.
  • The operation information can be the execution result of the server completing the target intent in the voice command recognition result, that is, the response logic, for example the business logic called according to the voice command recognition result; it can also be a further request for the client to input more information needed to complete the target intent.
  • For example, when the server receives "play the movie Nezha on TV" sent by the mobile phone, it performs the skill discovery, trial run, and selection processes based on the configured mobile phone context and determines the "switching" skill; through semantic recognition it can confirm that the target intent is "switch" and the target sub-intent is "play the movie Nezha". Since semantic recognition shows that a switch to the target device (the TV) is needed, the server then pre-verifies whether the TV supports "play the movie Nezha", to avoid the TV displaying "not supported" or "not understood" after the switch. Simulated TV context information is configured on the server side, including the domain, target object, slot information, sequence, pronouns, and so on mentioned in the previous dialogue.
  • Pre-verifying the "play the movie Nezha" skill means performing the process of skill discovery, trial run, and selection to determine the skill. If the playback skill can be determined, the TV side supports it; the "switching" action the mobile phone needs to perform and the trial run result of the pre-verification process then generate the corresponding operation information, the conversation is connected, and a reply is sent to the mobile phone.
  • The operation information can be divided into the operation instruction the current mobile phone needs to execute and the operation instruction the target device needs to execute, namely the first semantic instruction and the second semantic instruction.
  • The first semantic instruction corresponds to the response logic of the current mobile phone and to the target intent in the voice command recognition result; the second semantic instruction is the logic to be executed by the target device and corresponds to the target sub-intent in the voice command recognition result.
  • The dialogue management server can also set multiple slots and conduct multiple rounds of voice interaction with the client to clarify the target intent or target sub-intent. For example, after receiving the "play on TV" utterance sent by the mobile phone, the server can return the question "What should be played?" and then receive "the movie Nezha". Through multiple rounds of dialogue, the task of the target utterance is clarified so that the dialogue system can give an accurate answer or response; a minimal slot-filling sketch follows.
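  • For illustration, a minimal slot-filling sketch, assuming a hypothetical mapping from required slot names to follow-up prompts:

        # Ask follow-up questions until every required slot of the intent is filled.
        def next_response(required_slots: dict, filled: dict) -> str:
            for name, prompt in required_slots.items():
                if name not in filled:
                    return prompt                    # ask the user for this slot
            return "EXECUTE"                         # all slots filled: run the task

        required = {"content": "What should be played?"}
        print(next_response(required, {}))                    # -> "What should be played?"
        print(next_response(required, {"content": "Nezha"}))  # -> "EXECUTE"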
  • the performing semantic processing on the voice command recognition result to obtain operation information includes:
  • pre-verifying the target sub-intent according to the target intent to obtain the response logic of the target intent and the trial run result of the target sub-intent;
  • the server performs semantic processing on the voice command recognition result, recognizes the semantic information in the text information, and obtains the target intention and target sub-intent of the voice command recognition result.
  • The target intent may be the operation that the first terminal needs to perform, determined according to the voice command recognition result, and the target sub-intent may be the operation that the target device needs to perform when the voice command recognition result controls the target device across devices.
  • the server determines the target intention of the voice command recognition result based on the mobile phone context, such as the determined "switch" intention; the server pre-verifies and trial runs the target sub-intent to determine whether the target terminal supports the execution of the target sub-intent.
  • The response logic and the trial run result may include a skill identifier, an intent identifier, and slot information.
  • The skill identifier identifies a skill, a skill being a collection of capabilities that can support several intents; for example, a weather skill may support the intents of checking the weather and checking PM2.5. The intent identifier identifies the unique intent within the skill. The slot information is the parameter list required for executing the intent.
  • The number of parameters in the slot information can be any number: zero, one, or more.
  • Slot information includes a slot name, a slot type, and a slot value; the slot name determines the parameter name of the slot, the slot type determines the type of the slot parameter (such as date, number, or character string), and the slot value is the parameter value. This structure is sketched below.
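  • The structure just described might be sketched as follows; the type names and example values are illustrative only:

        # Trial run result: a skill identifier, an intent identifier, and a slot
        # list; each slot has a name, a type (e.g. date, number, string), and a value.
        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class Slot:
            name: str    # parameter name, e.g. "title"
            type: str    # parameter type, e.g. "string"
            value: str   # parameter value, e.g. "Nezha"

        @dataclass
        class TrialRunResult:
            skill_id: str                 # identifies the skill, e.g. "video.play"
            intent_id: str                # the unique intent within the skill
            slots: List[Slot] = field(default_factory=list)  # may be empty

        result = TrialRunResult("video.play", "play_movie",
                                [Slot("title", "string", "Nezha")])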
  • The server replies with the response logic and the trial run result as the result: it uses the response logic as the first semantic instruction of the operation information, and uses the words corresponding to the target sub-intent together with the trial run result as the second semantic instruction of the operation information.
  • Step S303 Send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to the second terminal.
  • the server responds with the first semantic instruction and the second semantic instruction as a result, and sends them to the first terminal at the same time.
  • the first semantic instruction includes the response logic to reply to the first terminal.
  • the second semantic instruction includes the words corresponding to the target sub-intent in the voice command recognition result and the trial run result of the target sub-intent.
  • the first terminal executes the first semantic instruction and sends the second semantic instruction to the second terminal; the second terminal recognizes the second semantic instruction, and can identify the target sub-intent in the second semantic instruction at the same time.
  • The server may also send the first semantic instruction to the first terminal in a wired or wireless manner, and send the second semantic instruction directly to the second terminal (that is, the target terminal) in a wired or wireless manner.
  • the first terminal executes the switching skill and determines to switch to the second terminal (target terminal); the second terminal (target terminal) directly obtains the second semantic instruction sent by the server.
  • the second semantic instruction includes the trial run result of the target sub-intent; the second terminal can recognize the trial run result in the second semantic instruction, and directly sends the execution command to the server according to the trial run result, and calls the execution interface of the server.
  • The server calls the business logic corresponding to the target sub-intent according to the execution command, eliminating the need for the server to perform skill discovery, trial run, and selection on the target sub-intent in the second semantic instruction, which improves the response speed of the dialogue system.
  • The devices can achieve networked communication through mutual confirmation of addresses and interfaces, or through gateways or routing. Therefore, the trial run result in the second semantic instruction may be transmitted from the first terminal to the second terminal as an intermediate result, or may be sent by the server directly to the second terminal when calling it; a sketch of the second terminal's handling follows.
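  • A sketch of how the second terminal might act on the second semantic instruction, with execute_on_server and run as assumed callbacks wrapping the server's execution interface and local playback respectively:

        # The trial run result is taken directly from the instruction; no local
        # semantic processing or skill re-discovery is performed.
        def handle_second_instruction(instruction: dict, execute_on_server, run) -> None:
            trial = instruction["trial_run_result"]
            execution_command = {
                "skillId": trial["skillId"],
                "intentId": trial["intentId"],
                "slots": trial["slots"],
            }
            business_logic = execute_on_server(execution_command)  # call execution interface
            run(business_logic)                                    # e.g. start playback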
  • the sending the first semantic instruction and the second semantic instruction to the first terminal includes:
  • the first semantic instruction and the second semantic instruction are sent to the first terminal in the form of semantic representation.
  • The semantic representation form is a machine-readable language representation.
  • The server uses the semantically processed voice command recognition result, in the form of a semantic representation, as the reply result for the first terminal or the second terminal.
  • the server may also send the first semantic instruction to the first terminal in the form of semantic representation.
  • Step S304 Receive an execution command fed back after the second terminal recognizes the second semantic instruction, and send the business logic corresponding to the second semantic instruction to the second terminal according to the execution command.
  • the second semantic instruction includes the target sub-intent and the trial run result obtained by pre-verifying the target sub-intent; after receiving the second semantic instruction, the second terminal obtains the trial run result by identifying the second semantic instruction;
  • the second terminal directly calls the execution interface of the server according to the result of the trial operation, and sends an execution command to the server.
  • The server receives the execution command sent by the second terminal, docks with the business logic corresponding to the second semantic instruction according to the execution command, and sends the business logic to the second terminal device; for example, the server calls the movie data it holds and sends the movie data to the second terminal as the response logic, and the second terminal executes the corresponding business logic, that is, plays the movie Nezha.
  • the sending the service logic corresponding to the second semantic instruction to the second terminal according to the execution command includes:
  • The server receives the execution command sent by the second terminal, parses the trial run result of the target sub-intent, calls the business logic corresponding to the target sub-intent according to the parsing result, and sends the business logic to the second terminal in the form of a semantic representation, as sketched below.
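  • On the server side, the execution interface can dispatch directly on the parsed trial run result; a minimal sketch, with a hypothetical registry of business platforms:

        # The execution command already carries skill and intent identifiers, so
        # the server dispatches directly instead of re-running skill discovery.
        BUSINESS_PLATFORMS = {
            ("video.play", "play_movie"): lambda slots: {"action": "play", "slots": slots},
        }

        def handle_execution_command(command: dict) -> dict:
            key = (command["skillId"], command["intentId"])
            call_business_logic = BUSINESS_PLATFORMS[key]  # dock with the matching platform
            return call_business_logic(command["slots"])   # business logic for the terminal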
  • the dialogue management server corresponding to the first terminal and the dialogue management server corresponding to the second terminal may be the same server or two servers with the same function.
  • In this embodiment, the server semantically processes the voice command recognition result to obtain the operation information to be performed and sends the operation information to the first terminal. The first terminal executes the first semantic instruction in the operation information and sends the second semantic instruction to the second terminal. After the second terminal recognizes the second semantic instruction, the server directly receives the execution command fed back by the second terminal, calls the business logic corresponding to the second semantic instruction according to the execution command, and sends the business logic to the second terminal through the execution interface. Because the second terminal can feed back the execution command directly according to the task information contained in the second semantic instruction, no further semantic processing of the second semantic instruction is needed, which shortens the dialogue delay and improves the response time of the dialogue system.
  • FIG. 4 is a schematic flowchart of a voice control method provided by another embodiment of the present application.
  • In this embodiment the first terminal in FIG. 1 is the execution subject.
  • The terminal can be a mobile phone, computer, tablet, speaker, etc., which is not specifically limited here; the specific implementation of the method includes the following steps:
  • Step S401: Receive a voice command input by a user, and perform voice recognition on the voice command to obtain a voice command recognition result;
  • Step S402: Send the voice command recognition result to the server;
  • Step S403: Receive operation information fed back after the server performs semantic processing on the voice command recognition result, where the operation information includes a first semantic instruction and a second semantic instruction;
  • Step S404: Execute the first semantic instruction and send the second semantic instruction to a second terminal; the second semantic instruction is used to instruct the second terminal to send an execution command to the server and to receive the business logic corresponding to the second semantic instruction fed back by the server.
  • In some embodiments of this application, the first terminal may be provided with a voice assistant, which receives the voice command input by the user through a microphone and performs voice recognition (ASR) on the voice command to obtain the voice command recognition result, that is, the text information corresponding to the voice command.
  • the voice command recognition result is sent to the server in a wired or wireless manner, and the operation information fed back by the server is received; the operation information may include a first semantic instruction corresponding to the first terminal and a second semantic instruction corresponding to the second terminal.
  • the first terminal executes the first semantic instruction in the operation information, calls and switches to the second terminal, and at the same time sends the second semantic instruction to the second terminal.
  • the second semantic instruction may include a trial run result of the target sub-intent in the voice instruction recognition result.
  • The second terminal can recognize the trial run result in the second semantic instruction, directly send the execution command to the server according to the trial run result, and call the execution interface of the server. The server docks the business logic corresponding to the target sub-intent according to the execution command and feeds the business logic back to the second terminal, which completes the business logic. This saves the server from repeatedly processing the target sub-intent, thereby improving the response speed of the target device, shortening the response time of the dialogue system, and reducing the delay of human-machine voice interaction.
  • In a possible implementation, receiving the operation information fed back after the server performs semantic processing on the voice command recognition result includes: receiving the response logic fed back by the server for the target intent in the voice command recognition result, and receiving the trial run result fed back by the server for the target sub-intent in the voice command recognition result.
  • In a possible implementation, the first semantic instruction is the response logic fed back by the server for the target intent in the voice command recognition result, and the second semantic instruction is the trial run result fed back by the server for the target sub-intent in the voice command recognition result together with the target sub-intent;
  • correspondingly, executing the first semantic instruction and sending the second semantic instruction to the second terminal includes: executing the response logic fed back by the server, and sending the target sub-intent and the trial run result fed back by the server to the second terminal.
  • By adopting this embodiment, when the first terminal obtains the response logic fed back by the server based on the first terminal's context, it also obtains the trial run result of the target sub-intent in the voice command recognition result, and when the second terminal is called, the trial run result is sent to the second terminal as well. The second terminal can thus obtain the trial run result of the target sub-intent directly, without a further series of semantic processing steps on the target sub-intent by the server, which optimizes the data processing flow of the dialogue system and improves its response speed.
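  • As an illustration of the first terminal's role in steps S401 to S404, a minimal sketch follows; the Server and SecondTerminal classes are stand-ins for the real network peers, their method names are assumptions for this example, and ASR is abstracted away as already-recognized text.

    # Illustrative sketch of the first terminal (e.g. a phone with a voice assistant).

    class Server:
        # Stand-in for the dialogue management server; it returns fixed
        # operation information for the one utterance used in this example.
        def semantic_process(self, recognition_result):
            assert recognition_result == "Play movie Nezha on TV"
            return {
                "first_semantic_instruction": {"skill": "switch", "target": "TV",
                                               "utterance": "play movie Nezha"},
                "second_semantic_instruction": {
                    "utterance": "play movie Nezha",
                    "trial_run_result": {"skill": "video", "intent": "play",
                                         "slots": {"name": "Nezha"}},
                },
            }

    class SecondTerminal:
        # Stand-in for the target device (e.g. a TV).
        def receive(self, second_instruction):
            print("TV received:", second_instruction)

    def first_terminal_flow(recognized_text, server, tv):
        # Steps S402-S403: send the recognition result, get operation info back.
        operation_info = server.semantic_process(recognized_text)
        # Step S404: execute the first instruction (here, a switch) and forward
        # the second instruction, trial run result included, to the TV.
        if operation_info["first_semantic_instruction"]["skill"] == "switch":
            tv.receive(operation_info["second_semantic_instruction"])

    first_terminal_flow("Play movie Nezha on TV", Server(), SecondTerminal())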
  • FIG. 5 is a schematic flowchart of a voice control method provided by another embodiment of the present application.
  • In this embodiment, the second terminal in FIG. 1 is the execution subject; the second terminal can be a mobile phone, a tablet, a computer, a speaker, a TV, or a similar device, which is not specifically limited here. The specific implementation principle of the method includes the following steps:
  • Step S501: Receive the second semantic instruction sent when the first terminal executes the first semantic instruction; the first semantic instruction and the second semantic instruction are the operation information that the first terminal receives from the server, fed back according to the voice instruction recognition result after the first terminal sends the voice instruction recognition result to the server;
  • Step S502: Recognize the second semantic instruction, and obtain a recognition result of the second semantic instruction;
  • Step S503: Send an execution command to the server according to the recognition result;
  • Step S504: Receive the business logic corresponding to the second semantic instruction fed back by the server according to the execution command, and execute the business logic.
  • In some embodiments of this application, after receiving the second semantic instruction fed back by the server through the first terminal, the second terminal recognizes the second semantic instruction and obtains the trial run result of the target sub-intent in the voice instruction recognition result. Based on the trial run result, there is no need to perform semantic recognition processing on the utterance of the target sub-intent: the second terminal directly sends the execution command to the server and calls the execution interface of the server, so that the server docks with the corresponding business platform and calls the corresponding business logic according to the trial run result.
  • The second terminal receives the business logic fed back by the server and executes the business logic.
  • In a possible implementation, the operation information includes the response logic fed back by the server for the target intent in the voice command recognition result, and the trial run result fed back by the server for the target sub-intent in the voice command recognition result;
  • correspondingly, receiving the second semantic instruction sent when the first terminal executes the first semantic instruction includes: receiving the target sub-intent and the trial run result sent when the first terminal executes the response logic.
  • In a possible implementation, the second semantic instruction includes a trial run result obtained by the server pre-verifying the target sub-intent in the voice instruction recognition result;
  • correspondingly, recognizing the second semantic instruction and obtaining the recognition result of the second semantic instruction includes: recognizing the second semantic instruction and obtaining the trial run result of the target sub-intent.
  • With this embodiment of the application, when the second terminal receives the trial run result of the target sub-intent in the voice command recognition result, it can directly call the execution interface of the server according to the trial run result, without performing semantic recognition processing on the utterance of the target sub-intent. After receiving the execution command from the second terminal, the server docks with the business platform corresponding to the target sub-intent, calls the corresponding business logic, and feeds the business logic back to the second terminal, which executes it. This saves the repeated semantic processing of the utterance corresponding to the target sub-intent in the voice command recognition result and improves the response speed of the dialogue system.
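  • A minimal sketch of the second terminal's side of steps S501 to S504, under the assumption that the execution interface can be modeled as a plain function and that the second semantic instruction is a dictionary carrying a trial_run_result field; both are illustrative choices, not the application's actual interface.

    # Illustrative sketch of the second terminal (e.g. a TV).

    def server_execution_interface(trial_run_result):
        # On the server, the trial run result is parsed and the business logic
        # is docked directly, with no further semantic processing of the utterance.
        return {"skill": trial_run_result["skill"],
                "intent": trial_run_result["intent"],
                **trial_run_result["slots"]}

    def second_terminal_flow(second_semantic_instruction):
        # Step S502: recognize the instruction and extract the trial run result.
        trial = second_semantic_instruction.get("trial_run_result")
        if trial is not None:
            # Steps S503-S504: call the execution interface directly and execute
            # the business logic that comes back (e.g. start playback).
            business_logic = server_execution_interface(trial)
            print("executing:", business_logic)
        else:
            # Without a trial run result, the utterance would need a full round
            # of semantic processing on the server (the slower legacy path).
            print("fall back to full semantic processing")

    second_terminal_flow({"utterance": "play movie Nezha",
                          "trial_run_result": {"skill": "video", "intent": "play",
                                               "slots": {"name": "Nezha"}}})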
  • FIG. 6 is a schematic diagram of device interaction in the voice control method provided by an embodiment of the present application; through the network interconnection of multiple devices, voice control across devices is realized. The interaction process includes the following steps:
  • 1. The first terminal receives a voice command input by a user and performs voice recognition on the voice command to obtain a voice command recognition result;
  • 2. The first terminal sends the voice command recognition result to the server;
  • 3. The server performs semantic processing on the voice command recognition result to obtain operation information;
  • 4. The server sends the operation information to the first terminal; the operation information includes a first semantic instruction and a second semantic instruction;
  • 5. The first terminal executes the first semantic instruction;
  • 6. The first terminal sends the second semantic instruction to the second terminal;
  • 7. The second terminal recognizes the second semantic instruction;
  • 8. The second terminal sends an execution command to the server and calls the execution interface of the server;
  • 9. The server invokes the business logic corresponding to the second semantic instruction according to the execution command;
  • 10. The server sends the business logic to the second terminal;
  • 11. The second terminal executes the business logic.
  • Referring to FIG. 7, a schematic diagram of an application scenario of the voice control method provided by an embodiment of the present application: the first terminal is exemplified by a mobile phone, the server by a dialogue management server, and the second terminal by a television; all devices are networked and can communicate with each other through confirmation of addresses and interfaces.
  • As shown in the figure, the mobile phone receives the voice command "Play movie Nezha on TV" input by the user, performs voice recognition on the voice command, and obtains the text information of the voice command. The mobile phone sends the text information to the dialogue management server by wired or wireless means.
  • The dialogue management server performs semantic recognition of "Play movie Nezha on TV" based on the mobile phone context and, through skill discovery, trial run, and selection, picks the best skill "switch", determining the target as "TV" and the utterance as "play movie Nezha". Once the switching intent is determined, whether the TV supports playback needs to be pre-verified; after a series of skill discovery, trial run, and selection steps based on the simulated TV context, the verification result is "supported", and the trial run result is the target object "Object".
  • The skill "switch", the target "TV", and the utterance "play movie Nezha" are fed back to the mobile phone as the response logic. The mobile phone receives the response logic, executes the switching instruction, sends "play movie Nezha" to the TV, and also sends the trial run result "Object" to the TV.
  • The TV recognizes the trial run result "Object" and directly sends the execution command to the dialogue management server, calling the execution interface of the dialogue management server. The dialogue management server docks the business logic corresponding to "play movie Nezha" and feeds the business logic back to the TV, and the TV performs the operation of playing the movie Nezha according to the fed-back business logic.
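  • The messages exchanged in this scenario could be laid out as follows. Only the contents shown above ({skill=switch, target=TV, utterance=play movie Nezha}, the trial run result "Object", and {skill=play movie, name=Nezha}) come from the example; the field names and the dictionary layout are assumptions for illustration.

    # Possible wire payloads for the "Play movie Nezha on TV" scenario.

    response_logic = {                 # server -> phone (first semantic instruction)
        "skill": "switch",
        "target": "TV",
        "utterance": "play movie Nezha",
    }

    second_semantic_instruction = {    # phone -> TV
        "utterance": "play movie Nezha",
        "trial_run_result": "Object",  # target object from pre-verification
    }

    execution_command = {              # TV -> server, via the execution interface
        "trial_run_result": "Object",
    }

    business_logic = {                 # server -> TV; the TV then starts playback
        "skill": "play movie",
        "name": "Nezha",
    }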
  • FIG. 8 is a schematic diagram of an application scenario of the voice control method provided by another embodiment of the present application. After the dialogue management server performs semantic processing of "play movie Nezha" based on the simulated TV context and obtains the trial run result, it can send the trial run result directly to the TV through the network, while the utterance "play movie Nezha" is sent to the TV through the mobile phone. The TV then directly calls the server's execution interface according to the trial run result and sends the execution command to the dialogue management server; the dialogue management server docks the business logic corresponding to "play movie Nezha" and feeds the business logic back to the TV, and the TV performs the operation of playing the movie Nezha according to the fed-back business logic.
  • In another possible implementation, after the server executes the flow based on the mobile phone context, it obtains the response logic corresponding to the target intent and the trial run result of the target sub-intent. The server can directly call the TV, sending the utterance of the target sub-intent, "play movie Nezha", together with the trial run result to the TV at the same time. The TV recognizes the utterance corresponding to the target sub-intent and the trial run result, directly calls the execution interface of the dialogue management server according to the trial run result, and sends the execution command to the dialogue management server; the dialogue management server docks the business logic corresponding to "play movie Nezha" and feeds it back to the TV, and the TV performs the operation of playing the movie Nezha according to the fed-back business logic.
  • Referring to FIG. 9, a schematic diagram of an application scenario of the voice control method provided by another embodiment of the present application: the first terminal is exemplified by a mobile phone, the server by a dialogue management server, and the second terminal by a TV; all devices are networked, and mutual communication can be achieved through confirmation of addresses and interfaces.
  • As shown in FIG. 9, the mobile phone receives the user's voice command "Switch to the TV to play the movie The Great Sage Returns", performs voice recognition on the voice command, and obtains the text information corresponding to the voice command. The mobile phone calls the dialogue management server to perform semantic recognition on the text information of the voice command, which recognizes the skill and intent of switching devices: the target device is the TV, and the target sub-intent is "play the movie The Great Sage Returns".
  • The dialogue management server verifies whether the TV supports "play the movie The Great Sage Returns": based on the simulated TV context, it carries out the "skill discovery → trial run → selection" semantic processing flow and obtains the verification result "supported" together with the trial run result {skill=video, intent=play, slots={name=The Great Sage Returns}}.
  • The dialogue management server returns skill=switch, intent=switch, target=TV, target utterance="play the movie The Great Sage Returns", and trial run result={skill=video, intent=play, slots={name=The Great Sage Returns}} to the mobile phone. On receiving this result and recognizing that it is a switch, the mobile phone calls the TV and sends the target utterance "play the movie The Great Sage Returns" together with the trial run result {skill=video, intent=play, slots={name=The Great Sage Returns}} to the TV.
  • After receiving the switch command and recognizing the trial run result, the TV directly calls the execution interface of the dialogue management server to execute {skill=video, intent=play, slots={name=The Great Sage Returns}}. Upon receiving the execution command, the dialogue management server interprets {skill=video, intent=play, slots={name=The Great Sage Returns}}, directly calls the corresponding business logic, and returns skill=video, intent=play, name=The Great Sage Returns to the TV, which then plays the movie "The Great Sage Returns".
  • With this embodiment of the application, the first half of the processing flow on the target device is eliminated, which significantly shortens the response delay of the dialogue system (in practical applications the delay can be shortened by more than 50%) and improves the dialogue experience.
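  • The trial run result used in this example follows the structure described earlier: a skill identifier, an intent identifier, and a slot list in which each slot carries a slot name, a slot type, and a slot value, and which may hold any number of slots, including zero. A sketch of that structure follows; the dataclass names are illustrative assumptions.

    # Sketch of the trial run result structure: skill identifier, intent
    # identifier, and a slot list (name, type, value per slot).

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Slot:
        name: str    # e.g. "name"
        type: str    # e.g. "string", "date", "number"
        value: str   # e.g. "The Great Sage Returns"

    @dataclass
    class TrialRunResult:
        skill: str                                       # e.g. "video"
        intent: str                                      # e.g. "play"
        slots: List[Slot] = field(default_factory=list)  # may be empty

    result = TrialRunResult(
        skill="video",
        intent="play",
        slots=[Slot(name="name", type="string", value="The Great Sage Returns")],
    )
    print(result)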
  • Corresponding to the voice control method and the application scenarios described in the above embodiments, FIG. 10 shows a structural block diagram of the voice control apparatus provided in an embodiment of the present application; for ease of description, only the parts relevant to the embodiments of the present application are shown.
  • the device includes a first receiving module 101, a semantic processing module 102, a first sending module 103, and a command execution module 104.
  • the functions of each module are as follows:
  • the first receiving module 101 is configured to receive a voice command recognition result sent by the first terminal;
  • the semantic processing module 102 is configured to perform semantic processing on the voice instruction recognition result to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction;
  • the first sending module 103 is configured to send the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to the second terminal;
  • the command execution module 104 is configured to receive the execution command fed back after the second terminal recognizes the second semantic instruction, and according to the execution command, send the business logic corresponding to the second semantic instruction to the second terminal.
  • In a possible implementation, the semantic processing module includes:
  • a semantic recognition sub-module, used to recognize the voice command recognition result and obtain the target intent and target sub-intent of the voice command recognition result;
  • a task execution sub-module, used to pre-verify the target sub-intent according to the target intent to obtain the response logic of the target intent and the trial run result of the target sub-intent, to use the response logic as the first semantic instruction of the operation information, and to use the target sub-intent and the trial run result as the second semantic instruction of the operation information; a sketch of this pre-verification flow is given below.
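  • As a sketch of the pre-verification performed by the task execution sub-module (skill discovery, trial running against a simulated target-device context, filtering of failed runs, and selection of the best-ranked result); the candidate-skill shape and the scoring rule are assumptions made for this illustration.

    # Illustrative pre-verification: discover candidate skills for the target
    # sub-intent, trial-run each one, drop failures, select the best result.

    def pre_verify(sub_intent, simulated_context):
        candidates = [s for s in simulated_context["skills"]
                      if s["matches"](sub_intent)]                   # skill discovery
        trial_results = [s["trial_run"](sub_intent) for s in candidates]
        successes = [r for r in trial_results if r is not None]     # filter failures
        if not successes:
            return None        # the target device does not support the sub-intent
        return max(successes, key=lambda r: r["score"])             # selection

    video_skill = {
        "matches": lambda u: u.startswith("play movie "),
        "trial_run": lambda u: {"skill": "video", "intent": "play",
                                "slots": {"name": u[len("play movie "):]},
                                "score": 0.9},
    }
    print(pre_verify("play movie Nezha", {"skills": [video_skill]}))

  • Selecting the top-ranked successful trial run mirrors the "choose the best skill" step described above; the ranking rule itself is outside the scope of this sketch.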
  • In a possible implementation, the first sending module is further configured to send the first semantic instruction and the second semantic instruction to the first terminal in the form of a semantic representation.
  • In a possible implementation, the first sending module includes:
  • a first sub-module, used to parse the trial run result according to the execution command;
  • a second sub-module, used to call the business logic according to the parsed trial run result and to send the business logic to the second terminal in the form of a semantic representation.
  • Corresponding to the voice control method and the application scenarios described in the above embodiments, FIG. 11 shows a structural block diagram of the voice control apparatus provided by another embodiment of the present application; for ease of description, only the parts relevant to the embodiments of the present application are shown.
  • the device includes a voice recognition module 111, a second sending module 112, a second receiving module 113, and an instruction execution module 114.
  • the functions of each module are as follows:
  • the voice recognition module 111 is configured to receive a voice command input by a user, and perform voice recognition on the voice command to obtain a voice command recognition result;
  • the second sending module 112 is configured to send the voice command recognition result to the server
  • the second receiving module 113 is configured to receive operation information fed back after the server performs semantic processing on the voice command recognition result, where the operation information includes a first semantic instruction and a second semantic instruction;
  • the instruction execution module 114 is configured to execute the first semantic instruction and send the second semantic instruction to a second terminal; the second semantic instruction is used to instruct the second terminal to send an execution command to the server and to receive the business logic corresponding to the second semantic instruction fed back by the server.
  • In a possible implementation, the second receiving module is further configured to receive the response logic fed back by the server for the target intent in the voice command recognition result, and to receive the trial run result fed back by the server for the target sub-intent in the voice command recognition result.
  • In a possible implementation, the first semantic instruction is the response logic fed back by the server for the target intent in the voice instruction recognition result, and the second semantic instruction is the trial run result fed back by the server for the target sub-intent in the voice instruction recognition result together with the target sub-intent; the instruction execution module is further configured to execute the response logic fed back by the server, and to send the target sub-intent and the trial run result fed back by the server to the second terminal.
  • Corresponding to the voice control method and the application scenarios described in the above embodiments, FIG. 12 shows a structural block diagram of the voice control apparatus provided by another embodiment of the present application; for ease of description, only the parts relevant to the embodiments of the present application are shown.
  • the device includes a third receiving module 121, an instruction recognition module 122, a third sending module 123, and a service execution module 124.
  • the functions of each module are as follows:
  • the third receiving module 121 is configured to receive the second semantic instruction sent when the first terminal executes the first semantic instruction; the first semantic instruction and the second semantic instruction are the operation information that the first terminal receives from the server, fed back according to the voice instruction recognition result after the first terminal sends the voice instruction recognition result to the server;
  • the instruction recognition module 122 is configured to recognize the second semantic instruction and obtain the recognition result of the second semantic instruction
  • the third sending module 123 is configured to send an execution command to the server according to the recognition result
  • the business execution module 124 is configured to receive the business logic corresponding to the second semantic instruction fed back by the server according to the execution command, and execute the business logic.
  • In a possible implementation, the operation information includes the response logic fed back by the server for the target intent in the voice command recognition result, and the trial run result fed back by the server for the target sub-intent in the voice command recognition result; the third receiving module is further configured to receive the target sub-intent and the trial run result sent when the first terminal executes the response logic.
  • In a possible implementation, the second semantic instruction includes a trial run result obtained by the server pre-verifying the target sub-intent in the voice instruction recognition result; the instruction recognition module is further configured to recognize the second semantic instruction and obtain the trial run result of the target sub-intent.
  • In a possible implementation, the third sending module is further configured to send the execution command corresponding to the trial run result to the server according to the recognition result.
  • With the voice control method of this embodiment, the voice command recognition result sent by the first terminal is received and semantically processed to obtain the operation information to be executed, and the operation information is sent to the first terminal; the first terminal executes the first semantic instruction in the operation information and sends the second semantic instruction to the second terminal. After the second terminal recognizes the second semantic instruction, the execution command fed back by the second terminal can be received directly, the business logic corresponding to the second semantic instruction is called according to the execution command, and the business logic is sent to the second terminal. Because the second terminal feeds back the execution command directly according to the task information contained in the second semantic instruction, no further semantic processing of the second semantic instruction is needed: the corresponding business logic can be called according to the fed-back execution command and sent to the second terminal through the execution interface, which eliminates the processing flow for the second semantic instruction, shortens the dialogue delay, and improves the response time of the dialogue system.
  • FIG. 13 is a schematic structural diagram of a server provided by an embodiment of this application. As shown in FIG. 13, the server 13 of this embodiment includes: at least one processor 131 (only one is shown in FIG. 13), a memory 132, a computer program 133 stored in the memory 132 and runnable on the at least one processor 131, a natural language understanding module 134, and a dialogue management module 135. The memory 132, the natural language understanding module 134, and the dialogue management module 135 are coupled to the processor 131; the memory 132 is used to store the computer program 133, which includes instructions. The processor 131 reads the instructions from the memory 132 so that the server 13 performs the following operations:
  • receiving the voice instruction recognition result sent by the first terminal; performing semantic processing on the voice instruction recognition result to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction; sending the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to the second terminal; and receiving the execution command fed back after the second terminal recognizes the second semantic instruction, and sending the business logic corresponding to the second semantic instruction to the second terminal according to the execution command.
  • FIG. 14 is a schematic structural diagram of a terminal device provided by an embodiment of this application. As shown in FIG. 14, the terminal device 14 of this embodiment includes: at least one processor 141 (only one is shown in FIG. 14), a memory 142, a computer program 143 stored in the memory 142 and runnable on the at least one processor 141, and a voice assistant 144. The memory 142 and the voice assistant 144 are coupled to the processor 141; the memory 142 is used to store the computer program 143, which includes instructions. The processor 141 reads the instructions from the memory 142 so that the terminal device 14 performs the following operations:
  • receiving a voice instruction input by a user, and performing voice recognition on the voice instruction to obtain a voice instruction recognition result; sending the voice instruction recognition result to the server; receiving operation information fed back after the server performs semantic processing on the voice instruction recognition result, where the operation information includes a first semantic instruction and a second semantic instruction; and executing the first semantic instruction and sending the second semantic instruction to the second terminal, where the second semantic instruction is used to instruct the second terminal to send an execution command to the server and to receive the business logic corresponding to the second semantic instruction fed back by the server.
  • FIG. 15 is a schematic structural diagram of a terminal device provided by another embodiment of this application. As shown in FIG. 15, the terminal device 15 of this embodiment includes: at least one processor 151 (only one is shown in FIG. 15), a memory 152, and a computer program 153 stored in the memory 152 and runnable on the at least one processor 151. The memory 152 is coupled to the processor 151 and is used to store the computer program 153, which includes instructions. The processor 151 reads the instructions from the memory 152 so that the terminal device 15 performs the following operations:
  • receiving the second semantic instruction sent when the first terminal executes the first semantic instruction, where the first semantic instruction and the second semantic instruction are the operation information that the first terminal receives from the server, fed back according to the voice instruction recognition result after the first terminal sends the voice instruction recognition result to the server; recognizing the second semantic instruction to obtain the recognition result of the second semantic instruction; sending an execution command to the server according to the recognition result; and receiving the business logic corresponding to the second semantic instruction fed back by the server according to the execution command, and executing the business logic.
  • The server 13 may be a cloud server or a local physical server; the terminal device 14 and the terminal device 15 may be devices such as a desktop computer, a notebook, a palmtop computer, a mobile phone, a TV, or a speaker. The server 13, the terminal device 14, and the terminal device 15 may include, but are not limited to, a processor and a memory. Those skilled in the art can understand that FIG. 13, FIG. 14, and FIG. 15 are only examples of servers and terminal devices and do not constitute a limitation on them; they may include more or fewer components than shown in the figures, combine some components, or use different components, and may, for example, also include input and output devices, network access devices, and so on.
  • The so-called processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • In some embodiments, the memory may be an internal storage unit of the server 13, the terminal device 14, or the terminal device 15, such as a hard disk or internal memory. In other embodiments, the memory may also be an external storage device of the server 13, the terminal device 14, or the terminal device 15, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card. Further, the memory may include both the internal storage unit and an external storage device of the server 13, the terminal device 14, or the terminal device 15. The memory is used to store the operating system, applications, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program; it may also be used to temporarily store data that has been or will be output.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program; the computer program includes instructions that, when run on a terminal device, cause the terminal device to execute the voice control method described above.
  • An embodiment of the present application provides a computer program product containing instructions that, when the computer program product runs on a terminal device, cause the terminal device to execute the voice control method described in any one of the implementations of the first aspect above.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the procedures in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware; the computer program can be stored in a computer-readable storage medium, and when executed by a processor, the steps of the foregoing method embodiments can be implemented. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form.
  • The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to the apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunications signals.
  • In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are only illustrative: the division into modules or units is only a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
  • The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

Abstract

A voice control method, apparatus, server, terminal device, and storage medium, belonging to the field of terminal technologies. The method includes: receiving a voice instruction recognition result sent by a first terminal (S301); performing semantic processing on the voice instruction recognition result to obtain operation information, where the operation information includes a first semantic instruction and a second semantic instruction (S302); sending the first semantic instruction and the second semantic instruction to the first terminal, where the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal (S303); and receiving an execution command fed back after the second terminal recognizes the second semantic instruction, and sending the business logic corresponding to the second semantic instruction to the second terminal according to the execution command (S304). The method can solve the problem that, when multiple devices hold a joint dialogue, dialogue management repeatedly processes the user's task instruction in multiple stages, lengthening the system response time and increasing the dialogue delay.


Claims (19)

  1. A speech control method, comprising:
    receiving a speech instruction recognition result sent by a first terminal;
    performing semantic processing on the speech instruction recognition result to obtain operation information, wherein the operation information comprises a first semantic instruction and a second semantic instruction;
    sending the first semantic instruction and the second semantic instruction to the first terminal, wherein the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal; and
    receiving an execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and sending, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
  2. The speech control method according to claim 1, wherein performing semantic processing on the speech instruction recognition result to obtain the operation information comprises:
    recognizing the speech instruction recognition result to obtain a target intent and a target sub-intent of the speech instruction recognition result;
    pre-verifying the target sub-intent according to the target intent to obtain response logic of the target intent and a trial-run result of the target sub-intent; and
    using the response logic as the first semantic instruction of the operation information, and using the target sub-intent and the trial-run result as the second semantic instruction of the operation information.
  3. The speech control method according to claim 1, wherein sending the first semantic instruction and the second semantic instruction to the first terminal comprises:
    sending the first semantic instruction and the second semantic instruction to the first terminal in the form of semantic representations.
  4. The speech control method according to claim 2, wherein sending, according to the execution command, the service logic corresponding to the second semantic instruction to the second terminal comprises:
    parsing the trial-run result according to the execution command; and
    invoking the service logic according to the parsed trial-run result, and sending the service logic to the second terminal in the form of a semantic representation.
  5. A speech control method, comprising:
    receiving a speech instruction input by a user, and performing speech recognition on the speech instruction to obtain a speech instruction recognition result;
    sending the speech instruction recognition result to a server;
    receiving operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, wherein the operation information comprises a first semantic instruction and a second semantic instruction; and
    executing the first semantic instruction, and sending the second semantic instruction to a second terminal, wherein the second semantic instruction is used to instruct the second terminal to send an execution command to the server and to receive service logic that corresponds to the second semantic instruction and is fed back by the server.
  6. The speech control method according to claim 5, wherein receiving the operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result comprises:
    receiving response logic fed back by the server for a target intent in the speech instruction recognition result, and receiving a trial-run result fed back by the server for a target sub-intent in the speech instruction recognition result.
  7. The speech control method according to claim 5, wherein the first semantic instruction is response logic fed back by the server for a target intent in the speech instruction recognition result, and the second semantic instruction is a trial-run result fed back by the server for a target sub-intent in the speech instruction recognition result, together with the target sub-intent; and
    correspondingly, executing the first semantic instruction and sending the second semantic instruction to the second terminal comprises:
    executing the response logic fed back by the server, and sending the target sub-intent and the trial-run result fed back by the server to the second terminal.
  8. A speech control method, comprising:
    receiving a second semantic instruction sent by a first terminal while the first terminal executes a first semantic instruction, wherein the first semantic instruction and the second semantic instruction are operation information that is fed back by a server according to a speech instruction recognition result and received by the first terminal after the first terminal sends the speech instruction recognition result to the server;
    recognizing the second semantic instruction to obtain a recognition result of the second semantic instruction;
    sending an execution command to the server according to the recognition result; and
    receiving service logic that corresponds to the second semantic instruction and is fed back by the server according to the execution command, and executing the service logic.
  9. The speech control method according to claim 8, wherein the operation information comprises response logic fed back by the server for a target intent in the speech instruction recognition result, and a trial-run result fed back by the server for a target sub-intent in the speech instruction recognition result; and
    correspondingly, receiving the second semantic instruction sent by the first terminal while the first terminal executes the first semantic instruction comprises:
    receiving the target sub-intent and the trial-run result sent by the first terminal while the first terminal executes the response logic.
  10. The speech control method according to claim 8, wherein the second semantic instruction comprises a trial-run result obtained by the server by pre-verifying a target sub-intent in the speech instruction recognition result; and
    correspondingly, recognizing the second semantic instruction to obtain the recognition result of the second semantic instruction comprises:
    recognizing the second semantic instruction to obtain the trial-run result of the target sub-intent.
  11. The speech control method according to claim 10, wherein sending the execution command to the server according to the recognition result comprises:
    sending, according to the recognition result, the execution command corresponding to the trial-run result to the server.
  12. A speech control apparatus, comprising:
    a first receiving module, configured to receive a speech instruction recognition result sent by a first terminal;
    a semantic processing module, configured to perform semantic processing on the speech instruction recognition result to obtain operation information, wherein the operation information comprises a first semantic instruction and a second semantic instruction;
    a first sending module, configured to send the first semantic instruction and the second semantic instruction to the first terminal, wherein the first semantic instruction is used to instruct the first terminal to send the second semantic instruction to a second terminal; and
    a command execution module, configured to receive an execution command fed back by the second terminal after the second terminal recognizes the second semantic instruction, and send, according to the execution command, service logic corresponding to the second semantic instruction to the second terminal.
  13. A speech control apparatus, comprising:
    a speech recognition module, configured to receive a speech instruction input by a user, and perform speech recognition on the speech instruction to obtain a speech instruction recognition result;
    a second sending module, configured to send the speech instruction recognition result to a server;
    a second receiving module, configured to receive operation information fed back by the server after the server performs semantic processing on the speech instruction recognition result, wherein the operation information comprises a first semantic instruction and a second semantic instruction; and
    an instruction execution module, configured to execute the first semantic instruction and send the second semantic instruction to a second terminal, wherein the second semantic instruction is used to instruct the second terminal to send an execution command to the server and to receive service logic that corresponds to the second semantic instruction and is fed back by the server.
  14. A speech control apparatus, comprising:
    a third receiving module, configured to receive a second semantic instruction sent by a first terminal while the first terminal executes a first semantic instruction, wherein the first semantic instruction and the second semantic instruction are operation information that is fed back by a server according to a speech instruction recognition result and received by the first terminal after the first terminal sends the speech instruction recognition result to the server;
    an instruction recognition module, configured to recognize the second semantic instruction to obtain a recognition result of the second semantic instruction;
    a third sending module, configured to send an execution command to the server according to the recognition result; and
    a service execution module, configured to receive service logic that corresponds to the second semantic instruction and is fed back by the server according to the execution command, and execute the service logic.
  15. A server, comprising a memory, a processor, a natural language understanding module, and a dialogue management module, wherein the memory is configured to store a computer program, the computer program comprises instructions, and when the instructions are executed by the server, the server is caused to perform the speech control method according to any one of claims 1 to 4.
  16. A terminal device, comprising a memory, a processor, and a voice assistant, wherein the memory is configured to store a computer program, the computer program comprises instructions, and when the instructions are executed by the terminal device, the terminal device is caused to perform the speech control method according to any one of claims 5 to 7.
  17. A terminal device, comprising a memory and a processor, wherein the memory is configured to store a computer program, the computer program comprises instructions, and when the instructions are executed by the terminal device, the terminal device is caused to perform the speech control method according to any one of claims 8 to 11.
  18. A computer-readable storage medium, storing a computer program, wherein the computer program comprises instructions, and when the instructions are run on a terminal device, the terminal device is caused to perform the speech control method according to any one of claims 1 to 4, 5 to 7, or 8 to 11.
  19. A computer program product comprising instructions, wherein when the computer program product runs on a terminal device, the terminal device is caused to perform the speech control method according to any one of claims 1 to 4, 5 to 7, or 8 to 11.
PCT/CN2020/125215 2019-12-31 2020-10-30 Speech control method and apparatus, server, terminal device, and storage medium WO2021135604A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/789,873 US20230053765A1 (en) 2019-12-31 2020-10-30 Speech Control Method and Apparatus, Server, Terminal Device, and Storage Medium
EP20910466.0A EP4064713A4 (en) 2019-12-31 2020-10-30 VOICE CONTROL METHOD AND APPARATUS, SERVER, TERMINAL DEVICE AND STORAGE MEDIA

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911417229.4A 2019-12-31 2019-12-31 Speech control method and apparatus, server, terminal device, and storage medium
CN201911417229.4 2019-12-31

Publications (1)

Publication Number Publication Date
WO2021135604A1 true WO2021135604A1 (zh) 2021-07-08

Family

ID=76686450

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125215 2019-12-31 2020-10-30 Speech control method and apparatus, server, terminal device, and storage medium WO2021135604A1 (zh)

Country Status (4)

Country Link
US (1) US20230053765A1 (zh)
EP (1) EP4064713A4 (zh)
CN (1) CN113127609A (zh)
WO (1) WO2021135604A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494267A (zh) * 2021-11-30 2022-05-13 北京国网富达科技发展有限责任公司 Substation and cable tunnel scene semantic construction system and method
CN114785842A (zh) * 2022-06-22 2022-07-22 北京云迹科技股份有限公司 Robot scheduling method, apparatus, device, and medium based on a voice exchange system

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116114255A (zh) * 2020-06-08 2023-05-12 搜诺思公司 Control using distributed command processing
US11830489B2 (en) * 2021-06-30 2023-11-28 Bank Of America Corporation System and method for speech processing based on response content
CN113838463A (zh) * 2021-09-16 2021-12-24 Oppo广东移动通信有限公司 Information transmission method and apparatus, electronic device, and storage medium
CN114286167A (zh) * 2021-12-03 2022-04-05 杭州逗酷软件科技有限公司 Cross-device interaction method and apparatus, electronic device, and storage medium
CN116805488A (zh) * 2022-03-18 2023-09-26 华为技术有限公司 Multi-device speech control system and method
CN115097738A (zh) * 2022-06-17 2022-09-23 青岛海尔科技有限公司 Digital-twin-based device control method and apparatus, storage medium, and electronic apparatus
CN115567567A (zh) * 2022-09-20 2023-01-03 中国联合网络通信集团有限公司 Device control method and apparatus, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105161106A (zh) * 2015-08-20 2015-12-16 深圳Tcl数字技术有限公司 Voice control method and apparatus for intelligent terminal, and television system
WO2018213323A1 (en) * 2017-05-16 2018-11-22 Google Llc Cross-device handoffs
CN109451338A (zh) * 2018-12-12 2019-03-08 央广视讯传媒股份有限公司 Method, apparatus, electronic device, and readable medium for voice remote control of a television
CN110265033A (zh) * 2019-06-21 2019-09-20 四川长虹电器股份有限公司 System and method for extending voice interaction functions of a device
CN110491387A (zh) * 2019-08-23 2019-11-22 三星电子(中国)研发中心 Multi-terminal-based interactive service implementation method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7640160B2 (en) * 2005-08-05 2009-12-29 Voicebox Technologies, Inc. Systems and methods for responding to natural language speech utterance
CN102736536A (zh) * 2012-07-13 2012-10-17 海尔集团公司 Method and apparatus for voice control of electrical appliances
US9548066B2 (en) * 2014-08-11 2017-01-17 Amazon Technologies, Inc. Voice application architecture
US10740384B2 (en) * 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
CN107085463A (zh) * 2016-02-15 2017-08-22 北京北信源软件股份有限公司 Smart device control system and method supporting natural-language information interaction
WO2019142427A1 (ja) * 2018-01-16 2019-07-25 ソニー株式会社 Information processing device, information processing system, information processing method, and program
CN109493851A (zh) * 2018-11-20 2019-03-19 新视家科技(北京)有限公司 Voice control method, related apparatus, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4064713A4

Also Published As

Publication number Publication date
EP4064713A4 (en) 2023-01-18
US20230053765A1 (en) 2023-02-23
EP4064713A1 (en) 2022-09-28
CN113127609A (zh) 2021-07-16

Similar Documents

Publication Publication Date Title
WO2021135604A1 (zh) Speech control method and apparatus, server, terminal device, and storage medium
US10074365B2 Voice control method, mobile terminal device, and voice control system
CN107004411B (zh) Voice application architecture
CN108831469B (zh) Voice command customization method, apparatus and device, and computer storage medium
CN108133707B (zh) Content sharing method and system
US20120197629A1 Speech translation system, first terminal apparatus, speech recognition server, translation server, and speech synthesis server
US20220334795A1 System and method for providing a response to a user query using a visual assistant
CN111049996A (zh) Multi-scenario speech recognition method and apparatus, and intelligent customer service system applying same
US20210398527A1 Terminal screen projection control method and terminal
CN110246499B (zh) Voice control method and apparatus for household devices
CN109215652A (zh) Volume adjustment method and apparatus, playback terminal, and computer-readable storage medium
WO2019228138A1 (zh) Music playback method and apparatus, storage medium, and electronic device
CN113921004A (зh) Smart device control method and apparatus, storage medium, and electronic device
CN109918492A (зh) Human-machine dialogue setting method and human-machine dialogue setting system
CN108597499B (зh) Speech processing method and speech processing apparatus
CN112837683B (зh) Voice service method and apparatus
KR102357620B1 (ко) Chatbot integrated agent platform system for chatbot channel linkage and integration, and service method therefor
CN111147530B (зh) Multi-voice-platform system, switching method, intelligent terminal, and storage medium
WO2019228140A1 (зh) Instruction execution method and apparatus, storage medium, and electronic device
JP2023510518A (ja) Voice verification and restriction method for a voice terminal
CN114694645A (зh) Method and apparatus for determining user intention
CN112489644A (зh) Speech recognition method and apparatus for an electronic device
CN111312254A (зh) Voice dialogue method and apparatus
CN110472254A (зh) Speech translation method, communication terminal, and computer-readable storage medium
WO2022226715A1 (en) Hybrid text to speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20910466

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020910466

Country of ref document: EP

Effective date: 20220622

NENP Non-entry into the national phase

Ref country code: DE