WO2023241482A1 - Human-machine dialogue method, device and system - Google Patents

Human-machine dialogue method, device and system

Info

Publication number
WO2023241482A1
WO2023241482A1 · PCT/CN2023/099440 · CN2023099440W
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
terminal device
voice signal
server
vehicle
Prior art date
Application number
PCT/CN2023/099440
Other languages
English (en)
French (fr)
Inventor
宋凯凯
赵世奇
周剑辉
沈波
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023241482A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech

Definitions

  • the embodiments of the present application relate to speech-processing technology, and in particular to a human-computer dialogue method, device and system.
  • the present application provides a method, device and system for human-machine dialogue.
  • the terminal device sends a voice signal, or the text of the voice signal, to the server; the server and the terminal device then generate a second instruction and a first instruction, respectively, based on the voice signal or the text.
  • when the instruction obtained first by the terminal device cannot be executed, the terminal device selects the target instruction from the first instruction and the second instruction for execution.
  • the terminal device can select the target instruction from the two instructions, which can improve the accuracy of the voice response.
  • embodiments of the present application provide a human-machine dialogue method, which is applied to a terminal device.
  • the method includes:
  • the terminal device receives the voice signal
  • the terminal device obtains the first instruction based on the voice signal
  • the terminal device sends the voice signal or the text of the voice signal to the server, and the server is used to obtain the second instruction based on the voice signal; the text of the voice signal is obtained by the terminal device based on the voice signal;
  • the terminal device receives the second instruction sent by the server
  • the terminal device selects a target instruction from the first instruction and the second instruction for execution, and the target instruction is the first instruction or the second instruction.
  • the terminal device sends the voice signal or its text to the server, and the server and the terminal device simultaneously generate the second instruction and the first instruction, respectively, based on the voice signal or text; when the instruction obtained first cannot be executed directly, the terminal device selects the target instruction from the first instruction and the second instruction for execution.
  • the terminal device and the server process the voice signal at the same time, and the terminal device can select the target instruction from the two instructions, which can improve the accuracy of the voice response.
  • here, "cannot be executed directly" means that the other instruction needs to be considered: the instruction is not executed until the other instruction has been taken into account.
  • in this case, the terminal device needs to wait for the other instruction to arrive, compare the two instructions, and then select the target instruction to execute; it cannot execute the instruction before the other instruction has arrived or before the comparison has been made.
  • for example, if the terminal device determines that the first instruction cannot be executed directly, it waits for the second instruction; if the confidence level of the second instruction is higher than that of the first instruction, the terminal device executes the second instruction; otherwise, the first instruction is executed.
  • for example, the terminal device waits for the second instruction; if the second instruction has not been obtained after a preset time, the terminal device can determine that the second instruction is abnormal, and then execute the first instruction.
  • an instruction is directly executable if it can be executed without considering the other instruction. For example, if the terminal device first obtains the first instruction and determines that it can be executed directly, the terminal device may execute the first instruction immediately without waiting for the second instruction. It should be noted that when the above instruction is the first instruction, the other instruction is the second instruction, and when the above instruction is the second instruction, the other instruction is the first instruction.
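  The waiting, comparison, and timeout behavior described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the instruction fields (`directly_executable`, `confidence`), and the 2-second default timeout are all assumptions.

```python
import queue

PRESET_TIMEOUT_S = 2.0  # the patent only says "preset time"; 2 s is an assumed value

def select_target_instruction(first_instr, server_queue, timeout_s=PRESET_TIMEOUT_S):
    """Pick the target instruction on the terminal side.

    `first_instr` is the locally generated first instruction; `server_queue`
    delivers the server's second instruction when it arrives.
    """
    if first_instr.get("directly_executable"):
        return first_instr  # executable without considering the other instruction
    try:
        second_instr = server_queue.get(timeout=timeout_s)
    except queue.Empty:
        return first_instr  # second instruction abnormal (timed out): fall back
    # both instructions available: compare confidence levels
    if second_instr.get("confidence", 0.0) > first_instr.get("confidence", 0.0):
        return second_instr
    return first_instr
```

  A directly executable first instruction returns immediately; otherwise the function blocks until the server's instruction arrives or the preset time elapses.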
  • if the target instruction is the first-obtained instruction and that instruction cannot be executed, it is not executed; if the target instruction is the first-obtained instruction and that instruction is executable, it is executed.
  • the method also includes:
  • the terminal device does not execute the instruction that was obtained later.
  • in one implementation, before the target instruction is selected from the first instruction and the second instruction for execution, the method further includes:
  • when the terminal device obtains the first instruction and has not yet received the second instruction, it determines the executability of the first instruction; the executability indicates whether or not the instruction can be directly executed;
  • if the terminal device determines that the first instruction cannot be executed directly, it waits for the second instruction.
  • in this way, when the first instruction is inaccurate or unexecutable, the terminal device can wait for the second instruction, thereby ensuring the accuracy of the response.
  • determining the executability of the first instruction includes:
  • the terminal device determines the executability of the first instruction based on first decision information; the first decision information includes at least one of: the decision result of the previous round, the status identifier of the previous round of dialogue, the intent result of the current round determined by the terminal device, the instruction result of the current round determined by the terminal device, and the confidence level of the first instruction.
  • the previous-round decision result indicates the source of the instruction executed in the previous round of dialogue, which may be the server or the terminal device; the status identifier indicates a single round or multiple rounds; the intent result indicates a new intent or a multi-round intent; the instruction result is normal or abnormal, where abnormal indicates that the operation corresponding to the voice signal cannot be performed.
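  The first decision information can be pictured as a small record plus a rule over it. The field names and the example executability rule below are assumptions made for illustration; the patent only enumerates the items, not how they are combined.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FirstDecisionInfo:
    last_round_source: Optional[str] = None   # "server" or "terminal": source of last round's instruction
    last_round_state: Optional[str] = None    # "single_round" or "multi_round"
    intent_result: Optional[str] = None       # "new_intent" or "multi_round_intent"
    instruction_result: Optional[str] = None  # "normal" or "abnormal"
    confidence: Optional[float] = None        # confidence of the first instruction

def first_instruction_directly_executable(info: FirstDecisionInfo, threshold: float = 0.8) -> bool:
    """One plausible rule: the local result must be normal, confident enough,
    and must not interrupt a server-led multi-round dialogue."""
    if info.instruction_result == "abnormal":
        return False
    if info.last_round_state == "multi_round" and info.last_round_source == "server":
        return False
    return (info.confidence or 0.0) >= threshold
```

  Under this sketch, an abnormal local instruction result or an ongoing server-led multi-round dialogue forces the terminal to wait for the second instruction.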
  • in one implementation, before the target instruction is selected from the first instruction and the second instruction for execution, the method further includes:
  • when the terminal device obtains the second instruction and has not yet received the first instruction, it determines the executability of the second instruction;
  • if the terminal device determines that the second instruction cannot be executed directly, it waits for the first instruction.
  • determining the executability of the second instruction includes:
  • the terminal device determines the executability of the second instruction based on the second decision information;
  • the second decision information includes at least one of: the decision result of the previous round, the status identifier of the previous round of dialogue, the intent result of the current round determined by the server, and the instruction result of the current round determined by the server;
  • the previous-round decision result indicates the source of the instruction executed in the previous round of dialogue, which may be the server or the terminal device; the status identifier indicates a single round or multiple rounds; the intent result indicates a new intent or a multi-round intent; the instruction result is normal or abnormal, where abnormal indicates that the operation corresponding to the voice signal cannot be performed.
  • the terminal device determines the target instruction from the first instruction and the second instruction based on the third decision information
  • the third decision information includes at least one of: the decision result of the previous round, the status identifier of the previous round of dialogue, the intent result of the current round determined by the server, the instruction result of the current round determined by the server, the intent result of the current round determined by the terminal device, the instruction result of the current round determined by the terminal device, and the confidence level of the first instruction;
  • the previous-round decision result indicates the source of the instruction executed in the previous round of dialogue, which may be the server or the terminal device; the status identifier indicates a single round or multiple rounds; the intent result indicates a new intent or a multi-round intent; the instruction result is normal or abnormal, where abnormal indicates that the operation corresponding to the voice signal cannot be performed.
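  A sketch of how the third decision information might drive the final choice between the two instructions. The precedence below (abnormal results lose; otherwise the terminal's confidence decides) is an assumed rule set, not the patent's actual decision rules.

```python
def choose_target_instruction(server_result, terminal_result, terminal_confidence, threshold=0.8):
    """Return "first" (terminal's instruction), "second" (server's), or None.

    `server_result` / `terminal_result` are "normal" or "abnormal";
    `terminal_confidence` is the confidence of the first instruction.
    """
    if server_result == "abnormal" and terminal_result == "abnormal":
        return None  # neither side can perform the operation for this voice signal
    if terminal_result == "abnormal":
        return "second"
    if server_result == "abnormal":
        return "first"
    # both normal: keep the local result only when it is confident enough
    return "first" if terminal_confidence >= threshold else "second"
```

  In a real system the decision rules would come from the configuration file mentioned later in the document, rather than being hard-coded.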
  • the terminal device sends the dialogue state to the server, and the server uses the dialogue state to generate the second instruction; the dialogue state includes at least one of the previous round's decision result and the status identifier of the previous round of dialogue.
  • the terminal device obtains the first instruction based on the voice signal, including:
  • the terminal device obtains the intent based on the voice signal
  • the terminal device obtains the first instruction based on the voice signal, including:
  • the method also includes:
  • the terminal device uses the voice signal and the intent corresponding to the second instruction as a training sample to train the end-side model.
  • the terminal device recognizes the intent of the voice signal based on the end-side model; the terminal device can train the end-side model based on the server's results to improve its ability to generate voice instructions.
  • selecting a target instruction from the first instruction and the second instruction for execution includes:
  • the server obtains the second instruction based on the voice signal
  • the present application provides an electronic device.
  • the electronic device may include memory and a processor.
  • memory can be used to store computer programs.
  • the processor may be used to call the computer program so that the electronic device executes the method of the first aspect or any possible implementation of the first aspect.
  • the present application provides a computer program product containing instructions; when the computer program product runs on an electronic device, the electronic device is caused to execute the method of the second aspect or any possible implementation of the second aspect.
  • the electronic devices provided by the third and fourth aspects, the computer program products provided by the fifth and sixth aspects, and the computer-readable storage media provided by the seventh and eighth aspects are all used to execute the methods provided by the embodiments of the present application. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding methods, which are not repeated here.
  • Figure 1 is a system architecture diagram of a human-machine dialogue system provided by an embodiment of the present application
  • Figure 3 is a flow chart of a human-computer dialogue method provided by an embodiment of the present application.
  • Figure 4 is a schematic flowchart of a decision-making method provided by an embodiment of the present application.
  • Figure 7 is a schematic flowchart of a vehicle's decision-making method when receiving a second instruction from a server provided by an embodiment of the present application;
  • Figure 8 is a flowchart of another human-machine dialogue method provided by an embodiment of the present application.
  • FIG. 10 is a software structure block diagram of the electronic device 100 provided by the embodiment of the present application.
  • the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of this application, unless otherwise specified, "plurality" means two or more.
  • GUI graphical user interface
  • Figure 1 is a system architecture diagram of a human-machine dialogue system provided by an embodiment of the present application.
  • the human-machine dialogue system 10 may include a server 100 and a terminal device 200 .
  • the server 100 may be a server deployed in the cloud;
  • the terminal device 200 may be a vehicle (FIG. 1 exemplarily shows that the terminal device 200 is a vehicle).
  • the server 100 is a server with a voice recognition function.
  • the server 100 in the embodiment of this application can be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal device 200 may include, but is not limited to, a vehicle, a tablet, a monitor, a television, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer, a netbook, an augmented reality device, a virtual reality device, an artificial intelligence device, vehicle-mounted equipment, smart home equipment, etc.
  • the terminal device 200 in the embodiment of the present application may be the vehicle 002 as shown in FIG. 2 .
  • Vehicle 002 is a car that senses the road environment through an on-board sensing system, automatically plans a driving route, and controls the vehicle to reach a predetermined destination.
  • Smart cars make intensive use of computers, modern sensing, information fusion, communications, artificial intelligence and automatic control technologies. They are a high-tech complex that integrates environmental perception, planning and decision-making, multi-level assisted driving and other functions.
  • the terminal device 200 can use the voice assistant to perform automatic speech recognition (ASR) on the received voice signal to obtain the corresponding text information, and then process the text information through natural language understanding (NLU) to obtain the intent after semantic recognition.
  • through dialogue management (Dialog Manager, DM), a task is derived from the input intent and the information required for the task is clarified; the system then connects to the business platform to complete the task, or requests further voice input of more information, or obtains the business logic of the corresponding task from the business platform, and finally returns the execution result to the user.
  • the above-mentioned voice assistant may be an embedded application in the terminal device 200 (ie, a system application of the terminal device 200), or may be a downloadable application.
  • the embedded application is an application provided as part of the implementation of the terminal device 200 (such as a vehicle);
  • the downloadable application is an application that can provide its own Internet Protocol Multimedia Subsystem (IMS) connection.
  • IMS Internet Protocol Multimedia Subsystem
  • the downloadable application may be pre-installed in the terminal device 200, or may be a third-party application downloaded and installed in the terminal device 200 by the user.
  • human-computer dialogue system architecture in Figure 1 is only an exemplary implementation in the embodiment of the present application.
  • the human-computer dialogue system architecture in the embodiment of the present application includes, but is not limited to, the above human-computer dialogue system architecture.
  • FIG. 2 is a functional block diagram of a vehicle 002 provided by an embodiment of the present application.
  • vehicle 002 may be configured in a fully or partially autonomous driving mode.
  • vehicle 002 may control itself while in the autonomous driving mode: it may determine the current state of the vehicle and its surrounding environment, determine the possible behavior of at least one other vehicle in the surrounding environment, determine a confidence level corresponding to the likelihood that the other vehicle will perform the possible behavior, and control vehicle 002 based on the determined information. While vehicle 002 is in the autonomous driving mode, it may be configured to operate without human interaction.
  • Vehicle 002 may include various subsystems, such as travel system 202 , sensor system 204 , control system 206 , one or more peripheral devices 208 , voice system 260 , and power supply 210 , computer system 212 , and user interface 216 .
  • vehicle 002 may include more or fewer subsystems, and each subsystem may include multiple elements. Additionally, each subsystem and element of vehicle 002 may be interconnected via wires or wirelessly.
  • Speech system 260 may include components that enable speech recognition for vehicle 002 .
  • the voice system 260 may include a device-cloud collaboration module 261, a service execution module 262, a speech recognition (ASR) module 263, a natural language understanding (NLU) module 264, and a dialogue management (DM) module 265.
  • ASR speech recognition
  • NLU natural language understanding
  • DM dialogue management
  • the terminal-cloud collaboration module 261 can perform configuration management, such as updating configuration files, which may include decision rules; perform terminal-cloud decision-making, such as determining the target instruction, based on the decision rules, from the instructions generated by the server and the vehicle; perform terminal-cloud distribution, for example sending the received voice signal to the server and to the speech recognition module 263 at the same time; and perform timeout management, for example executing the instruction generated by vehicle 002 when no instruction is received from the server within a preset time.
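  The distribution and timeout-management duties of the terminal-cloud collaboration module can be sketched with a thread pool: the voice signal is handed to the local pipeline and the server in parallel, and the local instruction is used if the server misses the deadline. The helper names, the result shape, and the confidence comparison are illustrative assumptions.

```python
import concurrent.futures

def dispatch_voice_signal(voice_signal, local_asr_pipeline, cloud_request, timeout_s=2.0):
    """Send the voice signal to the local pipeline and the server at the same
    time; fall back to the locally generated instruction on timeout."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        local_future = pool.submit(local_asr_pipeline, voice_signal)
        cloud_future = pool.submit(cloud_request, voice_signal)
        local_instr = local_future.result()
        try:
            cloud_instr = cloud_future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return local_instr  # timeout management: use the vehicle's own instruction
    # both instructions arrived: keep the one with the higher confidence
    if cloud_instr.get("confidence", 0.0) > local_instr.get("confidence", 0.0):
        return cloud_instr
    return local_instr
```

  A production module would additionally consult the decision rules from the configuration file rather than a bare confidence comparison.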
  • the speech recognition module 263 is used to convert speech signals into text information.
  • the main function of the natural language understanding module 264 is to process sentences input by the user, or the results of speech recognition, and extract the user's dialogue intent and the information the user conveys. For example, if the user says "I want to eat mutton steamed buns", the NLU module can identify that the user's intent is "find a restaurant" and that the key entity is "mutton steamed buns".
  • entities are words with specific meanings or strong references in text, usually including names of people, places, and organizations, dates and times, proper nouns, etc. For example, if "mutton steamed buns" is not among the vehicle's preset entities, the vehicle cannot recognize it as a kind of food when recognizing the voice signal.
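  The dependence on a preset entity lexicon can be shown with a toy NLU: an utterance mentioning an entity outside the lexicon simply cannot be mapped to a slot. Both the lexicon contents and the intent label here are invented for illustration.

```python
# A toy entity lexicon; if "mutton steamed buns" were missing from it,
# the utterance below could not be mapped to a food entity at all.
FOOD_ENTITIES = ("mutton steamed buns", "noodles", "dumplings")

def toy_nlu(text):
    """Return (intent, slots) for an utterance, mimicking the NLU module."""
    for entity in FOOD_ENTITIES:
        if entity in text:
            return "find_restaurant", {"food": entity}
    return "unknown", {}
```

  Real NLU modules use trained models rather than substring matching, but the failure mode for unknown entities is the same in spirit.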
  • the main function of the dialogue management module 265 is to control the flow of the human-computer dialogue based on the intent, slot, and key-entity results output by the NLU module, so as to update the system state and generate the corresponding system actions.
  • the DM module can maintain the current dialogue state based on the dialogue history.
  • the dialogue state can include the accumulated intents, slots, and key-entity sets of the entire dialogue history; the module can output the next dialogue action based on the current dialogue state. For example, based on the text derived from speech, the vehicle can generate a textual response to that text.
  • the speech recognition module can first convert the speech signal into corresponding text information. Furthermore, the speech understanding module in the vehicle 002 can use the NLU algorithm to extract the user intention (intent) and slot information (slot) in the above text information.
  • the user intention in the above voice signal is: navigation
  • the slot information in the voice signal is: Big Wild Goose Pagoda.
  • the dialogue management module can request corresponding service content from the server of the relevant third-party application based on the extracted user intention and slot information. For example, the dialogue management module can request navigation services whose destination is the Big Wild Goose Pagoda from the Baidu Map APP server.
  • the server of Baidu Map APP can send the navigation route with the destination of Big Wild Goose Pagoda to vehicle 002.
  • the voice assistant APP in vehicle 002 can display the above navigation route in the dialogue interface in the form of a card, etc., thereby completing this round of response to the voice signal.
  • the speech system 260 may also include a speech synthesis (Text To Speech, TTS) module, an intelligent storage service (IDS) module, a full-scenario brain module, etc.
  • TTS Text To Speech
  • IDS intelligent storage service
  • the TTS module can process the received voice stream, extract the voice characteristics of the user corresponding to the voice stream, and store them in the voice library. For example, it can identify multi-timbral sounds and children's voices; the IDS module can store conversation status, etc.
  • the travel system 202 may include components that provide powered motion for the vehicle 002 .
  • travel system 202 may include an engine 218 , an energy source 219 , a transmission 220 and wheels 221 .
  • the engine 218 may be an internal combustion engine, an electric motor, an air compression engine, or other types of engine combinations, such as a hybrid engine composed of a gas oil engine and an electric motor, or a hybrid engine composed of an internal combustion engine and an air compression engine.
  • Engine 218 converts energy source 219 into mechanical energy.
  • Examples of energy sources 219 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electricity. Energy source 219 may also provide energy for other systems of vehicle 002 .
  • Transmission 220 may transmit mechanical power from engine 218 to wheels 221 .
  • Transmission 220 may include a gearbox, differential, and driveshaft.
  • the transmission device 220 may also include other components, such as a clutch.
  • the drive shaft may include one or more axles that may be coupled to one or more wheels 221 .
  • Sensor system 204 may include a number of sensors that sense information about the environment surrounding vehicle 002 .
  • the sensor system 204 may include a global positioning system 222 (the global positioning system may be a GPS system, a Beidou system, or another positioning system), an inertial measurement unit (IMU) 224, a radar 226, a laser rangefinder 228, and a camera 230.
  • the sensor system 204 may also include sensors for the internal systems of the vehicle 002 being monitored (eg, an interior air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors can be used to detect objects and their corresponding properties (position, shape, orientation, speed, etc.). This detection and identification is a critical function for the safe operation of autonomous vehicle 002.
  • Global positioning system 222 may be used to estimate the geographic location of vehicle 002 .
  • the IMU 224 is used to sense the position and orientation changes of the vehicle 002 based on inertial acceleration.
  • IMU 224 may be a combination of an accelerometer and a gyroscope.
  • IMU 224 can be used to measure the curvature of vehicle 002.
  • Radar 226 may utilize radio signals to sense objects within the environment surrounding vehicle 002 . In some embodiments, in addition to sensing objects, radar 226 may be used to sense the speed and/or heading of the object.
  • Camera 230 may be used to capture multiple images of the surrounding environment of vehicle 002 .
  • the camera 230 may be a still camera or a video camera, a visible light camera or an infrared camera, or any camera used to acquire images, which is not limited in the embodiment of the present application.
  • Control system 206 controls the operation of vehicle 002 and its components.
  • Control system 206 may include various elements, including steering unit 232 , throttle 234 , braking unit 236 , sensor fusion algorithm unit 238 , computer vision system 240 , route control system 242 , and obstacle avoidance system 244 .
  • Throttle 234 is used to control the operating speed of engine 218 and thereby the speed of vehicle 002 .
  • Computer vision system 240 may operate to process and analyze images captured by camera 230 in order to identify objects and/or features in the environment surrounding vehicle 002 . Such objects and/or features may include traffic signals, road boundaries, and obstacles. Computer vision system 240 may use object recognition algorithms, Structure from Motion (SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, computer vision system 240 may be used to map an environment, track objects, estimate the speed of objects, and the like.
  • SFM Structure from Motion
  • Obstacle avoidance system 244 is used to identify, evaluate, and avoid or otherwise negotiate potential obstacles in the environment of vehicle 002 .
  • control system 206 may additionally or alternatively include components other than those shown and described, or some of the components shown above may be omitted.
  • Peripheral devices 208 may include a wireless communication system 246 , an onboard computer 248 , a microphone 250 and/or a speaker 252 .
  • Wireless communication system 246 may wirelessly communicate with one or more devices directly or via a communication network.
  • the wireless communication system 246 may utilize WiFi to communicate with a wireless local area network (WLAN).
  • wireless communication system 246 may utilize infrared links, Bluetooth, or wireless personal area network (ZigBee) to communicate directly with the device.
  • other wireless protocols, such as various vehicle communication systems, may also be used.
  • Computer system 212 may include at least one processor 213 that executes instructions 215 stored in a non-transitory computer-readable medium such as memory 214.
  • Computer system 212 may also be multiple computing devices that control individual components or subsystems of vehicle 002 in a distributed manner.
  • the processor may be located remotely from the vehicle and in wireless communication with the vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the vehicle and others are performed by a remote processor, including taking the steps necessary to perform a single maneuver.
  • memory 214 may store data such as road maps, route information, vehicle location, direction, speed and other such vehicle data, as well as other information. This information may be used by vehicle 002 and computer system 212 during operation of vehicle 002 in autonomous, semi-autonomous and/or manual modes. For example, the current speed and current curvature of the vehicle can be fine-tuned based on the road information of the target road segment and the received vehicle speed range and vehicle curvature range, so that the speed and curvature of the smart vehicle are within the vehicle speed range and vehicle curvature range.
  • one or more of these components described above may be installed separately or associated with vehicle 002 .
  • memory 214 may exist partially or completely separate from vehicle 002 .
  • the components described above may be communicatively coupled together in wired and/or wireless manners.
  • An autonomous vehicle traveling on the road can identify objects within its surrounding environment to determine adjustments to its current speed.
  • the objects may be other vehicles, traffic control equipment, or other types of objects.
  • each identified object can be considered independently, and its characteristics, such as its current speed, acceleration, and distance from the vehicle, can be used to determine the speed to which the autonomous vehicle will adjust.
  • the smart vehicle function diagram in Figure 2 is only an exemplary implementation in the embodiment of the present application.
  • the smart vehicle in the embodiment of the present application includes but is not limited to the above structure.
  • Figure 3 is a flow chart of a human-computer dialogue method provided by an embodiment of the present application. This method can be applied to the human-computer dialogue system described in Figure 1 above, in which the server 100 may be used to support and execute steps S304 to S305 of the method flow shown in FIG. 3, and the terminal device 200 may be used to support and execute steps S301 to S303 and step S306 of the method flow shown in FIG. 3.
  • the method may include some or all of the following steps.
  • FIG. 3 schematically shows the flow of the human-computer dialogue method provided by the embodiment of the present application.
  • the human-machine dialogue method may include some or all of the following steps:
  • the vehicle receives voice signals through the voice assistant.
  • the voice command input by the user can be received through the microphone, triggering the vehicle to perform an operation corresponding to the voice command.
  • the voice signal in the embodiment of this application is a voice command. For example, after the user wakes up the vehicle's voice assistant, the user can say "turn on the air conditioner"; "turn on the air conditioner" is then a voice command, and the voice assistant can turn on the air conditioner in the car after receiving it.
  • the vehicle sends the voice signal to the server.
  • the vehicle sends the dialogue status to the server, and the server can combine the decision results of previous rounds to generate voice commands for this round.
  • the cloud does not know the decision-making results of the previous round. If the decision-making results are not uploaded to the server, it may cause errors in the current round of decision-making by the server.
  • the previous round of dialogue was: User: Help me buy a ticket from Shanghai to Beijing; the current round of dialogue is: User: How is the weather there? If there is no dialogue information from the previous round on the server side, the server cannot determine the address indicated by "there" spoken by the user in this round of dialogue.
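The dependence on cross-round context described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function and field names ("destination", `resolve_reference`) are assumptions. Without the previous round's dialogue state, the anaphor "there" cannot be resolved.

```python
# Minimal illustration of multi-round reference resolution (hypothetical names).
# The previous round established a "destination" entity; "How is the weather
# there?" can only be resolved against that stored dialogue state.

def resolve_reference(utterance: str, prev_state: dict) -> dict:
    """Resolve a location reference using the previous round's entities."""
    if "there" in utterance and "destination" in prev_state.get("entities", {}):
        place = prev_state["entities"]["destination"]
        return {"intent": "query_weather", "location": place}
    # Without uploaded dialogue state, the reference cannot be resolved.
    return {"intent": "query_weather", "location": None}

prev_round = {"intent": "buy_ticket",
              "entities": {"origin": "Shanghai", "destination": "Beijing"}}

with_context = resolve_reference("How is the weather there?", prev_round)
without_context = resolve_reference("How is the weather there?", {})
```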
  • the vehicle is equipped with a voice assistant.
  • the device-cloud collaboration module in the voice assistant can send the voice signal to the server and to the ASR module in the voice assistant at the same time; alternatively, after receiving the voice signal, the device-cloud collaboration module sends the voice signal to the server and calls the ASR interface to convert the voice signal into text.
  • Based on the voice signal, the vehicle obtains the first instruction corresponding to the voice signal.
  • the vehicle first converts the voice signal into text information; then performs semantic recognition on the text information to obtain the recognized intention; and finally obtains the first instruction based on the intention.
  • the first instruction is used to instruct an operation corresponding to the voice signal.
  • the vehicle can use a speech recognition (ASR) algorithm to convert the voice signal into the corresponding text information, obtaining "What will the weather be like tomorrow?"; further, the vehicle can use an NLU algorithm to extract the user intention from that text information, that is, from "What will the weather be like tomorrow?"; the vehicle can then generate the first instruction corresponding to the voice signal based on the user intention.
  • the content of the first instruction may be "call the weather application to obtain tomorrow's weather conditions and display tomorrow's weather conditions through the card".
  • the vehicle can also determine the first dialogue state of this round based on the voice signal.
  • the first dialogue state may include indication information indicating whether the current round of dialogue is multiple rounds or a single round, and may also include historical intentions, texts, and entities, as well as current round intentions, texts, and entities, etc.
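The instruction and dialogue-state contents described above can be sketched as simple data shapes. This is an illustrative sketch only; the class and field names (`Instruction`, `DialogueState`, `multi_round`, etc.) are assumptions, not the patent's data model.

```python
from dataclasses import dataclass, field

# Hypothetical data shapes for the first instruction and the first dialogue
# state described above; all field names are illustrative assumptions.

@dataclass
class Instruction:
    source: str          # "vehicle" (first instruction) or "server" (second)
    action: str          # e.g. call the weather app and show a card
    confidence: float    # end-side or cloud-side confidence

@dataclass
class DialogueState:
    multi_round: bool                              # single vs. multiple rounds
    history: list = field(default_factory=list)    # past intents/texts/entities
    current: dict = field(default_factory=dict)    # this round's intent/text/entities

first_instruction = Instruction(
    source="vehicle",
    action="call the weather application and display tomorrow's weather via a card",
    confidence=0.92,
)
first_state = DialogueState(
    multi_round=False,
    current={"intent": "query_weather", "text": "What will the weather be like tomorrow?"},
)
```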
  • the voice assistant includes a device-cloud collaboration module, an ASR module, an NLU module and a DM module.
  • the device-cloud collaboration module sends the voice signal to the ASR module; after the ASR module obtains the text information corresponding to the voice signal, the NLU module identifies the intention based on the text information; finally, the DM module obtains the first instruction and the first dialogue state.
  • the DM module may send the first instruction and the first conversation status to the terminal-cloud collaboration module, and the terminal-cloud collaboration module decides whether to use the first instruction, that is, step S306 is executed.
  • the voice assistant does not have the ASR module, NLU module and DM module.
  • the voice assistant can call the functions of the ASR, NLU and DM plug-ins to obtain the first instruction based on the voice signal.
  • After receiving the voice signal, the server determines the second instruction corresponding to the voice signal.
  • the server first converts the voice signal into text information; then performs semantic recognition on the text information to obtain the recognized intention; and finally obtains the second instruction based on the intention.
  • the second instruction is used to instruct the operation corresponding to the voice signal.
  • the server has an ASR module, an NLU module and a DM module. After receiving the voice signal, the server sends the voice signal to the ASR module; after the ASR module obtains the text information corresponding to the voice signal, the NLU module identifies the intention based on the text information; finally, the DM module obtains the second instruction and the second dialogue state.
  • the server obtains the second instruction based on the voice information and sends the second instruction to the vehicle.
  • the instruction obtained first is the first instruction
  • the vehicle can determine whether the first instruction can be directly executed based on the first decision information.
  • the first decision information may include any one of the last round of decision results, the status identifier of the previous round of dialogue, the end-side intention result of the current round, the end-side instruction result of the current round, and the end-side confidence level.
  • the end-side current-round intention result is the current-round intention result determined by the vehicle;
  • the end-side current-round command result is the current-round command result determined by the vehicle;
  • the end-side confidence is the confidence of the first command;
  • the previous round of decision-making result is used to indicate the source of the instruction executed for the previous round of dialogue, where the source may be the server or the terminal device;
  • the status identifier is used to indicate a single round or multiple rounds;
  • the intent result is used to indicate new intentions and multi-round intentions;
  • the command result may be normal or abnormal; abnormal is used to indicate that the operation corresponding to the voice signal cannot be performed.
  • when the command result is abnormal, this may include the case where the vertical domain corresponding to the first command is one that cannot be closed (completed) on the device side, that is, the vehicle does not support executing the first command.
  • vertical domain refers to functional areas, such as vehicle control, music playback and weather query.
  • the first instruction is a weather query; since the weather query is a vertical domain that cannot be closed, the instruction result of the first instruction is abnormal.
  • the vehicle can determine that it can directly execute the first command when each of the last round of decision-making results, the status identification of the previous round of dialogue, the end-side current round intention results, and the end-side current round instruction results meet the preset conditions.
  • the preset conditions can be: when the decision result of the previous round is the vehicle, the status identifier of the previous round of dialogue is multi-round, the end-side intention result of this round is a multi-round result, and the end-side instruction result of this round is normal, it can be determined that the first command may be executed.
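The preset conditions above amount to a simple conjunction, which can be sketched as a predicate. The terminology follows the text, but the function name and string values are illustrative assumptions.

```python
# A sketch of the preset decision rule above. The terminology comes from the
# text; the function name and enum-like string values are assumptions.

def can_execute_first_instruction(prev_decision: str,
                                  prev_state: str,
                                  intent_result: str,
                                  instruction_result: str) -> bool:
    """Return True when the end-side (first) instruction may run directly."""
    return (prev_decision == "vehicle"           # last round used the vehicle's instruction
            and prev_state == "multi_round"      # previous dialogue was multi-round
            and intent_result == "multi_round"   # this round continues that intent
            and instruction_result == "normal")  # the instruction is executable

ok = can_execute_first_instruction("vehicle", "multi_round", "multi_round", "normal")
blocked = can_execute_first_instruction("server", "multi_round", "new_intent", "normal")
```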
  • for example, the vehicle determines that the status identifier of the previous round of dialogue is multi-round, the intention result of this round obtained by the vehicle is a new intention (such as a vehicle-control intention), and the intention result of this round obtained by the server is a multi-round intention. In that case, if the vehicle determines that the decision result of the previous round is the server, the vehicle can determine the second instruction as the target instruction.
  • after the vehicle executes the instruction obtained first, it can store that instruction and the dialogue state corresponding to it; further, when uploading the next voice signal to the server, it can also upload the instruction obtained first and its associated dialogue state to the server.
  • the instruction obtained first is the first instruction
  • the dialogue state corresponding to the instruction is the first dialogue state. If the vehicle receives the next voice signal after executing the first instruction, the vehicle can upload the first instruction and the first dialogue state to the server together with that next voice signal.
  • the vehicle may have different decision-making logic when making decisions on the obtained instructions in the above steps S306 and S307.
  • the decision-making method for the vehicle to make decisions on the obtained instructions will be described below, taking Figures 4 and 5 as examples.
  • when the vehicle obtains the first instruction, it can query whether the second instruction has been executed; when determining that the second instruction has been executed, the vehicle can discard the first instruction, that is, not execute it; when determining that the second instruction has not been executed, the vehicle can determine whether the first instruction can be directly executed. When it determines that the first instruction can be directly executed, it executes the first instruction; when it determines that the first instruction cannot be directly executed, it waits for the second instruction and checks its executability: the first instruction is discarded when the second instruction can be directly executed, and the target instruction is selected from the first instruction and the second instruction for execution when the second instruction cannot be directly executed.
  • the vehicle optimizes the command recognition capability of the vehicle based on the second command.
  • the vehicle can optimize its instruction recognition capability based on the second instruction when it determines that the first instruction and the second instruction are different; it can also do so only when the two instructions are different and a preset condition is met. The preset condition may include the end-side confidence of the first instruction being lower than a preset threshold, the second instruction being the target instruction, or a determination that the second instruction is more accurate than the first instruction. For example, after the vehicle first obtains and executes the first instruction, and then determines upon receiving the second instruction that the second instruction is more accurate, the vehicle can optimize its instruction recognition capability based on the second instruction and the voice signal.
  • the first instruction is obtained by the vehicle based on the end-side model and the voice signal.
  • the vehicle uses the voice signal and the second instruction as training samples to train the end-side model, continuously making up for the capability gap between the terminal and the cloud.
  • the second instruction may include text, intent, and entity; the end-side model may include an ASR model and an NLU model.
  • the vehicle can obtain text and intent from the second instruction; furthermore, the vehicle can train an ASR model based on the speech signal and the text; and train an NLU model based on the intent and the text.
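The sample-construction step above can be sketched as follows: the cloud's second instruction (carrying text and intent) is paired with the raw voice signal to yield one ASR training pair and one NLU training pair. The training loop itself is omitted, and the function name and dictionary keys are illustrative assumptions.

```python
# Sketch: turn a cloud (second) instruction into training samples for the
# end-side models. The second instruction carries text, intent and entity as
# stated above; the sample formats here are illustrative assumptions.

def build_training_samples(voice_signal: bytes, second_instruction: dict):
    text = second_instruction["text"]
    intent = second_instruction["intent"]
    # The ASR model learns: audio -> text
    asr_sample = {"audio": voice_signal, "target_text": text}
    # The NLU model learns: text -> intent
    nlu_sample = {"text": text, "target_intent": intent}
    return asr_sample, nlu_sample

asr_s, nlu_s = build_training_samples(
    b"\x00\x01",  # placeholder for raw audio bytes
    {"text": "play Song A by Singer B", "intent": "play_music", "entity": "Song A"},
)
```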
  • the device-side model can be a speech recognition module trained based on the initial speech signal and the initial popular entities.
  • the end-side model presets the original popular entities. Taking music as an example, the initially preset entities include the names of the top 10,000 (Top 1w) singers and their songs.
  • the second instruction includes text, entity and intention
  • the vehicle can store the correspondence between the text, the entity and the intention, so that when the text information corresponding to a later voice signal matches the stored text, the vehicle can directly obtain the intention and the entity.
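The stored correspondence above behaves like a lookup cache, which can be sketched as follows. The class and method names are illustrative assumptions; the patent does not specify the storage structure.

```python
# Sketch of the correspondence store described above: once the cloud has
# resolved a text into an intent and entity, the vehicle can answer the same
# text locally without re-running NLU. Names are illustrative assumptions.

class IntentCache:
    def __init__(self):
        self._store = {}

    def remember(self, text: str, intent: str, entity: str) -> None:
        self._store[text] = (intent, entity)

    def lookup(self, text: str):
        """Return (intent, entity) if this exact text was seen before, else None."""
        return self._store.get(text)

cache = IntentCache()
cache.remember("play Song A", "play_music", "Song A")
hit = cache.lookup("play Song A")
miss = cache.lookup("turn on the air conditioner")
```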
  • The decision-making method of step S306 will be described in detail below, taking Figures 6 and 7 as examples.
  • the vehicle obtains the first instruction and the second instruction at different times.
  • the vehicle can first determine whether to execute that instruction; after determining that the instruction cannot be directly executed, it waits for the other of the first instruction and the second instruction to be obtained, and then determines the target instruction corresponding to the voice signal from the two instructions.
  • Figure 6 exemplarily shows a flowchart of the vehicle's decision-making method when it obtains the first instruction.
  • the decision-making method can include some or all of the following steps:
  • the device-cloud collaboration module obtains the first command corresponding to the voice signal.
  • the device-cloud collaboration module is deployed in the voice assistant of the vehicle and can receive the first instruction sent from the DM module in the voice assistant, or can receive the first instruction obtained by the voice assistant calling the DM plug-in.
  • the terminal-cloud collaboration module queries whether the voice signal has a decision result.
  • the voice signal has a decision result to indicate that the operation corresponding to the voice signal has been executed; there is no decision result to indicate that the operation corresponding to the voice signal has not been executed. That is to say, after receiving the first instruction, the device-cloud collaboration module first queries whether the operation corresponding to the voice signal has been executed. The decision result is used to indicate whether the operation corresponding to the voice signal is to execute an instruction generated by the server (ie, the second instruction) or an instruction generated by the vehicle (ie, the first instruction).
  • the decision result can be a first instruction or a second instruction, or it can be a server or a vehicle; understandably, the decision result is a vehicle, that is, the target instruction corresponding to the voice signal is the first instruction; the decision result is the server, that is, the target command corresponding to the voice signal is the second command.
  • the device-cloud collaboration module can query storage to see whether the voice signal has a decision result. For example, the end-cloud collaboration module writes the decision result of the voice signal into a public queue. After receiving the first instruction, the end-cloud collaboration module can read the status field in the public queue indicating the decision result to determine whether the voice signal has a decision result. For example, when the device-cloud collaboration module receives the first instruction sent from the DM module, it reads the above-mentioned status field in the public queue and determines whether the voice signal has a decision result.
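The public-queue bookkeeping described above can be sketched as a per-signal record with a decision status field and arrival flags. This is an illustrative sketch; the class, method, and field names are assumptions.

```python
# Sketch of the public-queue record described above: per voice signal it
# stores whether a decision result exists and which instructions have arrived.
# Class and field names are illustrative assumptions.

class PublicQueue:
    def __init__(self):
        self._records = {}  # voice_signal_id -> record

    def _record(self, sid):
        return self._records.setdefault(
            sid, {"decision": None, "first_arrived": False, "second_arrived": False})

    def mark_arrival(self, sid, which: str) -> None:
        # which is "first" or "second" (instruction identifier)
        self._record(sid)[f"{which}_arrived"] = True

    def write_decision(self, sid, source: str) -> None:
        # source is "vehicle" (first instruction) or "server" (second)
        self._record(sid)["decision"] = source

    def has_decision(self, sid) -> bool:
        return self._record(sid)["decision"] is not None

q = PublicQueue()
q.mark_arrival("sig-1", "first")
before = q.has_decision("sig-1")       # no decision result yet
q.write_decision("sig-1", "vehicle")
after = q.has_decision("sig-1")        # decision result now exists
```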
  • when the terminal-cloud collaboration module finds that the voice signal does not have a decision result, it can execute step S603, that is, further determine whether the first instruction can be directly executed; when it finds that the voice signal has a decision result, it can execute step S604, that is, discard the first instruction. It can be understood that if the terminal-cloud collaboration module first obtained the second instruction, determined it to be the target instruction, and executed it, then the module can discard the first instruction when it obtains it, to avoid repeatedly executing the operation corresponding to this voice signal.
  • the public queue may also include an instruction identifier.
  • when the device-cloud collaboration module obtains the first instruction, it can write into the public queue an instruction identifier indicating that the first instruction has been obtained; when it obtains the second instruction, it can write into the public queue an instruction identifier indicating that the second instruction has been obtained.
  • the device-cloud collaboration module determines whether to directly execute the first instruction.
  • the device-cloud collaboration module determines whether to directly execute the first instruction, that is, determines whether the first instruction is the target instruction corresponding to the voice signal. It can be understood that if the device-cloud collaboration module determines that the first instruction is the target instruction corresponding to the voice signal, it can directly execute the first instruction; if it is not sure whether the first instruction is the target instruction, it can wait for the second instruction and then, based on the first instruction and the second instruction, comprehensively determine which of the two is the target instruction corresponding to the voice signal.
  • the device-cloud collaboration module when it determines that the voice signal does not have a decision result, it may determine whether the first instruction can be directly executed based on the decision rule and the first decision information.
  • the first decision information may include the last round of decision results, the status identifier of the previous round of dialogue, the end-side current round intention (NLU) result, and the end-side current round instruction (DM) result; the first decision information may also include the end-side confidence. The end-side confidence can be obtained by the vehicle's NLU module based on this round's voice signal, by the ASR module based on this round's voice signal, or by combining the confidences obtained from the vehicle's NLU module and ASR module.
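The combined-confidence option above could be realized in many ways; the sketch below uses a weighted geometric mean purely as an illustration. The patent does not give a formula, so the function name, weighting scheme, and values are all assumptions.

```python
# Illustrative sketch only: combine ASR and NLU confidences into a single
# end-side confidence. The weighted geometric mean is an assumed choice,
# not the patent's formula.

def fuse_confidence(asr_conf: float, nlu_conf: float, asr_weight: float = 0.5) -> float:
    """Weighted geometric mean of the two module confidences; stays in [0, 1]."""
    return (asr_conf ** asr_weight) * (nlu_conf ** (1.0 - asr_weight))

fused = fuse_confidence(0.81, 0.64)
```

A geometric mean is one plausible choice here because a very low confidence from either module pulls the fused value down sharply, which is the conservative behavior a decision module would want.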
  • the result of the previous round of decision-making refers to whether the target instruction of the previous round of decision-making was a server-generated instruction or a vehicle-generated instruction;
  • the status identifier of the previous round of dialogue refers to whether the previous round was in a single-round dialogue or a multi-round dialogue; the end-side current round intention result can include a new intention or a multi-round intention.
  • the results of the current round of instructions on the terminal side can be normal or abnormal. Normal can include having a result or a partial result, and abnormal can be having no result.
  • the device-cloud collaboration module can determine whether to execute the first instruction based on the decision rules, according to the last round of decision results, the status identifier of the previous round of dialogue, the end-side intention results of the current round, and the end-side current round of instruction results.
  • the decision rules include: when the decision result of the previous round is the vehicle, the status identifier of the previous round of dialogue is multi-round, the end-side intention result of this round is multi-round, and the end-side instruction result of this round is normal, it is determined that the first command can be executed. The device-cloud collaboration module therefore determines that the first instruction can be executed when the above first decision information satisfies this decision rule.
  • the device-cloud collaboration module discards the first instruction.
  • the business execution module executes the first instruction.
  • the service execution module when the service execution module receives the first instruction sent from the device-cloud collaboration module, it executes the first instruction.
  • the content of the first instruction may be "call the weather application to obtain tomorrow's weather conditions and display tomorrow's weather conditions through a card.”
  • when the device-cloud collaboration module 261 determines to execute the first instruction, it may call the weather application through the business execution module 262 to obtain tomorrow's weather conditions and display them through a card.
  • the device-cloud collaboration module determines the target command corresponding to the voice signal from the first command and the second command.
  • the cloud-side confidence can be obtained by the NLU module in the server based on this round's voice signal, by the server's ASR module based on this round's voice signal, or by combining the confidences obtained from the server's NLU module and cloud-side ASR module.
  • the device-cloud collaboration module determines the first instruction as the target instruction, and then executes step S607.
  • when the device-cloud collaboration module does not receive the second instruction from the server within the preset time, it does not execute the first instruction, and either does not respond to the voice signal or reminds the user that the service is unavailable.
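The wait-with-timeout behaviour above can be sketched with a simple event-based waiter: if the server's second instruction does not arrive within the preset time, the vehicle gives up on this voice signal. The class name, method names, and timeout value are illustrative assumptions.

```python
import threading

# Sketch of the timeout behaviour above: if the first instruction cannot run
# directly and the server's second instruction does not arrive within a preset
# time, the vehicle gives up. Names and the timeout are assumptions.

class SecondInstructionWaiter:
    def __init__(self):
        self._event = threading.Event()
        self._second = None

    def deliver(self, instruction: dict) -> None:
        """Called when the server's second instruction arrives."""
        self._second = instruction
        self._event.set()

    def wait(self, timeout_s: float):
        """Return the second instruction, or None if it never arrives in time."""
        if self._event.wait(timeout_s):
            return self._second
        return None  # do not execute the first instruction; report unavailable

w = SecondInstructionWaiter()
timed_out = w.wait(timeout_s=0.01)   # nothing delivered -> None
w.deliver({"source": "server", "action": "query_weather"})
received = w.wait(timeout_s=0.01)    # already delivered -> returns instruction
```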
  • the business execution module can also store the decision results of this round. For example, when the target command is the first command, the service execution module writes in the public queue that the command used in the voice message of this round of dialogue is the first command from the vehicle.
  • the device-cloud collaboration module obtains the second command corresponding to the voice signal.
  • the terminal-cloud collaboration module queries whether the voice signal has a decision result.
  • the device-cloud collaboration module may first query whether the operation corresponding to the voice signal has been executed. When the terminal-cloud collaboration module finds that the voice signal does not have a decision result, it can execute step S703, that is, further determine whether the second instruction can be directly executed; when it finds that the voice signal has a decision result, it can execute step S704, that is, discard the second instruction. It can be understood that if the terminal-cloud collaboration module first obtained the first instruction, determined it to be the target instruction, and executed it, then the module can discard the second instruction when it obtains it, to avoid repeatedly executing the operation corresponding to this voice signal.
  • the device-cloud collaboration module determines whether to directly execute the second instruction.
  • the device-cloud collaboration module determines whether to directly execute the second command, that is, determines whether the second command is a target command corresponding to the voice signal.
  • the device-cloud collaboration module can determine whether to execute the second instruction based on the decision rules, according to the last round of decision results, the status identifier of the previous round of dialogue, the end-side intention results of the current round, and the end-side instruction results of the current round. .
  • if the second instruction can be directly executed, step S705 is executed; if the second instruction cannot be directly executed, step S706 is executed, that is, step S707 is executed after waiting for the first instruction.
  • the device-cloud collaboration module discards the second command.
  • the business execution module executes the second instruction.
  • the service execution module when the service execution module receives the second instruction sent from the device-cloud collaboration module, it executes the second instruction.
  • the content of the second instruction may be "call the weather application to obtain tomorrow's weather conditions and display tomorrow's weather conditions through a card.”
  • when the device-cloud collaboration module 261 determines to execute the second instruction, it may call the weather application through the business execution module 262 to obtain tomorrow's weather conditions and display them through a card.
  • the service execution module can also update the decision result of the voice signal to indicate that a decision result exists. Then, when the device-cloud collaboration module receives the first instruction, it queries the decision result of the voice signal, finds that the voice signal has already been executed, and can discard the first instruction to avoid unnecessary operations. For example, after executing the second instruction, the service execution module can write the decision result of the voice signal into the public queue.
  • the device-cloud collaboration module determines the target command corresponding to the voice signal from the first command and the second command.
  • the third decision information may include the last round of decision results, the status identifier of the previous round of dialogue, the end-side intention result and instruction result of this round, and the cloud-side intention result and instruction result of this round.
  • the third decision information may also include the above-mentioned client side confidence level and cloud side confidence level.
  • the device-cloud collaboration module may determine the second command as the target command and then perform step S707.
  • the business execution module executes the target instruction corresponding to the voice signal.
  • the target instruction is executed by the server, that is to say, the above step S306 is for the vehicle to send the target instruction to the server, so that the server executes the target instruction.
  • the server can control the smart home, and the vehicle can control the equipment in the car.
  • there can be a serious conflict between the server's vertical domains and the vehicle's vertical domains, which easily causes simultaneous-execution problems. For example, when the user says "turn on the air conditioner," the air conditioners in the car and at home may be turned on at the same time.
  • the embodiment of the present application selects the first command and the second command through the vehicle to ensure that only one command is executed.
  • the vehicle receives voice signals through the voice assistant.
  • S802 The vehicle simultaneously sends the voice signal to the terminal ASR and the server through the voice assistant.
  • the vehicle when the vehicle receives the voice signal, it sends the voice signal to the server; at the same time, the vehicle can execute S803 to process the voice signal and determine the first instruction corresponding to the voice signal.
  • the device-cloud collaboration module in the voice assistant sends the voice signal to the ASR module. After the ASR module converts the voice signal into text information, the NLU module obtains the first instruction based on the text information.
  • the vehicle may send the voice signal to the end-side ASR through the voice assistant, and the voice assistant calls the ASR interface to process the voice signal.
  • step S802 For the specific content of step S802, please refer to the relevant description in step S302 above, and will not be described again here.
  • the vehicle determines the first command corresponding to the voice signal through the end-side ASR/NLU/DM.
  • the device-cloud collaboration module in the voice assistant sends the voice signal to the ASR module, NLU module and DM module in sequence to obtain the first instruction.
  • the vehicle sends the first command to the voice assistant.
  • the DM module in the vehicle after obtaining the first instruction, sends the first instruction to the device-cloud collaboration module in the voice assistant.
  • After receiving the voice signal, the server determines the second instruction corresponding to the voice signal.
  • step S805 For the specific content of step S805, please refer to the relevant description in step S304 above, and will not be described again here.
  • the server sends the second command to the vehicle.
  • the DM module in the server stops execution before the second instruction is executed, and sends the second instruction to the vehicle.
  • step S806 For the specific content of step S806, please refer to the relevant description in step S305 above, and will not be described again here.
  • S807 The vehicle determines the target command corresponding to the voice signal from the first command and the second command.
  • the vehicle may determine the target instruction after obtaining the first instruction and the second instruction.
  • the execution notification is used to instruct the server to execute the second instruction
  • the execution notification may include the decision result (i.e., the second instruction); it can also include contextual information, that is, the content of this round of dialogue and the previous N rounds of dialogue.
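The notification payload described above can be sketched as follows: the decision result plus this round and the previous N rounds of dialogue as context. The function name, field names, and the choice of N are illustrative assumptions.

```python
# Sketch of the execution notification sent to the server: it names the
# decision result (the second instruction) and carries this round plus the
# previous N rounds of dialogue as context. All names are assumptions.

def build_execution_notification(second_instruction: dict,
                                 dialogue_rounds: list,
                                 n_context: int = 2) -> dict:
    return {
        "decision_result": second_instruction,
        # the current round plus up to n_context previous rounds
        "context": dialogue_rounds[-(n_context + 1):],
    }

rounds = [
    {"user": "Help me buy a ticket from Shanghai to Beijing"},
    {"user": "How is the weather there?"},
    {"user": "Turn on the air conditioner at home"},
]
notify = build_execution_notification(
    {"source": "server", "action": "turn_on_home_ac"}, rounds, n_context=1)
```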
  • the DM module in the server can load the context and call Hilink to implement home control.
  • the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc.
  • the UART interface is a universal serial data bus used for asynchronous communication.
  • the bus can be a bidirectional communication bus. It converts the data to be transmitted between serial communication and parallel communication.
  • a UART interface is generally used to connect the processor 110 and the wireless communication module 160 .
  • the processor 110 communicates with the Bluetooth module in the wireless communication module 160 through the UART interface to implement the Bluetooth function.
  • the audio module 170 can transmit audio signals to the wireless communication module 160 through the UART interface to implement the function of playing music through a Bluetooth headset.
  • the MIPI interface can be used to connect the processor 110 with peripheral devices such as the display screen 194 and the camera 193 .
  • MIPI interfaces include camera serial interface (CSI), display serial interface (DSI), etc.
  • the processor 110 and the camera 193 communicate through the CSI interface to implement the shooting function of the electronic device 100 .
  • the processor 110 and the display screen 194 communicate through the DSI interface to implement the display function of the electronic device 100 .
  • the GPIO interface can be configured through software.
  • the GPIO interface can be configured as a control signal or as a data signal.
  • the GPIO interface can be used to connect the processor 110 with the camera 193, display screen 194, wireless communication module 160, audio module 170, sensor module 180, etc.
  • the GPIO interface can also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, etc.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • the mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the application processor outputs sound signals through audio devices (not limited to speaker 170A, receiver 170B, etc.), or displays images or videos through display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • the electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the lens, the optical signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, etc. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other format image signals.
  • the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital video.
  • Electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural network (NN) computing processor.
  • Intelligent cognitive applications of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function. Such as saving music, videos, etc. files in external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the electronic device 100 .
  • the internal memory 121 may include a program storage area and a data storage area.
  • the program storage area can store the operating system and at least one application required by a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.).
  • the storage data area can store data created during the use of the electronic device 100 (such as face information template data, fingerprint information templates, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • Speaker 170A, also called a "speaker", is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 170A, or listen to hands-free calls.
  • Receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
  • when the electronic device 100 answers a call or plays a voice message, the voice can be heard by bringing the receiver 170B close to the human ear.
  • Microphone 170C, also called a "mic" or "mike", is used to convert sound signals into electrical signals. When making a call or sending a voice message, the user can speak with the mouth close to the microphone 170C to input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which in addition to collecting sound signals, may also implement a noise reduction function. In other embodiments, the electronic device 100 can also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and implement directional recording functions, etc.
  • the headphone interface 170D is used to connect wired headphones.
  • the headphone interface 170D may be a USB interface 130, or may be a 3.5mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals and can convert the pressure signals into electrical signals.
  • pressure sensor 180A may be disposed on display screen 194 .
  • there are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors.
  • a capacitive pressure sensor may include at least two parallel plates of conductive material.
  • the electronic device 100 determines the intensity of the pressure based on the change in capacitance.
  • the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A.
  • the electronic device 100 may also calculate the touched position based on the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch location but with different touch operation intensities may correspond to different operation instructions. For example: when a touch operation with a touch operation intensity less than the first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold is applied to the short message application icon, an instruction to create a new short message is executed.
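As a sketch, the pressure-threshold dispatch described for the short-message icon might look like the following; the threshold value and operation names are illustrative assumptions:

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # assumed, normalized pressure units

def handle_touch_on_sms_icon(pressure: float) -> str:
    """Map the touch operation intensity on the short-message application
    icon to an operation, as described for pressure sensor 180A."""
    if pressure < FIRST_PRESSURE_THRESHOLD:
        return "view_short_message"       # light press: view the message
    return "create_new_short_message"     # firm press: compose a new one
```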
  • the gyro sensor 180B may be used to determine the motion posture of the electronic device 100 .
  • the gyro sensor 180B may determine the angular velocity of the electronic device 100 about three axes (i.e., the x, y, and z axes).
  • the gyro sensor 180B can be used for image stabilization. For example, when the shutter is pressed, the gyro sensor 180B detects the angle at which the electronic device 100 shakes, calculates the distance that the lens module needs to compensate based on the angle, and allows the lens to offset the shake of the electronic device 100 through reverse movement to achieve anti-shake.
  • the gyro sensor 180B can also be used for navigation and somatosensory game scenes.
  • Air pressure sensor 180C is used to measure air pressure. In some embodiments, the electronic device 100 calculates the altitude through the air pressure value measured by the air pressure sensor 180C to assist positioning and navigation.
  • Magnetic sensor 180D includes a Hall sensor.
  • the electronic device 100 may utilize the magnetic sensor 180D to detect opening and closing of the flip holster.
  • the electronic device 100 may detect the opening and closing of the flip according to the magnetic sensor 180D. Then, based on the detected opening and closing status of the leather case or the opening and closing status of the flip cover, features such as automatic unlocking of the flip cover are set.
  • Proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the electronic device 100 emits infrared light outwardly through the light emitting diode.
  • Electronic device 100 uses photodiodes to detect infrared reflected light from nearby objects. When sufficient reflected light is detected, it can be determined that there is an object near the electronic device 100 . When insufficient reflected light is detected, the electronic device 100 may determine that there is no object near the electronic device 100 .
  • Fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to achieve fingerprint unlocking, access to application locks, fingerprint photography, fingerprint answering of incoming calls, etc.
  • Touch sensor 180K also called “touch panel”.
  • the touch sensor 180K can be disposed on the display screen 194.
  • the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near the touch sensor 180K.
  • the touch sensor can pass the detected touch operation to the application processor to determine the touch event type.
  • Visual output related to the touch operation may be provided through display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display screen 194 .
  • the buttons 190 include a power button, a volume button, etc.
  • Key 190 may be a mechanical key. It can also be a touch button.
  • the electronic device 100 may receive key inputs and generate key signal inputs related to user settings and function control of the electronic device 100 .
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for vibration prompts for incoming calls and can also be used for touch vibration feedback.
  • touch operations for different applications can correspond to different vibration feedback effects.
  • the motor 191 can also respond to different vibration feedback effects for touch operations in different areas of the display screen 194 .
  • Different application scenarios such as time reminders, receiving information, alarm clocks, games, etc.
  • the touch vibration feedback effect can also be customized.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be connected to or separated from the electronic device 100 by inserting it into the SIM card interface 195 or pulling it out from the SIM card interface 195 .
  • the electronic device 100 can support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card, etc. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different.
  • the SIM card interface 195 is also compatible with different types of SIM cards.
  • the SIM card interface 195 is also compatible with external memory cards.
  • the electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communications.
  • FIG. 10 is a software structure block diagram of the electronic device 100 provided by the embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has clear roles and division of labor.
  • the layers communicate through software interfaces.
  • the Android system is divided into four layers, from top to bottom: application layer, application framework layer, Android runtime and system libraries, and kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, and voice assistant.
  • the user can communicate with other devices (such as the server above) through the voice assistant, such as sending voice signals to other devices or obtaining speech recognition results sent by other devices (such as the second instruction above).
  • the display manager is used for system display management and is responsible for the management of all display-related transactions, including creation, destruction, orientation switching, size and status changes, etc.
  • there will be only one default display module on a single device, which is the main display module.
  • the sensor manager is responsible for managing the status of sensors, managing applications to monitor sensor events, and reporting events to applications in real time.
  • the cross-device connection manager is used to establish a communication connection with the terminal device 200 and send voice signals to the terminal device 200 based on the communication connection.
  • the event manager is used for the event management service of the system. It is responsible for receiving the events uploaded by the underlying layer and distributing them to each window, and completing the reception and distribution of events.
  • the task manager is used for the management of task (Activity) components, including startup management, life cycle management, task direction management, etc.
  • a window manager is used to manage window programs.
  • the window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • the window manager is also responsible for window display management, including window display mode, display size, display coordinate position, display level and other related management.
  • the resource manager provides various resources to applications, such as localized strings, icons, pictures, layout files, video files, etc.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media libraries (Media Libraries), 3D graphics processing libraries (for example: OpenGL ES), 2D graphics engines (for example: SGL) and event data, etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • 2D Graphics Engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.


Abstract

A human-machine dialogue method, comprising: a terminal device (200) receives a voice signal; the terminal device (200) obtains a first instruction based on the voice signal; the terminal device (200) sends the voice signal, or text of the voice signal, to a server (100), the server (100) being configured to obtain a second instruction based on the voice signal, where the text of the voice signal is obtained by the terminal device (200) based on the voice signal; the terminal device (200) receives the second instruction sent by the server (100); and when the instruction obtained first cannot be executed directly, the terminal device (200) selects a target instruction from the first instruction and the second instruction and executes it, the target instruction being the first instruction or the second instruction. Because the terminal device (200) can select the target instruction from two candidate instructions, the accuracy of the voice response can be improved.

Description

Human-machine dialogue method, device, and system
This application claims priority to Chinese Patent Application No. 202210663567.1, filed with the China National Intellectual Property Administration on June 13, 2022 and entitled "Human-machine dialogue method, device, and system", which is incorporated herein by reference in its entirety.
Technical field
The embodiments of this application relate to technologies, and in particular to a human-machine dialogue method, device, and system.
Background
A voice assistant is an application provided on most current terminal devices (such as vehicles, smartphones, and smart speakers). Through intelligent dialogue and instant question answering, it solves problems for users or exposes some functions of the terminal device.
Voice assistants are applied in extremely diverse scenarios. For example, by combining artificial intelligence (AI), automatic speech recognition (ASR), and natural language understanding (NLU) over a wireless connection, a voice assistant can not only play multimedia resources, but also connect to Internet-of-Things devices to implement smart home control, in-vehicle voice control, teleconferencing systems, and the like.
At present, the responses of voice assistants are often not sufficiently accurate.
Summary of the invention
This application provides a human-machine dialogue method, device, and system. In the method, a terminal device sends a voice signal, or text of the voice signal, to a server, and the server and the terminal device simultaneously generate a second instruction and a first instruction, respectively, based on the voice signal or the text. When the instruction obtained by the terminal device first cannot be executed, a target instruction is selected from the first instruction and the second instruction and executed. Because the terminal device can choose the target instruction from two candidate instructions, the accuracy of the voice response can be improved.
According to a first aspect, an embodiment of this application provides a human-machine dialogue method, applied to a terminal device. The method includes:
the terminal device receives a voice signal;
the terminal device obtains a first instruction based on the voice signal;
the terminal device sends the voice signal, or text of the voice signal, to a server, where the server is configured to obtain a second instruction based on the voice signal, and the text of the voice signal is obtained by the terminal device based on the voice signal;
the terminal device receives the second instruction sent by the server;
when the instruction obtained first cannot be executed directly, the terminal device selects a target instruction from the first instruction and the second instruction and executes it, where the target instruction is the first instruction or the second instruction.
In this embodiment of this application, the terminal device sends the voice signal or its text to the server, and the server and the terminal device simultaneously generate the second instruction and the first instruction, respectively, based on the voice signal or the text; when the instruction obtained first cannot be executed, the target instruction is selected from the two instructions and executed. Because the terminal device and the server process the voice signal in parallel and the terminal device can then choose between the two resulting instructions, the accuracy of the voice response can be improved.
Here, "an instruction cannot be executed directly" means that the other instruction needs to be considered, and this instruction must not be executed before the other instruction has been taken into account. For example, when an instruction cannot be executed directly, the terminal device needs to wait for the other instruction to arrive and then compare the two instructions and select the target instruction to execute; it must not execute before the other instruction has been obtained or its status determined. For example, when the confidence of the first instruction is below a threshold, the terminal device determines that the first instruction cannot be executed directly and therefore waits for the second instruction; if the confidence of the second instruction is higher than that of the first instruction, the second instruction is executed, and if it is lower, the first instruction is executed. As another example, after determining that the first instruction cannot be executed directly, the terminal device waits for the second instruction; if the second instruction has still not been obtained after a preset time, the status of the second instruction can be judged abnormal, and the first instruction is then executed.
"An instruction can be executed directly" means that the instruction can be executed without considering the other instruction. For example, if the terminal device obtains the first instruction first and determines that it can be executed directly, the terminal device can execute the first instruction immediately without waiting for the second instruction. Note that when "this instruction" above is the first instruction, "the other instruction" is the second instruction, and vice versa.
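The wait-compare-timeout behavior described in these examples can be sketched as follows; the confidence threshold, timeout value, and field names are illustrative assumptions, not the claimed decision rules:

```python
import queue

def decide(first: dict, second_queue: "queue.Queue",
           threshold: float = 0.8, timeout_s: float = 1.5) -> dict:
    """If the locally obtained first instruction is confident enough, it is
    directly executable; otherwise wait for the server's second instruction,
    falling back to the first instruction when the server times out."""
    if first["confidence"] >= threshold:
        return first                        # directly executable, no waiting
    try:
        second = second_queue.get(timeout=timeout_s)
    except queue.Empty:
        return first                        # second instruction abnormal/late
    # Both instructions available: pick the higher-confidence one.
    return second if second["confidence"] > first["confidence"] else first
```

For instance, a low-confidence local result would yield to a higher-confidence server result arriving on the queue, but would still be executed if the server never answers within the preset time.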
In a possible implementation, when the instruction obtained first cannot be executed directly, the terminal device selecting and executing a target instruction from the first instruction and the second instruction includes:
when the target instruction is the instruction obtained first and that instruction cannot be executed, not executing it;
when the target instruction is the instruction obtained first and that instruction can be executed, executing it.
With reference to the first aspect, in a possible implementation, the method further includes:
when the instruction obtained first can be executed directly, the terminal device executes it;
the terminal device does not execute the instruction obtained later.
In this embodiment of this application, the terminal device sends the voice signal or its text to the server, and the server and the terminal device simultaneously generate the second instruction and the first instruction, respectively, based on the voice signal or the text; when the instruction obtained first can be executed, that instruction is executed first. This also improves the efficiency of the voice response.
With reference to the first aspect, in a possible implementation, before the target instruction is selected from the first instruction and the second instruction and executed, the method further includes:
when the terminal device obtains the first instruction, if the second instruction has not been received, determining the executability of the first instruction, where the executability is either "directly executable" or "not directly executable";
when the terminal device determines that the first instruction is not directly executable, waiting for the second instruction.
By determining the executability of the first instruction, the method can wait for the second instruction when the first instruction is inaccurate or cannot be executed, ensuring the accuracy of the response.
With reference to the first aspect, in a possible implementation, determining the executability of the first instruction includes:
the terminal device determining the executability of the first instruction based on first decision information, where the first decision information includes at least one of: the previous round's decision result, the state identifier of the previous round of dialogue, the intent result of the current round determined by the terminal device, the instruction result of the current round determined by the terminal device, and the confidence of the first instruction.
The previous round's decision result indicates the source of the instruction executed for the previous round of dialogue, where the source is the server or the terminal device; the state identifier indicates single-round or multi-round; the intent result indicates a new intent or a multi-round intent; the instruction result is normal or abnormal, where abnormal indicates that the operation corresponding to the voice signal cannot be performed.
With reference to the first aspect, in a possible implementation, before the target instruction is selected from the first instruction and the second instruction and executed, the method further includes:
when the terminal device obtains the second instruction, if the first instruction has not been received, determining the executability of the second instruction;
when the terminal device determines that the second instruction is not directly executable, waiting for the first instruction.
By determining the executability of the second instruction, the method can wait for the first instruction when the second instruction is inaccurate or cannot be executed, ensuring the accuracy of the response.
With reference to the first aspect, in a possible implementation, determining the executability of the second instruction includes:
the terminal device determining the executability of the second instruction based on second decision information, where the second decision information includes at least one of: the previous round's decision result, the state identifier of the previous round of dialogue, the intent result of the current round determined by the server, and the instruction result of the current round determined by the server;
the previous round's decision result indicates the source of the instruction executed for the previous round of dialogue, where the source is the server or the terminal device; the state identifier indicates single-round or multi-round; the intent result indicates a new intent or a multi-round intent; the instruction result is normal or abnormal, where abnormal indicates that the operation corresponding to the voice signal cannot be performed.
With reference to the first aspect, in a possible implementation, selecting the target instruction from the first instruction and the second instruction for execution includes:
the terminal device determining the target instruction from the first instruction and the second instruction based on third decision information;
the third decision information includes at least one of: the previous round's decision result, the state identifier of the previous round of dialogue, the intent result of the current round determined by the server, the instruction result of the current round determined by the server, the intent result of the current round determined by the terminal device, the instruction result of the current round determined by the terminal device, and the confidence of the first instruction;
the previous round's decision result indicates the source of the instruction executed for the previous round of dialogue, where the source is the server or the terminal device; the state identifier indicates single-round or multi-round; the intent result indicates a new intent or a multi-round intent; the instruction result is normal or abnormal, where abnormal indicates that the operation corresponding to the voice signal cannot be performed.
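One way to picture a decision over the third decision information is a small rule cascade; the rule ordering and the confidence threshold below are illustrative assumptions, not the claimed decision rules:

```python
def select_target(third_info: dict) -> str:
    """Pick 'first' (device instruction) or 'second' (server instruction)
    from the third decision information; the rule order is an assumption."""
    # An abnormal instruction result disqualifies that side outright.
    if third_info.get("server_instruction_result") == "abnormal":
        return "first"
    if third_info.get("device_instruction_result") == "abnormal":
        return "second"
    # In a multi-round dialogue, stay with the side that answered last round.
    if third_info.get("state") == "multi_round":
        return "first" if third_info.get("last_decision") == "device" else "second"
    # Otherwise fall back on the first instruction's confidence.
    return "first" if third_info.get("first_confidence", 0.0) >= 0.8 else "second"
```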
With reference to the first aspect, in a possible implementation, the method further includes:
the terminal device sending a dialogue state to the server, where the dialogue state is used by the server to generate the second instruction; the dialogue state includes at least one of the previous round's decision result and the state identifier of the previous round of dialogue.
In this method, the server can generate the second instruction based on the previous round's dialogue information (the decision result, the state identifier, and so on), improving the accuracy of the second instruction.
With reference to the first aspect, in a possible implementation, the terminal device obtaining the first instruction based on the voice signal includes:
the terminal device obtaining an intent based on the voice signal;
the terminal device determining the first instruction based on the intent.
With reference to the first aspect, in a possible implementation, the terminal device obtaining the first instruction based on the voice signal includes:
inputting the voice signal into an on-device model to obtain the intent corresponding to the first instruction;
the terminal device obtaining the first instruction based on the intent corresponding to the first instruction.
With reference to the first aspect, in a possible implementation, the method further includes:
the terminal device training the on-device model using the voice signal and the intent corresponding to the second instruction as a training sample.
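The training step can be pictured as a distillation-style loop in which the server's intent labels the on-device sample; the feature extractor and the nearest-neighbour "model" below are deliberately simplistic placeholders, not the patent's model design:

```python
import random

INTENTS = ["navigation", "music", "home_control"]

def featurize(voice_signal: bytes) -> list:
    """Placeholder acoustic features: the first few bytes, normalized."""
    return [b / 255 for b in voice_signal[:8]]

class TinyIntentModel:
    def __init__(self):
        self.samples = []                  # (features, intent) training pairs

    def train(self, features: list, server_intent: str) -> None:
        """Use the server-derived intent as the label for this voice sample."""
        self.samples.append((features, server_intent))

    def predict(self, features: list) -> str:
        # Nearest-neighbour lookup as a stand-in for a real classifier.
        if not self.samples:
            return random.choice(INTENTS)
        best = min(self.samples,
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], features)))
        return best[1]

model = TinyIntentModel()
model.train(featurize(b"navigate to the pagoda"), "navigation")
```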
In this embodiment of this application, the terminal device recognizes the intent of the voice signal based on the on-device model; the terminal device can train the on-device model using the server's results, improving the terminal device's ability to generate voice instructions.
With reference to the first aspect, in a possible implementation, selecting the target instruction from the first instruction and the second instruction for execution includes:
the terminal device executing the target instruction; or the terminal device sending the target instruction to the server, where the server executes the target instruction.
According to a second aspect, an embodiment of this application provides a human-machine dialogue method, applied to a server. The method includes:
the server receives, from a terminal device, a voice signal or text of the voice signal, where the text of the voice signal is obtained by the terminal device based on the voice signal;
the server obtains a second instruction based on the voice signal;
the server sends the second instruction to the terminal device;
the server receives a target instruction from the terminal device, where the target instruction is obtained by the terminal device based on a first instruction and the second instruction, the target instruction is the first instruction or the second instruction, and the first instruction is obtained by the terminal device based on the voice signal;
the server executes the target instruction.
According to a third aspect, this application provides an electronic device. The electronic device may include a memory and a processor. The memory may be configured to store a computer program, and the processor may be configured to invoke the computer program, so that the electronic device performs the method of the first aspect or any possible implementation of the first aspect.
According to a fourth aspect, this application provides an electronic device. The electronic device may include a memory and a processor. The memory may be configured to store a computer program, and the processor may be configured to invoke the computer program, so that the electronic device performs the method of the second aspect or any possible implementation of the second aspect.
According to a fifth aspect, this application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to perform the method of the first aspect or any possible implementation of the first aspect.
According to a sixth aspect, this application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to perform the method of the second aspect or any possible implementation of the second aspect.
According to a seventh aspect, this application provides a computer-readable storage medium comprising instructions that, when run on an electronic device, cause the electronic device to perform the method of the first aspect or any possible implementation of the first aspect.
According to an eighth aspect, this application provides a computer-readable storage medium comprising instructions that, when run on an electronic device, cause the electronic device to perform the method of the second aspect or any possible implementation of the second aspect.
According to a ninth aspect, an embodiment of this application provides a human-machine dialogue system, including a terminal device and a server, where the terminal device is the electronic device described in the third aspect and the server is the electronic device described in the fourth aspect.
It can be understood that the electronic devices provided in the third and fourth aspects, the computer program products provided in the fifth and sixth aspects, and the computer-readable storage media provided in the seventh and eighth aspects are all used to perform the methods provided in the embodiments of this application. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods, which are not repeated here.
Brief description of the drawings
FIG. 1 is a system architecture diagram of a human-machine dialogue system provided by an embodiment of this application;
FIG. 2 is a functional block diagram of a vehicle 002 provided by an embodiment of this application;
FIG. 3 is a flowchart of a human-machine dialogue method provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of a decision method provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of a decision method provided by an embodiment of this application;
FIG. 6 is a schematic flowchart of a decision method used when a vehicle obtains the first instruction, provided by an embodiment of this application;
FIG. 7 is a schematic flowchart of a decision method used when a vehicle receives the second instruction from a server, provided by an embodiment of this application;
FIG. 8 is another human-machine dialogue method provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a hardware structure of a terminal device 200 provided by an embodiment of this application;
FIG. 10 is a software structure block diagram of an electronic device 100 provided by an embodiment of this application.
Detailed description of embodiments
The technical solutions in the embodiments of this application are described clearly and thoroughly below with reference to the accompanying drawings. In the descriptions of the embodiments of this application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" in the text merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, in the descriptions of the embodiments of this application, "multiple" means two or more than two.
The terms "first" and "second" below are used for descriptive purposes only and shall not be understood as implying relative importance or implicitly indicating the number of the technical features indicated. Therefore, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the descriptions of the embodiments of this application, unless otherwise specified, "multiple" means two or more.
The term "user interface (UI)" in the following embodiments of this application refers to a medium interface for interaction and information exchange between an application or the operating system and a user; it converts between the internal form of information and a form acceptable to the user. A user interface is source code written in a specific computer language such as Java or the extensible markup language (XML); the interface source code is parsed and rendered on an electronic device and finally presented as content the user can recognize. A common form of user interface is the graphical user interface (GUI), which is a user interface displayed graphically and related to computer operations. It may consist of visible interface elements such as text, icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets displayed on the display screen of the electronic device.
To describe the human-machine dialogue method provided by the embodiments of this application more clearly and in more detail, the human-machine dialogue system provided by the embodiments of this application is introduced first.
Refer to FIG. 1, which is a system architecture diagram of a human-machine dialogue system provided by an embodiment of this application.
As shown in FIG. 1, the human-machine dialogue system 10 may include a server 100 and a terminal device 200. The server 100 may be a server deployed in the cloud, and the terminal device 200 may be a vehicle (FIG. 1 shows an example in which the terminal device 200 is a vehicle).
The server 100 is a server with a speech recognition capability. The server 100 in the embodiments of this application may be implemented as an independent server or as a server cluster composed of multiple servers.
The terminal device 200 may include, but is not limited to, a vehicle, a tablet, a display, a television, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer, a netbook, an augmented-reality device, a virtual-reality device, an artificial-intelligence device, an in-vehicle device, a smart home device, and the like.
For example, the terminal device 200 in the embodiments of this application may be the vehicle 002 shown in FIG. 2. The vehicle 002 is a car that perceives the road environment through an on-board sensing system, automatically plans a driving route, and controls the vehicle to reach a predetermined target. An intelligent vehicle integrates technologies such as computers, modern sensing, information fusion, communications, artificial intelligence, and automatic control; it is a high-tech complex that integrates functions such as environment perception, planning and decision-making, and multi-level assisted driving. The intelligent vehicle in this application may be an intelligent vehicle equipped with an intelligent driver based mainly on a computer system, where the intelligent driver is used to enable driverless operation of the vehicle; it may also be an intelligent vehicle with an assisted-driving system or a fully automatic driving system, or a wheeled mobile robot, among others.
In some embodiments, the terminal device 200 can, through a voice assistant, perform speech recognition (ASR) on a received voice signal to obtain the corresponding text information; then perform semantic recognition on the text information through natural language understanding to obtain the recognized intent; and, through dialogue management (Dialog Manager, DM), obtain a task according to the input intent, determine the information the task requires, and then interface with a service platform to complete the task, or request further voice instruction input, or obtain the business logic of the corresponding task from the service platform, and finally return the execution result to the user.
The voice assistant may be an embedded application in the terminal device 200 (i.e., a system application of the terminal device 200) or a downloadable application. An embedded application is an application provided as part of the implementation of the terminal device 200 (such as a vehicle); a downloadable application is an application that can provide its own Internet Protocol Multimedia Subsystem (IMS) connection. A downloadable application may be pre-installed in the terminal device 200 or may be a third-party application downloaded and installed by the user.
For the specific implementation of how the server 100 and the terminal device 200 determine voice instructions (such as the first instruction and the second instruction) in the embodiments provided by this application, refer to the related descriptions of subsequent embodiments; details are not repeated here.
It can be understood that the human-machine dialogue system architecture in FIG. 1 is only an exemplary implementation in the embodiments of this application; the human-machine dialogue system architectures in the embodiments of this application include, but are not limited to, the above architecture.
Based on the above human-machine dialogue system architecture, an embodiment of this application provides a vehicle 002 applied in the architecture. Refer to FIG. 2, which is a functional block diagram of the vehicle 002 provided by an embodiment of this application.
In one embodiment, the vehicle 002 may be configured in a fully or partially autonomous driving mode. For example, the vehicle 002 can control itself while in the autonomous driving mode, and can, with human operation, determine the current state of the vehicle and its surroundings, determine a possible behavior of at least one other vehicle in the surroundings, determine a confidence level corresponding to the likelihood that the other vehicle performs the possible behavior, and control the vehicle 002 based on the determined information. While in the autonomous driving mode, the vehicle 002 may be set to operate without human interaction.
The vehicle 002 may include various subsystems, such as a travel system 202, a sensor system 204, a control system 206, one or more peripheral devices 208, a voice system 260, a power supply 210, a computer system 212, and a user interface 216. Optionally, the vehicle 002 may include more or fewer subsystems, and each subsystem may include multiple elements. In addition, each subsystem and element of the vehicle 002 may be interconnected by wire or wirelessly.
The voice system 260 may include components that implement speech recognition for the vehicle 002. In one embodiment, the voice system 260 may include a device-cloud coordination module 261, a service execution module 262, a speech recognition (ASR) module 263, a natural language understanding (NLU) module 264, and a dialogue management (DM) module 265.
The device-cloud coordination module 261 can perform configuration management, for example updating a configuration file, where the configuration file may include decision rules; perform device-cloud decision-making, for example determining the target instruction from the instructions generated by the server and by the vehicle based on the decision rules; perform device-cloud distribution, for example sending a received voice signal to the server and the speech recognition module 263 simultaneously; and perform timeout management, for example executing the instruction generated by the vehicle 002 when no instruction has been received from the server within a preset time.
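A minimal sketch of the device-cloud distribution and timeout management described for module 261, in Python; the function names, latencies, and fallback rule are illustrative assumptions, not the patent's implementation:

```python
import concurrent.futures
import time

def local_asr(signal: bytes) -> dict:
    """Stand-in for the on-device ASR/NLU path (module 263/264)."""
    time.sleep(0.01)
    return {"source": "device", "instruction": "open_window"}

def cloud_asr(signal: bytes) -> dict:
    """Stand-in for the server round trip."""
    time.sleep(0.05)
    return {"source": "server", "instruction": "open_window"}

def distribute(signal: bytes, timeout_s: float = 0.2) -> dict:
    """Fan the voice signal out to the local ASR and the server at the same
    time (device-cloud distribution); if the server misses the deadline,
    keep only the device result (timeout management)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        local = pool.submit(local_asr, signal)
        cloud = pool.submit(cloud_asr, signal)
        try:
            return {"device": local.result(),
                    "server": cloud.result(timeout=timeout_s)}
        except concurrent.futures.TimeoutError:
            return {"device": local.result(), "server": None}
```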
The speech recognition module 263 is used to convert the voice signal into text information.
The natural language understanding module 264 mainly processes the sentence input by the user or the speech recognition result, extracting the user's dialogue intent and the information the user conveys. For example, if the user says "I want to eat yangrou paomo", the NLU module can recognize that the user's intent is "find a restaurant" and that the key entity is "yangrou paomo". An entity is a word in the text with a specific meaning or strong referential value, typically including person names, place names, organization names, dates and times, proper nouns, and so on. For example, if "yangrou paomo" is not among the vehicle's preset entities, the vehicle cannot recognize it as a kind of food when recognizing the voice signal.
The main role of the dialogue management module 265 is to control the human-machine dialogue process according to the intent, slots, key entities, and other results output by the NLU module, update the system state, and generate corresponding system actions. Specifically, the DM module can maintain the current dialogue state according to the dialogue history, where the dialogue state may include the accumulated intents, slots, and key entity sets of the whole dialogue history, and can output the next dialogue action according to the current dialogue state. For example, based on the text derived from the speech, the vehicle can generate a response to that text.
For example, after the vehicle 002 receives a voice signal, the speech recognition module first converts the voice signal into the corresponding text information. Then the speech understanding module in the vehicle 002 can use an NLU algorithm to extract the user intent and slot information from the text. For example, the user intent in the voice signal is "navigation" and the slot information is "the Giant Wild Goose Pagoda". The dialogue management module can then request the corresponding service content from the server of a relevant third-party application based on the extracted user intent and slot information. For example, the dialogue management module can request a navigation service with the Giant Wild Goose Pagoda as the destination from the server of the Baidu Maps app. The server of the Baidu Maps app can then send the navigation route to the vehicle 002, and the voice assistant app in the vehicle 002 can display the route in the dialogue interface in the form of a card or similar, completing the response to this voice signal.
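The intent/slot extraction performed by the NLU module can be illustrated with a toy rule-based sketch; a real NLU module would use learned models, and the patterns and intent names here are assumptions for illustration:

```python
import re

# Toy slot patterns standing in for the NLU module's intent/slot extraction.
RULES = [
    ("navigation", re.compile(r"navigate to (?P<destination>.+)")),
    ("find_restaurant", re.compile(r"i want to eat (?P<food>.+)")),
]

def understand(text: str):
    """Return (intent, slots) for a recognized utterance, as the NLU
    module 264 would; unmatched utterances fall through as unknown."""
    for intent, pattern in RULES:
        m = pattern.search(text.lower())
        if m:
            return intent, m.groupdict()
    return "unknown", {}
```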
In addition, the voice system 260 may further include a text-to-speech (TTS) module, a smart storage service (IDS) module, a full-scenario brain module, and the like. The TTS module can process a received voice stream and extract the voice characteristics of the corresponding user, storing them in a voice library; for example, it can recognize multiple timbres as well as children's voices. The IDS module can store the dialogue state and other data.
行进系统202可包括为车辆002提供动力运动的组件。在一个实施例中,行进系统202可包括引擎218、能量源219、传动装置220和车轮221。引擎218可以是内燃引擎、电动机、空气压缩引擎或其他类型的引擎组合,例如气油发动机和电动机组成的混动引擎,内燃引擎和空气压缩引擎组成的混动引擎。引擎218将能量源219转换成机械能量。
能量源219的示例包括汽油、柴油、其他基于石油的燃料、丙烷、其他基于压缩气体的燃料、乙醇、太阳能电池板、电池和其他电力来源。能量源219也可以为车辆002的其他系统提供能量。
传动装置220可以将来自引擎218的机械动力传送到车轮221。传动装置220可包括变速箱、差速器和驱动轴。在一个实施例中,传动装置220还可以包括其他器件,比如离合器。其中,驱动轴可包括可耦合到一个或多个车轮221的一个或多个轴。
传感器系统204可包括感测关于车辆002周边的环境的信息的若干个传感器。例如,传感器系统204可包括全球定位系统222(全球定位系统可以是GPS系统,也可以是北斗系统或者其他定位系统)、惯性测量单元(inertial measurement unit,IMU)224、雷达226、激光测距仪228以及相机230。传感器系统204还可包括被监视车辆002的内部系统的传感器(例如,车内空气质量监测器、燃油量表、机油温度表等)。来自这些传感器中的一个或多个的传感器数据可用于检测对象及其相应特性(位置、形状、方向、速度等)。这种检测和识别是自主车辆002的安全操作的关键功能。
全球定位系统222可用于估计车辆002的地理位置。IMU 224用于基于惯性加速度来感测车辆002的位置和朝向变化。在一个实施例中,IMU 224可以是加速度计和陀螺仪的组合。例如:IMU 224可以用于测量车辆002的曲率。
雷达226可利用无线电信号来感测车辆002的周边环境内的物体。在一些实施例中,除了感测物体以外,雷达226还可用于感测物体的速度和/或前进方向。
激光测距仪228可利用激光来感测车辆002所位于的环境中的物体。在一些实施例中,激光测距仪228可包括一个或多个激光源、激光扫描器以及一个或多个检测器,以及其他系统组件。
相机230可用于捕捉车辆002的周边环境的多个图像。相机230可以是静态相机或视频相机,也可以是可见光相机或红外相机,可以是任一用来获取图像的相机,本申请实施例对此不作限定。
本申请实施例中,相机230可以安装在车辆002的前侧、后侧以及左右两侧,相机230可以是通过旋转以调节拍摄角度的相机。另外,本申请实施例中的相机也可以通过伸缩杆安装在智能车辆上的任何位置,当需要获取图像时,伸缩杆伸展,以获取图像;当不需要获取图像时,伸缩杆收缩。本申请实施例中,相机230可以根据第一车辆接收的第二工作指令的指示下开启和关闭,并按照第二工作指令中携带的拍摄角度进行拍摄。
控制系统206为控制车辆002及其组件的操作。控制系统206可包括各种元件,其中包括转向单元232、油门234、制动单元236、传感器融合算法单元238、计算机视觉系统240、路线控制系统242以及障碍物避免系统244。
转向单元232可操作来调整车辆002的前进方向。例如在一个实施例中可以为方向盘系统。
油门234用于控制引擎218的操作速度并进而控制车辆002的速度。
制动单元236用于控制车辆002减速。制动单元236可使用摩擦力来减慢车轮221。在其他实施例中,制动单元236可将车轮221的动能转换为电流。制动单元236也可采取其他形式来减慢车轮221转速从而控制车辆002的速度。
计算机视觉系统240可以操作来处理和分析由相机230捕捉的图像以便识别车辆002周边环境中的物体和/或特征。所述物体和/或特征可包括交通信号、道路边界和障碍物。计算机视觉系统240可使用物体识别算法、运动中恢复结构(Structure from Motion,SFM)算法、视频跟踪和其他计算机视觉技术。在一些实施例中,计算机视觉系统240可以用于为环境绘制地图、跟踪物体、估计物体的速度等等。
路线控制系统242用于确定车辆002的行驶路线。在一些实施例中,路线控制系统242可结合来自传感器融合算法单元238、GPS 222和一个或多个预定地图的数据以为车辆002确定行驶路线。
障碍物避免系统244用于识别、评估和避免或者以其他方式越过车辆002的环境中的潜在障碍物。
当然,在一个实例中,控制系统206可以增加或替换地包括除了所示出和描述的那些以外的组件。或者也可以减少一部分上述示出的组件。
车辆002通过外围设备208与外部传感器、其他车辆、其他计算机系统或用户之间进行交互。外围设备208可包括无线通信系统246、车载电脑248、麦克风250和/或扬声器252。
在一些实施例中,外围设备208提供车辆002的用户与用户接口216交互的手段。例如,车载电脑248可向车辆002的用户提供信息。用户接口216还可操作车载电脑248来接收用户的输入。车载电脑248可以通过触摸屏进行操作。在其他情况中,外围设备208可提供用于车辆002与位于车内的其它设备通信的手段。例如,麦克风250可从车辆002的用户接收音频(例如,语音命令或其他音频输入)。类似地,扬声器252可向车辆002的用户输出音频。
无线通信系统246可以直接地或者经由通信网络来与一个或多个设备无线通信。例如,无线通信系统246可以使用3G蜂窝通信,例如码分多址(code division multiple access,CDMA)、EVDO、全球移动通信系统(global system for mobile communications,GSM)/通用分组无线服务技术(general packet radio service,GPRS),或者4G蜂窝通信,例如长期演进(long term evolution,LTE),或者5G蜂窝通信,或者新无线(new radio,NR)系统,或者未来通信系统等。无线通信系统246可利用WiFi与无线局域网(wireless local area network,WLAN)通信。在一些实施例中,无线通信系统246可利用红外链路、蓝牙或无线个域网(ZigBee)与设备直接通信。无线通信系统246还可以使用其他无线协议,例如各种车辆通信系统:无线通信系统246可包括一个或多个专用短程通信(dedicated short range communications,DSRC)设备,这些设备可包括车辆和/或路边台站之间的公共和/或私有数据通信。
电源210可向车辆002的各种组件提供电力。在一个实施例中,电源210可以为可再充电锂离子或铅酸电池。这种电池的一个或多个电池组可被配置为电源为车辆002的各种组件提供电力。在一些实施例中,电源210和能量源219可一起实现,例如一些全电动车中那样。
车辆002的部分或所有功能受计算机系统212控制。计算机系统212可包括至少一个处理器213,处理器213执行存储在例如存储器214这样的非暂态计算机可读介质中的指令215。计算机系统212还可以是采用分布式方式控制车辆002的个体组件或子系统的多个计算设备。
处理器213可以是任何常规的处理器,诸如商业可获得的CPU。替选地,该处理器可以是诸如ASIC或其它基于硬件的处理器的专用设备。尽管图2功能性地图示了处理器、存储器、和在相同块中的计算机的其它元件,但是本领域的普通技术人员应该理解该处理器、计算机、或存储器实际上可以包括可以或者可以不存储在相同的物理外壳内的多个处理器、计算机、或存储器。例如,存储器可以是硬盘驱动器或位于不同于计算机的外壳内的其它存储介质。因此,对处理器或计算机的引用将被理解为包括对可以或者可以不并行操作的处理器或计算机或存储器的集合的引用。不同于使用单一的处理器来执行此处所描述的步骤,诸如转向组件和减速组件的一些组件每个都可以具有其自己的处理器,所述处理器只执行与特定于组件的功能相关的计算。
在此处所描述的各个方面中,处理器可以位于远离该车辆并且与该车辆进行无线通信。在其它方面中,此处所描述的过程中的一些在布置于车辆内的处理器上执行而其它则由远程处理器执行,包括采取执行单一操纵的必要步骤。
在一些实施例中,存储器214可包含指令215(例如,程序逻辑),指令215可被处理器213执行来执行车辆002的各种功能,包括以上描述的那些功能。存储器214也可包含额外的指令,包括向行进系统202、传感器系统204、控制系统206和外围设备208中的一个或多个发送数据、从其接收数据、与其交互和/或对其进行控制的指令。
除了指令215以外,存储器214还可存储数据,例如道路地图、路线信息,车辆的位置、方向、速度以及其它这样的车辆数据,以及其他信息。这种信息可在车辆002在自主、半自主和/或手动模式中操作期间被车辆002和计算机系统212使用。例如:可以根据目标路段的道路信息,以及接收的车辆速度范围和车辆曲率范围,对车辆的当前速度和当前曲率进行微调,以使智能车辆的速度和曲率在车辆速度范围和车辆曲率范围内。
用户接口216,用于向车辆002的用户提供信息或从其接收信息。可选地,用户接口216可包括在外围设备208的集合内的一个或多个输入/输出设备,例如无线通信系统246、车载电脑248、麦克风250和扬声器252。
计算机系统212可基于从各种子系统(例如,行进系统202、传感器系统204和控制系统206)以及从用户接口216接收的输入来控制车辆002的功能。例如,计算机系统212可利用 来自控制系统206的输入以便控制转向单元232来避免由传感器系统204和障碍物避免系统244检测到的障碍物。在一些实施例中,计算机系统212可操作来对车辆002及其子系统的许多方面提供控制。
可选地,上述这些组件中的一个或多个可与车辆002分开安装或关联。例如,存储器214可以部分或完全地与车辆002分开存在。上述组件可以按有线和/或无线方式来通信地耦合在一起。
可选地,上述组件只是一个示例,实际应用中,上述各个模块中的组件有可能根据实际需要增添或者删除,图2不应理解为对本申请实施例的限制。
在道路行进的自动驾驶汽车,如上面的车辆002,可以识别其周围环境内的物体以确定对当前速度的调整。所述物体可以是其它车辆、交通控制设备、或者其它类型的物体。在一些示例中,可以独立地考虑每个识别的物体,并且基于物体的各自的特性,诸如它的当前速度、加速度、与车辆的间距等,可以用来确定自动驾驶汽车所要调整的速度。
可选地,自动驾驶汽车车辆002或者与自动驾驶车辆002相关联的计算设备(如图2的计算机系统212、计算机视觉系统240、存储器214)可以基于所识别的物体的特性和周围环境的状态(例如,交通、雨、道路上的冰、等等)来预测所述识别的物体的行为。可选地,每一个所识别的物体都依赖于彼此的行为,因此还可以将所识别的所有物体全部一起考虑来预测单个识别的物体的行为。车辆002能够基于预测的所述识别的物体的行为来调整它的速度。在这个过程中,也可以考虑其它因素来确定车辆002的速度,诸如,车辆002在行驶的道路中的横向位置、道路的曲率、静态和动态物体的接近度等等。
除了提供调整自动驾驶汽车的速度的指令之外,计算设备还可以提供修改车辆002的转向角的指令,以使得自动驾驶汽车遵循给定的轨迹和/或维持与自动驾驶汽车附近的物体(例如,道路上的相邻车道中的轿车)的安全横向和纵向距离。
上述车辆002可以为轿车、卡车、摩托车、公共汽车、船、飞机、直升飞机、割草机、娱乐车、游乐场车辆、施工设备、电车、高尔夫球车、火车、和手推车等,本申请实施例不做特别的限定。
可以理解的是,图2中的智能车辆功能图只是本申请实施例中的一种示例性的实施方式,本申请实施例中的智能车辆包括但不仅限于以上结构。
请参考图3,图3为本申请实施例提供的一种人机对话方法的流程图,该方法可应用于上述图1所述的人机对话系统中,其中的服务器100可以用于支持并执行图3中所示的方法流程步骤S304-步骤S305,终端设备200可以用于支持并执行图3中所示的方法流程步骤S301至S303以及步骤S306。该方法可以包括以下部分或全部步骤。
图3示例性示出了本申请实施例提供的人机对话方法流程。该人机对话方法可以包括以下部分或全部步骤:
S301、车辆通过语音助手接收语音信号。
其中,上述语音助手可以是安装在车辆中的应用程序(Application,APP),也可以是集成在车辆的操作系统中的系统功能。该语音助手可以是车辆中嵌入式应用程序(即车辆的系统应用)或者可下载应用程序。其中,嵌入式应用程序是作为车辆实现的一部分提供的应用程序。例如,嵌入式应用程序可以为“设置”应用、“短消息”应用和“相机”应用等。可下载应用程序是一个可以提供自己的因特网协议多媒体子系统(Internet Protocol Multimedia  Subsystem,IMS)连接的应用程序,该可下载应用程序可以预先安装在终端中的应用或可以由用户下载并安装在终端中的第三方应用。
在一些实施例中,语音助手被启动后,可以通过麦克风接收用户输入的语音命令,触发车辆执行该语音命令对应的操作。本申请实施例中的语音信号即为语音命令。例如用户在车辆内唤醒车辆的语音助手后,用户可以说“打开空调”,则“打开空调”为语音命令,语音助手在接收上述语音命令后,可以打开车内的空调。
S302、车辆向服务器发送该语音信号。
在一些实施例中,车辆在接收到语音信号时,向服务器发送该语音信号;同时,车辆可以执行S303,确定该语音信号对应的第一指令。
可选地,车辆还可以向服务器发送对话状态,该对话状态用于服务器得到第二指令。其中,对话状态可以包括本轮对话前N轮对话信息和N轮决策结果,还可以包括本轮待填的槽位,N为正整数。其中,某一轮的决策结果用于指示该轮语音信号对应的操作是采用服务器生成的指令还是车辆生成的指令。例如,对话状态可以包括最近多轮对话意图、槽位(key)、槽位值(value)、本轮待填的槽位和用户问题等信息。可以理解的,车辆向服务器发送对话状态,服务器可以结合前几轮的决策结果生成本轮的语音指令。需要说明的是,由于是由车辆进行决策,云端并不知道上一轮的决策结果,若不将决策结果上传至服务器,可能对服务器进行本轮决策造成误差。例如,上一轮的对话为,用户:帮我买一张从上海去北京的机票;本轮对话为,用户:那儿的天气怎么样?若服务器侧没有上一轮的对话信息,则服务器无法判别本轮对话中用户说的“那儿”所指示的地址。
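车辆上传给服务器的对话状态(前 N 轮对话信息、各轮决策结果与本轮待填的槽位)可以示意为如下结构;字段名均为说明用的假设:

```python
import json


def build_dialog_state(history, pending_slots):
    """history 为前 N 轮对话信息,每轮含意图、槽位与决策结果(来源:车辆或服务器)。"""
    return {
        "rounds": [
            {
                "intent": r["intent"],
                "slots": r["slots"],
                "decision": r["decision"],  # "device" 或 "server",即该轮采用的指令来源
            }
            for r in history
        ],
        "pending_slots": pending_slots,  # 本轮待填的槽位
    }


state = build_dialog_state(
    history=[
        {"intent": "订机票", "slots": {"出发地": "上海", "目的地": "北京"}, "decision": "server"},
        {"intent": "查天气", "slots": {"地点": "北京"}, "decision": "device"},
    ],
    pending_slots=["日期"],
)
payload = json.dumps(state, ensure_ascii=False)  # 上传服务器的序列化载荷
```

有了前几轮的槽位与决策结果,服务器即可消解"那儿的天气怎么样"中"那儿"这样的指代。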
在一种实现中,车辆设置有语音助手,语音助手中的端云协同模块在接收到该语音信号后,可以将该语音信号同时发送至服务器和语音助手中的ASR模块,或者,端云协同模块在接收到该语音信号后,将该语音信号发送至服务器的同时调用ASR接口对该语音信号进行文本转换。
S303、车辆基于语音信号,得到该语音信号对应的第一指令。
在一些实施例中,车辆先将该语音信号转换为文本信息;进而,对该文本信息进行语义识别,得到语义识别后的意图;最后,基于意图,得到第一指令。其中,第一指令用于指示该语音信号对应的操作。
例如,车辆接收到语音信号后,可使用语音识别(ASR)算法将语音信号转为对应的文本信息,得到“明天的天气怎么样”;进而,车辆可使用NLU算法从语音信号的文本信息中提取用户意图,即车辆可从“明天的天气怎么样”中提取到语音信号对应的用户意图为:查询明天的天气;进而,车辆可以基于该用户意图,生成该语音信号对应的第一指令,该第一指令的内容可以为“调用天气应用获取明天的天气情况并通过卡片显示明天的天气情况”。
可选地,车辆还可以基于该语音信号,确定本轮的第一对话状态。其中,第一对话状态可以包括用于指示本轮对话为多轮或单轮的指示信息,还可以包括历史的意图、文本和实体,以及本轮的意图、文本和实体等。
在一种实现中,语音助手中有端云协同模块、ASR模块、NLU模块和DM模块,则端云协同模块将该语音信号发送至ASR模块;ASR模块得到语音信号对应的文本信息后,由NLU模块基于文本信息识别意图;最后由DM模块得到第一指令和第一对话状态。进而,DM模块可以将第一指令和第一对话状态发送至端云协同模块,由端云协同模块决策是否使用第一指令,也即是执行步骤S306。
在另一种实现中,语音助手中不具备ASR模块、NLU模块和DM模块,语音助手可以调用ASR、NLU和DM等插件的功能以实现基于语音信号得到第一指令。
S304、服务器在接收到该语音信号后,确定语音信号对应的第二指令。
在一些实施例中,服务器先将该语音信号转换为文本信息;进而,对该文本信息进行语义识别,得到语义识别后的意图;最后,基于意图,得到第二指令。其中,第二指令用于指示该语音信号对应的操作。
可选地,服务器还可以基于该语音信号,确定第二对话状态。其中,第二对话状态可以包括用于指示本轮对话为多轮或单轮的指示信息,还可以包括针对该语音信号的回复信息。需要说明的是,上述第一对话状态和第二对话状态均还可以包括整个对话历史的累积的意图、槽位和关键实体集合等。
在一种实现中,服务器具备ASR模块、NLU模块和DM模块,则服务器在接收到该语音信号后,将该语音信号发送至ASR模块;ASR模块得到语音信号对应的文本信息后,由NLU模块基于文本信息识别意图;最后由DM模块得到第二指令和第二对话状态。
S305、服务器将该第二指令发送至车辆。
在一些实施例中,服务器在基于该语音信号得到第二指令后,将该第二指令发送至车辆。
可选地,服务器向车辆发送第二对话状态。
在一种实现中,服务器通过DM模块得到第二指令和第二对话状态后,服务器可以将第二指令和第二对话状态发送至车辆,由车辆进行决策是否使用第二指令,也即是执行步骤S306。
S306、车辆在先得到的指令可直接执行时,执行先得到的指令,该先得到的指令为第一指令或第二指令。
在一些实施例中,车辆在确定先得到的指令可直接执行时,执行先得到的指令;不执行后得到的指令。例如,车辆在得到第一指令时查询第二指令是否到达,若第二指令未到达,即车辆未接收到第二指令,则第一指令为先得到的指令;进而,判断第一指令能否直接执行,在确定第一指令可直接执行时执行第一指令;进而,在车辆得到第二指令时,丢弃第二指令,即不执行第二指令。可以理解的,先得到的指令为第一指令时,后得到的指令为第二指令;先得到的指令为第二指令时,后得到的指令为第一指令。
在一种实现中,先得到的指令为第一指令,则车辆可以根据第一决策信息,确定是否可以直接执行第一指令。其中,第一决策信息可以包括上一轮决策结果、上一轮对话的状态标识、端侧本轮意图结果、端侧本轮指令结果和端侧置信度中的任一项。其中,端侧本轮意图结果为车辆确定的本轮的意图结果;端侧本轮指令结果为车辆确定的本轮的指令结果;端侧置信度为第一指令的置信度;上一轮决策结果用于指示针对上一轮对话执行的指令的来源,来源包括服务器和终端设备;状态标识用于指示单轮或多轮;意图结果用于指示新意图或多轮意图;指令结果包括正常或异常,异常用于指示无法执行语音信号对应的操作。
其中,指令结果为异常时可以包括第一指令对应的垂域为不可闭环垂域,也即是说,车辆不支持执行该第一指令。其中,垂域是指功能领域,如车辆控制、音乐播放和天气查询等领域。例如,第一指令为天气查询;由于天气查询为不可闭环垂域,则该第一指令的指令结果为异常。
可选地,上述第一决策信息还可以包括第一指令对应的垂域情况;所述垂域可以包括可闭环垂域和不可闭环垂域。
其中,端侧置信度可以是以下方法得到的:车辆中基于ASR识别结果可以得到第一置信度,基于NLU识别结果可以得到第二置信度;车辆可以将第一置信度或第二置信度作为上述端侧置信度,也可以基于第一置信度和第二置信度得到上述端侧置信度,例如将第一置信度和第二置信度进行加权求和后得到上述端侧置信度。
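上述将第一置信度(ASR)与第二置信度(NLU)加权求和得到端侧置信度的做法,可以示意如下;权重取值仅为假设,实际取值由具体实现确定:

```python
def fuse_confidence(asr_conf, nlu_conf, w_asr=0.4, w_nlu=0.6):
    # 第一置信度(基于 ASR 识别结果)与第二置信度(基于 NLU 识别结果)
    # 加权求和,得到端侧置信度;权重 w_asr、w_nlu 为示意性取值
    return w_asr * asr_conf + w_nlu * nlu_conf


conf = fuse_confidence(0.9, 0.8)
```

也可以直接取第一置信度或第二置信度之一作为端侧置信度,对应正文中的另两种方式。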
例如,车辆可以在上一轮决策结果、上一轮对话的状态标识、端侧本轮意图结果和端侧本轮指令结果中每一项均符合预设条件时,确定可以直接执行第一指令,其中,预设条件可以为上一轮决策结果为车辆、上一轮对话的状态标识为多轮、端侧本轮意图结果为多轮结果且端侧本轮指令结果为正常时,确定可以执行第一指令。
又例如,第一决策信息为端侧置信度,预设条件为端侧置信度大于预设值时确定该第一指令可直接执行。
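上述两类预设条件(决策规则逐项匹配,或端侧置信度大于预设值)可以示意为如下判断;规则项的取值与阈值均为说明用的假设:

```python
RULES = {  # 决策规则,可存放于端云协同模块的配置文件中
    "last_decision": "device",   # 上一轮决策结果为车辆
    "last_state": "multi",       # 上一轮对话的状态标识为多轮
    "intent_result": "multi",    # 端侧本轮意图结果为多轮意图
    "cmd_result": "normal",      # 端侧本轮指令结果为正常
}
CONF_THRESHOLD = 0.85  # 预设置信度阈值(取值为假设)


def can_execute_directly(decision_info):
    # 条件一:第一决策信息的各规则项逐项匹配;
    # 条件二:端侧置信度大于预设值
    rule_hit = all(decision_info.get(k) == v for k, v in RULES.items())
    conf_hit = decision_info.get("confidence", 0.0) > CONF_THRESHOLD
    return rule_hit or conf_hit


ok = can_execute_directly({
    "last_decision": "device", "last_state": "multi",
    "intent_result": "multi", "cmd_result": "normal", "confidence": 0.5,
})
```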
例如,车辆确定上一轮对话的状态标识为多轮,车辆得到的本轮意图结果为新意图,如车控意图,服务器得到的本轮意图结果为多轮意图,那么,若车辆确定上一轮的决策结果为服务器时,车辆可以确定第二指令为目标指令。
在一种实现中,先得到的指令为第二指令,则车辆可以根据第二决策信息,确定是否可以直接执行第二指令。其中,第二决策信息可以包括上一轮决策结果、上一轮对话的状态标识、云侧本轮意图结果、云侧本轮指令结果和云侧置信度中的任一项。其中,云侧本轮意图结果为服务器确定的本轮的意图结果;云侧本轮指令结果为服务器确定的本轮的指令结果。
其中,上述预设条件也即是决策规则,该决策规则可以存储在如图2所示的端云协同模块261中,具体可以存储在端云协同模块261中的配置文件中。进而,在用户要对决策规则进行更新时,仅需更新该配置文件即可。
关于车辆确定指令是否可直接执行的具体实现可以参见图6和图7的实施例中的相关描述。
可选地,车辆在执行先得到的指令后,可以存储先得到的指令和该指令对应的对话状态;进而,还可以在将下一个语音信号上传至服务器时将先得到的指令和该指令对应的对话状态上传至服务器。例如,先得到的指令为第一指令,该指令对应的对话状态为第一对话状态,则车辆在执行第一指令后若接收到下一个语音信号,可以在将下一个语音信号上传至服务器时将第一指令和第一对话状态上传至服务器。
可选地,车辆在执行先得到的指令后,可以在存储中写入该执行情况,该执行情况可以包括该语音信号对应的操作已被执行或已执行先得到的指令,以使车辆在得到后得到的指令时不执行后得到的指令。其中,存储可以为公共队列。例如,先得到的指令为第一指令,后得到的指令为第二指令,则车辆在执行第一指令后,可以将公共队列中用于指示该语音信号对应的决策结果的状态字段更新为已执行。
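利用公共队列中的状态字段记录执行情况、避免同一语音信号被重复执行,可以示意为如下草图;类名与字段名均为假设:

```python
class DecisionQueue:
    """公共队列草图:按语音信号 id 记录决策结果的状态字段。"""

    def __init__(self):
        self._status = {}

    def mark_executed(self, signal_id, source):
        # 执行某指令后写入执行情况;source 指示指令来源(device/server)
        self._status[signal_id] = {"state": "executed", "source": source}

    def already_executed(self, signal_id):
        # 后到的指令先查询状态字段,已执行则丢弃自身
        entry = self._status.get(signal_id)
        return entry is not None and entry["state"] == "executed"


queue_store = DecisionQueue()
queue_store.mark_executed("utt-1", "device")        # 先得到的第一指令已被执行
drop_second = queue_store.already_executed("utt-1")  # 第二指令到达时查询,应丢弃
```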
S307、车辆在先得到的指令不可直接执行时,从第一指令和第二指令中选择目标指令执行,目标指令为第一指令或第二指令。
在一些实施例中,车辆在确定先得到的指令不可直接执行时,等待另一个指令;在车辆接收到另一个指令后,车辆再从第一指令和第二指令确定目标指令;最后,车辆或服务器执行该目标指令。
示例性的,在车辆不具备执行该目标指令的技能时,如该目标指令为“打开家里的空调”,则车辆可以将该目标指令发送至服务器,由服务器执行该目标指令。关于服务器执行目标指令的具体实现可以参见图8的相关内容。
需要说明的是,先得到的指令为第一指令时另一个指令为第二指令,先得到的指令为第二指令时另一个指令为第一指令。
例如,车辆可以在得到另一个指令时,先确定另一个指令能否直接执行;在另一个指令可直接执行时,执行另一个指令,也即是确定另一个指令为目标指令;在另一个指令不可执行时从第一指令和第二指令中选择目标指令执行。又例如,车辆可以将先得到的指令标记为等待中,进而,在得到另一个指令时从第一指令和第二指令中确定目标指令。关于上述两个示例的具体实现可以参见下文图4至图7的相关描述。
在一些实施例中,端云协同模块可以根据第三决策信息,从第二指令和第一指令中确定该语音信号对应的目标指令。其中,第三决策信息可以包括上一轮决策结果、上一轮对话的状态标识、端侧本轮意图结果和端侧本轮指令结果,云侧本轮意图结果和云侧本轮指令结果。第三决策信息还可以包括上述端侧置信度和云侧置信度。
可选地,车辆在执行目标指令后,可以存储目标指令和该目标指令对应的对话状态;进而,还可以在将下一个语音信号上传至服务器时将目标指令和该目标指令对应的对话状态上传至服务器。例如,目标指令为第一指令,该目标指令对应的对话状态为第一对话状态,则车辆在执行第一指令后若接收到下一个语音信号,可以在将下一个语音信号上传至服务器时将第一指令和第一对话状态上传至服务器。
可选地,车辆在执行目标指令后,可以在存储中写入该语音信号的执行情况,该执行情况可以包括该语音信号对应的操作已被执行或已执行目标指令,以使车辆在得到另一个指令时不执行另一个指令。其中,存储可以为公共队列。例如,目标指令为第一指令,另一个指令为第二指令,则车辆在执行第一指令后,可以在公共队列中用于指示该语音信号对应的决策结果的状态字段更新为已执行。可以理解的,目标指令为第一指令时另一个指令为第二指令;目标指令为第二指令时另一个指令为第一指令。
其中,上述步骤S306和步骤S307中车辆对得到的指令进行决策时可以有不同的决策逻辑。以下示例性的以图4和图5为例,对车辆对得到的指令进行决策的决策方法进行说明。
请参见图4,图4是本申请实施例提供的一种决策方法的流程示意图。
如图4所示,车辆在得到第一指令时,可以查询第二指令是否被执行;在确定第二指令被执行时,车辆可以丢弃第一指令,即不执行第一指令;在确定第二指令未被执行时,车辆可以判断第一指令能否直接执行,在确定第一指令可直接执行时,执行第一指令;在确定第一指令不可执行时,等待第二指令的可执行情况;在第二指令可直接执行时丢弃第一指令,在第二指令不可直接执行时从第一指令和第二指令中选择目标指令执行。
车辆在得到第二指令时,可以查询第一指令是否被执行;在确定第一指令被执行时,车辆可以丢弃第二指令,即不执行第二指令;在确定第一指令未被执行时,车辆可以判断第二指令能否直接执行,在确定第二指令可直接执行时,执行第二指令;在确定第二指令不可执行时,等待第一指令的可执行情况;在第一指令可直接执行时丢弃第二指令,在第一指令不可直接执行时从第二指令和第一指令中选择目标指令执行。
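图4所示的决策流程(以先分析到达的第一指令为例,第二指令方向对称)可以用如下草图示意;其中 executable 表示图中"能否直接执行"的判断,为假设的接口:

```python
def on_first_command(first, second_state, executable):
    """second_state 为二元组:("executed" | "absent" | "waiting", 第二指令或 None)。
    返回对第一指令的处置:丢弃、执行、等待或进入综合决策。"""
    state, second = second_state
    if state == "executed":
        return ("drop", None)                  # 第二指令已被执行,丢弃第一指令
    if executable(first):
        return ("execute", first)              # 第一指令可直接执行
    if state == "waiting" and second is not None:
        if executable(second):
            return ("execute", second)         # 第二指令可直接执行,丢弃第一指令
        return ("arbitrate", (first, second))  # 两者均不可直接执行,综合决策选目标指令
    return ("wait", first)                     # 第二指令未到达,等待其可执行情况


action, _ = on_first_command("cmd1", ("executed", None), lambda c: True)
```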
请参见图5,图5是本申请实施例提供的一种决策方法的流程示意图。图5示例性以车辆得到第一指令为例进行说明,车辆得到第二指令时的决策方法与得到第一指令时的决策方法一致,此处不再赘述。
如图5所示,车辆在得到第一指令后,可以查询第二指令的状态,第二指令的状态可以包括已执行、未到达和等待中。在第二指令的状态为已执行时,车辆可以丢弃第一指令,即不执行第一指令。在第二指令的状态为未到达时,车辆可以判断第一指令能否直接执行;在第一指令可直接执行时执行第一指令;在第一指令不可直接执行时,等待第二指令并存储第一指令的状态为等待中,直至车辆接收到第二指令,进而从第一指令和第二指令中选择目标指令执行。在第二指令的状态为等待中时,车辆从第一指令和第二指令中选择目标指令执行。
可见,图4所示的决策方法中,在第一指令和第二指令均不可直接执行时才会采用基于第一指令和第二指令的综合决策方案。而图5所示的决策方法中,车辆在一端指令到达时,会查询对端指令是否在等待中;若对端指令在等待中,则会从第一指令和第二指令中选择目标指令执行。可以理解的,图5所示的方法相对于图4所示的方法,优先选用了基于第一指令和第二指令决策的综合方案,由于该综合方案进行决策使用的数据更多,该综合方案的决策准确性更高,因此图5的方法可以提高决策的准确性。需要说明的是,一端指令为第一指令时对端指令为第二指令,一端指令为第二指令时对端指令为第一指令。
在本申请实施例中,车辆还可以基于服务器的识别结果进行自学习。也即是说,车辆在执行步骤S306或步骤S307后,还可以执行下述步骤S308。
S308、车辆在第一指令和第二指令不同时,基于第二指令优化车辆的指令识别能力。
在一些实施例中,车辆可以在确定第一指令和第二指令不同时,基于第二指令优化车辆的指令识别能力;也可以在第一指令和第二指令不同且满足预设情况时,基于第二指令优化车辆的指令识别能力,该预设情况可以包括第一指令的端侧置信度低于预设阈值、第二指令为目标指令或确定第二指令比第一指令更准确。例如车辆先得到第一指令并执行第一指令后,接收到第二指令时确定第二指令比第一指令更准确,则车辆可以基于第二指令和语音信号优化车辆的指令识别能力。
在一种实现中,第一指令是车辆基于端侧模型和语音信号得到的,则车辆在确定第一指令和第二指令不同时,将语音信号和第二指令作为训练样本对端侧模型进行训练,以不断补齐端云能力差异。其中,第二指令可以包括文本、意图和实体;该端侧模型可以包括ASR模型和NLU模型。例如,车辆可以从第二指令中获取文本和意图;进而,车辆可以基于语音信号和该文本对ASR模型进行训练;基于该意图和该文本对NLU模型进行训练。
其中,端侧模型可以为基于初始语音信号和初始热门实体训练得到的语音识别模型。端侧模型预置原始热门实体,以音乐为例,初始预置实体有Top1w歌手+歌曲名。
在另一种实现中,车辆可以基于第二指令,增加车辆中的预置实体数量。例如语音信号对应的语音内容为播放歌曲A,A为该歌曲歌名,则A为实体。可以理解的,若车辆的预置实体中没有A,则车辆在识别该语音信号时无法识别到A指代的含义。
例如,车辆可以在第二指令命中实体且端侧预置实体中未包括该命中的实体时,可以将该命中的实体加载到预置实体中。例如,若第二指令命中实体,也即是服务器识别出用户的语音信号要求播放的是A;若车辆未在预置实体中查询到A,则车辆在确定第二指令命中实体A时,可以将实体A加载至预置实体中。
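基于第二指令命中的实体扩充端侧预置实体的自学习过程,可以示意为如下草图;函数名与实体内容均为说明用的假设:

```python
preset_entities = {"歌手甲", "歌曲乙"}  # 端侧预置热门实体(示例)


def learn_entity(cloud_result, entities):
    # 第二指令命中实体且端侧预置实体中未包括该实体时,
    # 将该命中的实体加载到预置实体中
    hit = cloud_result.get("entity")
    if hit and hit not in entities:
        entities.add(hit)
        return True   # 本次发生了自学习
    return False


learned = learn_entity({"intent": "播放音乐", "entity": "A"}, preset_entities)
```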
在又一种实现中,第二指令包括文本、实体和意图,车辆可以存储该文本与该实体和该意图的对应关系,以使在车辆得到语音信号对应的文本信息符合该文本,直接得到该意图和该实体。
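存储文本与意图、实体的对应关系,使端侧下次得到相同文本时直接命中,可以示意为如下草图(字典缓存为说明用的假设实现):

```python
nlu_cache = {}


def remember(text, intent, entity):
    # 记录第二指令中的文本与该意图、该实体的对应关系
    nlu_cache[text] = (intent, entity)


def lookup(text):
    # 端侧得到的文本信息符合已存文本时,直接得到意图和实体
    return nlu_cache.get(text)


remember("播放A", "播放音乐", "A")
hit = lookup("播放A")
```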
以下以图6和图7为例,对步骤S306的决策方法进行详细说明。
在一些实施例中,车辆得到第一指令和第二指令的时间不同,则车辆在得到第一指令和第二指令中的一个指令时,可以先确定是否执行该指令;在确定该指令不可直接执行时,等待得到第一指令和第二指令中的另一个指令,再从两个指令确定该语音信号对应的目标指令。
图6示例性示出了车辆在得到第一指令时的决策方法的流程示意图。如图6所示,该决策方法可以包括以下部分或全部步骤:
S601、端云协同模块得到语音信号对应的第一指令。
其中,该端云协同模块部署在车辆的语音助手中,可以接收来自语音助手中DM模块发送的第一指令,或者,可以接收语音助手调用DM插件得到的第一指令。
S602、端云协同模块查询该语音信号是否有决策结果。
其中,该语音信号有决策结果用于指示该语音信号对应的操作已被执行;未有决策结果用于指示该语音信号对应的操作未被执行。也就是说,端云协同模块在得到第一指令后,先查询该语音信号对应的操作是否已经被执行。其中,该决策结果用于指示该语音信号对应的操作是执行服务器生成的指令(即第二指令)还是车辆生成的指令(即第一指令)。其中,该决策结果可以为第一指令或第二指令,也可以为服务器或车辆;可以理解的,该决策结果为车辆,也即是该语音信号对应的目标指令为第一指令;该决策结果为服务器,也即是该语音信号对应的目标指令为第二指令。
在一些实施例中,端云协同模块在得到第一指令后,可以从存储中查询该语音信号有无决策结果。例如端云协同模块将语音信号的决策结果写入公共队列中,则端云协同模块在得到第一指令后,可以读取公共队列中用于指示决策结果的状态字段,确定该语音信号有无决策结果。例如,端云协同模块接收来自DM模块发送的第一指令时,读取公共队列中的上述状态字段,确定该语音信号有无决策结果。
进而,端云协同模块在查询到该语音信号未有决策结果时,可以执行步骤S603,即进一步确定是否可以直接执行第一指令;在查询到该语音信号有决策结果时,可以执行步骤S604,即丢弃第一指令。可以理解的,若端云协同模块已经先获取第二指令并决策第二指令为目标指令,已执行该目标指令,则端云协同模块在得到第一指令时可以丢弃第一指令,以免重复执行该语音信号对应的操作。
可选地,公共队列还可以包括指令标识。例如端云协同模块可以在得到第一指令时,在公共队列中写入指示已得到第一指令的指令标识;可以在得到第二指令时,在公共队列中写入指示已得到第二指令的指令标识。
在另一些实施例中,端云协同模块在得到第一指令或第二指令时,可以在存储中存入指示已得到指令的指令标识。其中,该存储可以为上述公共队列。那么,端云协同模块在得到第一指令时,端云协同模块可以先查询指令标识,确定是否已得到第二指令;在查询到端云协同模块未接收到来自服务器的第二指令时,端云协同模块可以执行步骤S603;若查询到端云协同模块已接收到第二指令,则端云协同模块可以执行上述步骤S602。
S603、端云协同模块确定是否直接执行第一指令。
需要说明的是,端云协同模块确定是否直接执行第一指令,也即是,确定第一指令是否为该语音信号对应的目标指令。可以理解的,若端云协同模块确定第一指令为该语音信号对应的目标指令,则端云协同模块可以直接执行第一指令;若端云协同模块不确定第一指令是否为该语音信号对应的目标指令,则端云协同模块可以等待第二指令,再基于第一指令和第二指令综合决策两者哪一个是语音信号对应的目标指令。
在一些实施例中,端云协同模块在确定该语音信号未有决策结果时,可以基于决策规则,根据第一决策信息,确定是否可以直接执行第一指令。其中,第一决策信息可以包括上一轮决策结果、上一轮对话的状态标识、端侧本轮意图(NLU)结果和端侧本轮指令(DM)结果;第一决策信息还可以包括端侧置信度,该端侧置信度可以是车辆NLU模块基于本轮语音信号得到,也可以是ASR模块基于本轮语音信号得到,还可以是车辆综合NLU模块和ASR模块得到的置信度得到的。
其中,上一轮决策结果是指上一轮决策的目标指令是服务器生成的指令还是车辆生成的指令;上一轮对话的状态标识是指上轮是处于单轮对话还是多轮对话;端侧本轮意图结果可以包括新意图或多轮意图。端侧本轮指令结果可以为正常或异常,其中,正常可以包括有结果和部分结果,异常可为无结果,例如车辆不支持查询天气时,则端云协同模块得到的端侧本轮指令结果可以为"无结果";又例如,该语音信号包括多意图时,若端云协同模块可以执行多意图中的部分意图对应的指令,此时为"部分结果"。
在一种实现中,端云协同模块可以基于决策规则,根据上一轮决策结果、上一轮对话的状态标识、端侧本轮意图结果和端侧本轮指令结果,确定是否执行第一指令。例如,决策规则包括上一轮决策结果为车辆、上一轮对话的状态标识为多轮、端侧本轮意图结果为多轮且端侧本轮指令结果为正常时,确定可以执行第一指令,则端云协同模块在判定上述第一决策信息满足上述决策规则时,确定可以执行第一指令。
进而,若端云协同模块确定可以执行第一指令,则执行步骤S605;若不可直接执行第一指令,则执行步骤S606,即等待第二指令后再执行步骤S607。
S604、端云协同模块丢弃第一指令。
在一些实施例中,端云协同模块在得到第一指令后,可以从存储中查询该语音信号有无决策结果。进而,端云协同模块在查询到该语音信号有决策结果时,可以丢弃第一指令。可以理解的,若端云协同模块已经先获取第二指令并决策第二指令为目标指令,已执行该目标指令,则端云协同模块在得到第一指令时可以丢弃第一指令,以免重复执行该语音信号对应的操作。
S605、业务执行模块执行第一指令。
在一种实现中,业务执行模块接收来自端云协同模块发送的第一指令时,执行第一指令。例如,第一指令的内容可以为“调用天气应用获取明天的天气情况并通过卡片显示明天的天气情况”,则端云协同模块261在确定执行第一指令时,可以通过业务执行模块262调用天气应用获取明天的天气情况,并通过卡片显示明天的天气情况。
进一步的,业务执行模块在执行第一指令后,还可以将该语音信号的决策结果更新为有决策结果。那么,端云协同模块在接收到来自服务器的第二指令时查询该语音信号的决策结果,发现该语音信号已被执行,则端云协同模块可以丢弃第二指令,以免进行不必要的操作。例如业务执行模块在执行第一指令后,可以在公共队列中写入该语音信号已有决策结果。
S606、端云协同模块从第一指令和第二指令中确定该语音信号对应的目标指令。
在一些实施例中,端云协同模块在确定第一指令不可直接执行时,等待第二指令的可执行情况;在第二指令不可直接执行时,根据第三决策信息,从第一指令和第二指令中确定该语音信号对应的目标指令。
其中,第三决策信息可以包括上一轮决策结果、上一轮对话的状态标识、端侧本轮意图 结果和端侧本轮指令结果,云侧本轮意图结果和端侧本轮指令结果;还可以包括云侧置信度。其中,云侧本轮意图结果、云侧本轮指令结果和云侧置信度可以是由服务器发送至车辆的。其中,该云侧置信度可以是服务器中NLU模块基于本轮语言信号得到,也可以是服务器ASR模块基于本轮语音信号得到,还可以是服务器基于NLU模块和云侧ASR模块得到的置信度得到的。
可选地,端云协同模块在预设时间内未接收到来自服务器的第二指令时,端云协同模块将第一指令确定为目标指令,进而执行步骤S607。
可选地,端云协同模块在预设时间内未接收到来自服务器的第二指令时,端云协同模块不执行第一指令,不响应该语音信号或提醒用户不可用。
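上述超时管理(在预设时间内未收到来自服务器的第二指令时,将第一指令确定为目标指令)可以示意为如下草图;队列接口基于 Python 标准库 queue,回退策略仅为正文所述可选方案之一:

```python
import queue
import threading


def wait_for_cloud(cloud_queue, first_cmd, timeout_s=0.2):
    # 在预设时间 timeout_s 内等待来自服务器的第二指令;
    # 超时则将第一指令确定为目标指令(回退策略之一)
    try:
        return cloud_queue.get(timeout=timeout_s)
    except queue.Empty:
        return first_cmd


q = queue.Queue()
target = wait_for_cloud(q, "第一指令", timeout_s=0.05)    # 超时,回退到第一指令

threading.Timer(0.01, lambda: q.put("第二指令")).start()  # 模拟第二指令稍后到达
target2 = wait_for_cloud(q, "第一指令", timeout_s=0.5)    # 第二指令按时到达
```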
S607、业务执行模块执行语音信号对应的目标指令。
在一些实施例中,业务执行模块在执行目标指令后,还可以存储本轮的决策结果。例如在目标指令为第一指令时,业务执行模块在公共队列中写入本轮对话该语音信号采用的指令是来自车辆的第一指令。
图7示例性示出了车辆在接收到来自服务器的第二指令时的决策方法的流程示意图。如图7所示,该决策方法可以包括以下部分或全部步骤:
S701、端云协同模块得到语音信号对应的第二指令。
其中,该端云协同模块部署在车辆的语音助手中。
在一些实施例中,车辆接收来自服务器的第二指令。
S702、端云协同模块查询该语音信号是否有决策结果。
在一些实施例中,端云协同模块在得到第二指令后,可以先查询该语音信号对应的操作是否已经被执行。进而,端云协同模块在查询到该语音信号未有决策结果时,可以执行步骤S703,即进一步确定是否可以直接执行第二指令;在查询到该语音信号有决策结果时,可以执行步骤S704,即丢弃第二指令。可以理解的,若端云协同模块已经先获取第一指令并决策第一指令为目标指令,已执行该目标指令,则端云协同模块在得到第二指令时可以丢弃第二指令,以免重复执行该语音信号对应的操作。
在另一些实施例中,端云协同模块在得到第二指令或第一指令时,还可以在存储中存入指示已得到指令的指令标识。其中,该存储可以为上述公共队列。那么,端云协同模块在得到第二指令时,端云协同模块可以先查询指令标识,确定是否已得到第一指令;在查询到端云协同模块未得到第一指令时,端云协同模块可以执行步骤S703;若查询到端云协同模块已得到第一指令,则端云协同模块可以执行上述步骤S702。
关于步骤S703的具体内容可以参见上述步骤S603中的相关描述,此处不再赘述。
S703、端云协同模块确定是否直接执行第二指令。
需要说明的是,端云协同模块确定是否直接执行第二指令,也即是,确定第二指令是否为该语音信号对应的目标指令。
在一些实施例中,端云协同模块在确定该语音信号未有决策结果时,可以基于决策规则,根据第二决策信息,确定是否可以直接执行第二指令。其中,第二决策信息可以包括上一轮决策结果、上一轮对话的状态标识、云侧本轮意图(NLU)结果和云侧本轮指令(DM)结果;第二决策信息还可以包括云侧置信度。其中,上一轮决策结果、上一轮对话的状态标识可以从本地存储中获取;云侧本轮意图结果、云侧本轮指令结果和云侧置信度可以是由服务 器发送至车辆的。
在一种实现中,端云协同模块可以基于决策规则,根据上一轮决策结果、上一轮对话的状态标识、云侧本轮意图结果和云侧本轮指令结果,确定是否执行第二指令。例如,决策规则包括上一轮决策结果为服务器、上一轮对话的状态标识为多轮、云侧本轮意图结果为多轮且云侧本轮指令结果为正常时,确定可以执行第二指令,则端云协同模块在判定上述第二决策信息满足上述决策规则时,确定可以执行第二指令。
进而,若端云协同模块确定可以执行第二指令,则执行步骤S705;若不可直接执行第二指令,则执行步骤S706,即等待第一指令后再执行步骤S707。
S704、端云协同模块丢弃第二指令。
在一些实施例中,端云协同模块在得到第二指令后,可以从存储中查询该语音信号有无决策结果。进而,端云协同模块在查询到该语音信号有决策结果时,可以丢弃第二指令。可以理解的,若端云协同模块已经先获取第一指令并决策第一指令为目标指令,已执行该目标指令,则端云协同模块在得到第二指令时可以丢弃第二指令,以免重复执行该语音信号对应的操作。
S705、业务执行模块执行第二指令。
在一种实现中,业务执行模块接收来自端云协同模块发送的第二指令时,执行第二指令。例如,第二指令的内容可以为“调用天气应用获取明天的天气情况并通过卡片显示明天的天气情况”,则端云协同模块261在确定执行第二指令时,可以通过业务执行模块262调用天气应用获取明天的天气情况,并通过卡片显示明天的天气情况。
进一步的,业务执行模块在执行第二指令后,还可以将该语音信号的决策结果更新为有决策结果。那么,端云协同模块在得到第一指令时查询该语音信号的决策结果,发现该语音信号已被执行,则端云协同模块可以丢弃第一指令,以免进行不必要的操作。例如业务执行模块在执行第二指令后,可以在公共队列中写入该语音信号已有决策结果。
S706、端云协同模块从第一指令和第二指令中确定该语音信号对应的目标指令。
在一些实施例中,端云协同模块可以根据第三决策信息,从第二指令和第一指令中确定该语音信号对应的目标指令。
其中,第三决策信息可以包括上一轮决策结果、上一轮对话的状态标识、端侧本轮意图结果和端侧本轮指令结果,云侧本轮意图结果和云侧本轮指令结果。第三决策信息还可以包括上述端侧置信度和云侧置信度。
在另一些实施例中,端云协同模块在预设时间内未接收到第一指令时,端云协同模块可以将第二指令确定为目标指令,进而执行步骤S707。
S707、业务执行模块执行语音信号对应的目标指令。
在一些实施例中,业务执行模块在执行目标指令后,还可以存储本轮的决策结果。例如在目标指令为第二指令时,业务执行模块在公共队列中写入本轮对话该语音信号采用的指令是来自服务器的第二指令。
在一些实施例中,目标指令是由服务器执行的,也即是说上述步骤S307为车辆将目标指令发送至服务器,以使服务器执行该目标指令。
在一些应用场景中,终端设备和服务器的垂域存在冲突,也就是,针对同一语音信号,终端设备得到的第一指令是由终端设备执行的;服务器得到的第二指令是由服务器执行的。此时,易存在第一指令和第二指令均被执行的情况。例如在人车家互联的场景中,服务器可以控制智能家居,而车辆可以控制车内设备,此时,服务器垂域说法与车辆垂域说法冲突严重,易存在同时执行的问题,如当用户说"打开空调"时,可能会同时打开车上和家里的空调。针对上述情况,本申请实施例通过车辆对第一指令和第二指令进行选择,以保障只有一个指令被执行。
请参见图8,图8是本申请实施例提供的又一种人机对话方法,该方法包括以下部分或全部步骤:
S801、车辆通过语音助手接收语音信号。
步骤S801的具体内容可以参见上述步骤S301中的相关描述,此处不再赘述。
S802、车辆通过语音助手同时向端侧ASR和服务器发送该语音信号。
在一些实施例中,车辆在接收到语音信号时,向服务器发送该语音信号;同时,车辆可以执行S803,对语音信号进行处理,确定该语音信号对应的第一指令。如语音助手中的端云协同模块将该语音信号发送至ASR模块,由ASR模块将语音信号转换为文本信息后,由NLU模块基于该文本信息得到第一指令。其中,车辆通过语音助手向端侧ASR发送该语音信号也可以是,语音助手调用ASR接口对该语音信号进行处理。
步骤S802的具体内容可以参见上述步骤S302中的相关描述,此处不再赘述。
S803、车辆通过端侧ASR/NLU/DM,确定该语音信号对应的第一指令。
示例性的,语音助手中的端云协同模块,将语音信号依次发送至ASR模块、NLU模块和DM模块,得到第一指令。
步骤S803的具体内容可以参见上述步骤S303中的相关描述,此处不再赘述。
S804、车辆将第一指令发送至语音助手。
在一种实现中,车辆中DM模块在得到第一指令后,将第一指令发送至语音助手中的端云协同模块。
S805、服务器在接收到该语音信号后,确定语音信号对应的第二指令。
步骤S805的具体内容可以参见上述步骤S304中的相关描述,此处不再赘述。
S806、服务器将该第二指令发送至车辆。
在一些实施例中,服务器中DM模块在第二指令执行前停止执行,将第二指令发送至车辆。
步骤S806的具体内容可以参见上述步骤S305中的相关描述,此处不再赘述。
S807、车辆从第一指令和第二指令中确定该语音信号对应的目标指令。
在一些实施例中,车辆可以得到第一指令和第二指令后,确定目标指令。
其中,关于确定目标指令的具体内容可以参见上文中的相关描述,此处不再赘述。
S808、车辆在确定目标指令为第一指令时,通过业务执行模块执行第一指令。
例如第一指令为打开车内的空调,语音助手在确定目标指令为第一指令时,即确定用户意图为打开车内的空调,通过业务执行模块打开车内空调。
车辆执行第一指令的具体内容可以参见上文中的相关描述,此处不再赘述。
在另一些实施例中,车辆在确定目标指令为第一指令时,还可以向服务器发送通知消息,该通知消息用于指示服务器不执行第二指令。
S809、车辆在确定目标指令为第二指令时,向服务器发送执行通知。
其中,该执行通知用于指示服务器执行第二指令,该执行通知可以包括决策结果(即第二指令);还可以包括上下文信息,即本轮对话以及前N轮对话内容。
例如第二指令为打开家里的空调,语音助手在确定目标指令为第二指令时,即确定用户意图为打开家里的空调,向服务器发送执行通知。
S810、服务器响应于该执行通知,执行第二指令。
在一种实现中,第二指令为控制家居设备,则服务器中的DM模块可以加载上下文,调用Hilink以实现家居控制。
图9示例性示出了终端设备200的一种硬件结构示意图。
下面以电子设备100为例对实施例进行具体说明。应该理解的是,电子设备100可以具有比图中所示的更多的或者更少的部件,可以组合两个或多个的部件,或者可以具有不同的部件配置。图中所示出的各种部件可以在包括一个或多个信号处理和/或专用集成电路在内的硬件、软件、或硬件和软件的组合中实现。
电子设备100可以包括:处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous  receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
I2C接口是一种双向同步串行总线,包括一根串行数据线(serial data line,SDA)和一根串行时钟线(serial clock line,SCL)。在一些实施例中,处理器110可以包含多组I2C总线。处理器110可以通过不同的I2C总线接口分别耦合触摸传感器180K,充电器,闪光灯,摄像头193等。例如:处理器110可以通过I2C接口耦合触摸传感器180K,使处理器110与触摸传感器180K通过I2C总线接口通信,实现电子设备100的触摸功能。
I2S接口可以用于音频通信。在一些实施例中,处理器110可以包含多组I2S总线。处理器110可以通过I2S总线与音频模块170耦合,实现处理器110与音频模块170之间的通信。在一些实施例中,音频模块170可以通过I2S接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。
PCM接口也可以用于音频通信,将模拟信号抽样,量化和编码。在一些实施例中,音频模块170与无线通信模块160可以通过PCM总线接口耦合。在一些实施例中,音频模块170也可以通过PCM接口向无线通信模块160传递音频信号,实现通过蓝牙耳机接听电话的功能。所述I2S接口和所述PCM接口都可以用于音频通信。
UART接口是一种通用串行数据总线,用于异步通信。该总线可以为双向通信总线。它将要传输的数据在串行通信与并行通信之间转换。在一些实施例中,UART接口通常被用于连接处理器110与无线通信模块160。例如:处理器110通过UART接口与无线通信模块160中的蓝牙模块通信,实现蓝牙功能。在一些实施例中,音频模块170可以通过UART接口向无线通信模块160传递音频信号,实现通过蓝牙耳机播放音乐的功能。
MIPI接口可以被用于连接处理器110与显示屏194,摄像头193等外围器件。MIPI接口包括摄像头串行接口(camera serial interface,CSI),显示屏串行接口(display serial interface,DSI)等。在一些实施例中,处理器110和摄像头193通过CSI接口通信,实现电子设备100的拍摄功能。处理器110和显示屏194通过DSI接口通信,实现电子设备100的显示功能。
GPIO接口可以通过软件配置。GPIO接口可以被配置为控制信号,也可被配置为数据信号。在一些实施例中,GPIO接口可以用于连接处理器110与摄像头193,显示屏194,无线通信模块160,音频模块170,传感器模块180等。GPIO接口还可以被配置为I2C接口,I2S接口,UART接口,MIPI接口等。
SIM接口可以被用于与SIM卡接口195通信,实现传送数据到SIM卡或读取SIM卡中数据的功能。
USB接口130是符合USB标准规范的接口,具体可以是Mini USB接口,Micro USB接口,USB Type C接口等。USB接口130可以用于连接充电器为电子设备100充电,也可以用于电子设备100与外围设备之间传输数据。也可以用于连接耳机,通过耳机播放音频。该接口还可以用于连接其他电子设备,例如AR设备等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可 以是有线充电器。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,外部存储器,显示屏194,摄像头193,和无线通信模块160等供电。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite  system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度等进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行电子设备100的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用(比如人脸识别功能,指纹识别功能、移动支付功能 等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如人脸信息模板数据,指纹信息模板等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置至少一个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器180A,电极之间的电容改变。电子设备100根据电容的变化确定压力的强度。当有触摸操作作用于显示屏194,电子设备100根据压力传感器180A检测所述触摸操作强度。电子设备100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如:当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。
陀螺仪传感器180B可以用于确定电子设备100的运动姿态。在一些实施例中,可以通过陀螺仪传感器180B确定电子设备100围绕三个轴(即,x,y和z轴)的角速度。陀螺仪传感器180B可以用于拍摄防抖。示例性的,当按下快门,陀螺仪传感器180B检测电子设备100抖动的角度,根据角度计算出镜头模组需要补偿的距离,让镜头通过反向运动抵消电子设备100的抖动,实现防抖。陀螺仪传感器180B还可以用于导航,体感游戏场景。
气压传感器180C用于测量气压。在一些实施例中,电子设备100通过气压传感器180C测得的气压值计算海拔高度,辅助定位和导航。
磁传感器180D包括霍尔传感器。电子设备100可以利用磁传感器180D检测翻盖皮套的开合。在一些实施例中,当电子设备100是翻盖机时,电子设备100可以根据磁传感器180D检测翻盖的开合。进而根据检测到的皮套的开合状态或翻盖的开合状态,设置翻盖自动解锁等特性。
加速度传感器180E可检测电子设备100在各个方向上(一般为三轴)加速度的大小。当电子设备100静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。
距离传感器180F,用于测量距离。电子设备100可以通过红外或激光测量距离。在一些实施例中,拍摄场景,电子设备100可以利用距离传感器180F测距以实现快速对焦。
接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。发光二极管可以是红外发光二极管。电子设备100通过发光二极管向外发射红外光。电子设备100使用光电二极管检测来自附近物体的红外反射光。当检测到充分的反射光时,可以确定电子设备100附近有物体。当检测到不充分的反射光时,电子设备100可以确定电子设备100附近没有物体。电子设备100可以利用接近光传感器180G检测用户手持电子设备100贴近耳朵通话,以便自动熄灭屏幕达到省电的目的。接近光传感器180G也可用于皮套模式,口袋模式自动解锁与锁屏。
环境光传感器180L用于感知环境光亮度。电子设备100可以根据感知的环境光亮度自适应调节显示屏194亮度。环境光传感器180L也可用于拍照时自动调节白平衡。环境光传感器180L还可以与接近光传感器180G配合,检测电子设备100是否在口袋里,以防误触。
指纹传感器180H用于采集指纹。电子设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
温度传感器180J用于检测温度。在一些实施例中,电子设备100利用温度传感器180J检测的温度,执行温度处理策略。例如,当温度传感器180J上报的温度超过阈值,电子设备100执行降低位于温度传感器180J附近的处理器的性能,以便降低功耗实施热保护。在另一些实施例中,当温度低于另一阈值时,电子设备100对电池142加热,以避免低温导致电子设备100异常关机。在其他一些实施例中,当温度低于又一阈值时,电子设备100对电池142的输出电压执行升压,以避免低温导致的异常关机。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于电子设备100的表面,与显示屏194所处的位置不同。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。电子设备100可以接收按键输入,产生与电子设备100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触摸操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触摸操作,马达191也可对应不同的振动反馈效果。不同的应用场景(例如:时间提醒,接收信息,闹钟,游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和电子设备100的接触和分离。电子设备100可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。同一个SIM卡接口195可以同时插入多张卡。所述多张卡的类型可以相同,也可以不同。SIM卡接口195也可以兼容不同类型的SIM卡。SIM卡接口195也可以兼容外部存储卡。电子设备100通过SIM卡和网络交互,实现通话以及数据通信等功能。
本实施例中,电子设备100可以通过处理器110执行所述人机对话方法。
图10是本申请实施例提供的电子设备100的软件结构框图。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图10所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息和语音助手等应用程序。
在一些实施例中,用户可以通过语音助手与其他设备(如上文中服务器)进行通信连接,例如向其他设备发送语音信号或获取其他设备发送的语音识别结果(如上文中的第二指令)。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图10所示,应用程序框架层可以包括显示(display)管理器,传感器(sensor)管理器,跨设备连接管理器,事件管理器,任务(activity)管理器,窗口管理器,内容提供器,视图系统,资源管理器,通知管理器等。
显示管理器用于系统的显示管理,负责所有显示相关事务的管理,包括创建、销毁、方向切换、大小和状态变化等。一般来说,单设备上只会有一个默认显示模块,即主显示模块。
传感器管理器负责传感器的状态管理,管理应用对传感器事件的监听,并将事件实时上报给应用。
跨设备连接管理器用于和终端设备200建立通信连接,基于该通信连接向终端设备200发送语音信号。
事件管理器用于系统的事件管理服务,负责接收底层上传的事件并分发给各窗口,完成事件的接收和分发等工作。
任务管理器用于任务(Activity)组件的管理,包括启动管理、生命周期管理、任务方向管理等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。窗口管理器还用于负责窗口显示管理,包括窗口显示方式、显示大小、显示坐标位置、显示层级等相关的管理。
以上各个实施例的具体执行过程可以参见下文中人机对话方法的相关内容。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示界面,可以包括显示文字的视图以及显示图片的视图。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库(也可称为数据管理层)可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)和事件数据等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
本申请实施例还提供了一种电子设备,电子设备包括一个或多个处理器和一个或多个存储器;其中,一个或多个存储器与一个或多个处理器耦合,一个或多个存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,当一个或多个处理器执行计算机指令时,使得电子设备执行上述实施例描述的方法。
本申请实施例还提供了一种包含指令的计算机程序产品,当计算机程序产品在电子设备上运行时,使得电子设备执行上述实施例描述的方法。
本申请实施例还提供了一种计算机可读存储介质,包括指令,当指令在电子设备上运行时,使得电子设备执行上述实施例描述的方法。
可以理解的是,本申请的各实施方式可以任意进行组合,以实现不同的技术效果。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、 或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk)等。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,该流程可以由计算机程序来指令相关的硬件完成,该程序可存储于计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:ROM或随机存储记忆体RAM、磁碟或者光盘等各种可存储程序代码的介质。
总之,以上所述仅为本申请技术方案的实施例而已,并非用于限定本申请的保护范围。凡根据本申请的揭露,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (17)

  1. 一种人机对话方法,其特征在于,应用于终端设备,所述方法包括:
    所述终端设备接收语音信号;
    所述终端设备基于所述语音信号,得到第一指令;
    所述终端设备将所述语音信号或所述语音信号的文本发送至服务器,所述服务器用于基于所述语音信号得到第二指令;所述语音信号的文本是所述终端设备基于所述语音信号得到的;
    所述终端设备接收服务器发送的所述第二指令;
    所述终端设备在先得到的指令不可直接执行时,从所述第一指令和所述第二指令选择目标指令执行,所述目标指令为所述第一指令或所述第二指令。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    所述终端设备在先得到的指令可直接执行时,执行所述先得到的指令;
    所述终端设备不执行后得到的指令。
  3. The method according to claim 1 or 2, wherein before the selecting a target instruction from the first instruction and the second instruction for execution, the method further comprises:
    when the terminal device obtains the first instruction and has not received the second instruction, determining, by the terminal device, an executability status of the first instruction, wherein the executability status comprises directly executable and not directly executable; and
    when determining that the first instruction is not directly executable, waiting, by the terminal device, for the second instruction.
  4. The method according to claim 3, wherein the determining an executability status of the first instruction comprises:
    determining, by the terminal device, the executability status of the first instruction based on first decision information, wherein the first decision information comprises at least one of a decision result of a previous round, a state identifier of a previous round of dialogue, an intent result of the current round determined by the terminal device, an instruction result of the current round determined by the terminal device, and a confidence of the first instruction; and
    the decision result of the previous round indicates a source of the instruction executed for the previous round of dialogue, the source comprising the server and the terminal device; the state identifier indicates single-turn or multi-turn; the intent result indicates a new intent or a multi-turn intent; and the instruction result comprises normal or abnormal, wherein abnormal indicates that an operation corresponding to the voice signal cannot be performed.
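The executability check of claims 3-4 combines the previous round's decision result, the dialogue-state identifier, the current round's intent and instruction results, and a confidence score. A minimal hedged sketch follows; the concrete priority order and the 0.9 threshold are assumptions for illustration, since the claims only enumerate the inputs, not the policy.

```python
# Hedged sketch of the device-side executability decision in claims 3-4.
# The policy and the threshold below are assumptions, not from the claims.
def can_execute_directly(prev_source: str, state_flag: str,
                         intent_result: str, instr_result: str,
                         confidence: float, threshold: float = 0.9) -> bool:
    """prev_source:   "device" | "server" - who handled the previous round
    state_flag:    "single" | "multi"  - single-turn vs multi-turn dialogue
    intent_result: "new" | "multi"     - new intent vs follow-up intent
    instr_result:  "normal" | "abnormal"
    """
    if instr_result == "abnormal":
        # The operation corresponding to the voice signal cannot be performed.
        return False
    if state_flag == "multi" and intent_result == "multi" and prev_source == "server":
        # Assumed rule: let the server continue a dialogue it already owns.
        return False
    return confidence >= threshold


print(can_execute_directly("device", "single", "new", "normal", 0.95))  # True
```

When this returns `False`, the terminal waits for the second instruction, exactly as recited in claim 3.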
  5. The method according to any one of claims 1-4, wherein before the selecting a target instruction from the first instruction and the second instruction for execution, the method further comprises:
    when the terminal device obtains the second instruction and has not received the first instruction, determining, by the terminal device, an executability status of the second instruction; and
    when determining that the second instruction is not directly executable, waiting, by the terminal device, for the first instruction.
  6. The method according to claim 5, wherein the determining an executability status of the second instruction comprises:
    determining, by the terminal device, the executability status of the second instruction based on second decision information, wherein the second decision information comprises at least one of a decision result of a previous round, a state identifier of a previous round of dialogue, an intent result of the current round determined by the server, and an instruction result of the current round determined by the server; and
    the decision result of the previous round indicates a source of the instruction executed for the previous round of dialogue, the source comprising the server and the terminal device; the state identifier indicates single-turn or multi-turn; the intent result indicates a new intent or a multi-turn intent; and the instruction result comprises normal or abnormal, wherein abnormal indicates that an operation corresponding to the voice signal cannot be performed.
  7. The method according to any one of claims 1-6, wherein the selecting a target instruction from the first instruction and the second instruction for execution comprises:
    determining, by the terminal device, the target instruction from the first instruction and the second instruction based on third decision information;
    wherein the third decision information comprises at least one of a decision result of a previous round, a state identifier of a previous round of dialogue, an intent result of the current round determined by the server, an instruction result of the current round determined by the server, an intent result of the current round determined by the terminal device, an instruction result of the current round determined by the terminal device, and a confidence of the first instruction; and
    the decision result of the previous round indicates a source of the instruction executed for the previous round of dialogue, the source comprising the server and the terminal device; the state identifier indicates single-turn or multi-turn; the intent result indicates a new intent or a multi-turn intent; and the instruction result comprises normal or abnormal, wherein abnormal indicates that an operation corresponding to the voice signal cannot be performed.
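Claim 7 covers the case where both candidate instructions are available and the terminal must arbitrate using both sides' results. The sketch below is one possible policy under assumed inputs (instruction results first, then dialogue continuity, then confidence); none of these priorities are dictated by the claim itself.

```python
# Hedged sketch of the arbitration in claim 7 when both candidates exist.
# The priority order is an assumed policy for illustration.
def select_target_source(decision: dict) -> str:
    dev_ok = decision["device_instr"] == "normal"
    srv_ok = decision["server_instr"] == "normal"
    if dev_ok != srv_ok:
        # Exactly one side produced a usable instruction: take it.
        return "device" if dev_ok else "server"
    if not dev_ok:
        return "server"   # both abnormal: assumed fallback to the server
    # Both normal: stay with the side that owns a continuing multi-turn
    # dialogue; otherwise trust the device only at high confidence.
    if decision["state"] == "multi" and decision["server_intent"] == "multi":
        return decision["prev_source"]
    return "device" if decision["confidence"] >= 0.9 else "server"


print(select_target_source({
    "device_instr": "normal", "server_instr": "normal",
    "state": "single", "server_intent": "new",
    "prev_source": "device", "confidence": 0.95,
}))  # device
```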
  8. The method according to any one of claims 1-7, wherein the method further comprises:
    sending, by the terminal device, a dialogue state to the server, wherein the dialogue state is used by the server to generate the second instruction, and the dialogue state comprises at least one of the decision result of the previous round and the state identifier of the previous round of dialogue.
  9. The method according to any one of claims 1-8, wherein the obtaining, by the terminal device, a first instruction based on the voice signal comprises:
    obtaining, by the terminal device, an intent of the voice signal based on the voice signal; and
    determining, by the terminal device, the first instruction based on the intent of the voice signal.
  10. The method according to any one of claims 1-9, wherein the obtaining, by the terminal device, a first instruction based on the voice signal comprises:
    inputting the voice signal into an on-device model to obtain an intent of the voice signal; and
    obtaining, by the terminal device, the first instruction based on the intent of the voice signal.
  11. The method according to claim 10, wherein the method further comprises:
    training, by the terminal device, the on-device model using the voice signal and the intent corresponding to the second instruction as a training sample.
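Claim 11 pairs the voice signal with the intent behind the server's (second) instruction to form supervised training samples for the on-device model. A minimal collection sketch follows; the filtering policy (keeping only rounds the server won) and the data layout are assumptions, not part of the claim.

```python
# Minimal sketch of collecting (voice signal, server intent) pairs for
# on-device model training. Filtering policy and layout are assumptions.
def collect_training_sample(dataset: list, voice_features, server_intent: str,
                            target_source: str) -> list:
    """Label device-side features with the server-derived intent; here only
    rounds the server actually won are kept, on the assumption that those
    are the cases the on-device model got wrong or was unsure about."""
    if target_source == "server":
        dataset.append({"x": voice_features, "y": server_intent})
    return dataset


samples = []
collect_training_sample(samples, [0.1, 0.4, 0.2], "open_window", "server")
collect_training_sample(samples, [0.3, 0.1, 0.5], "play_music", "device")
print(len(samples))  # 1
```

The collected set would then be fed to whatever training procedure the on-device model uses; that step is outside the scope of this sketch.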
  12. The method according to any one of claims 1-11, wherein the selecting a target instruction from the first instruction and the second instruction for execution comprises:
    executing, by the terminal device, the target instruction; or sending, by the terminal device, the target instruction to the server, wherein the server is configured to execute the target instruction.
  13. A human-machine dialogue method, applied to a server, the method comprising:
    receiving, by the server, a voice signal or text of the voice signal from a terminal device, wherein the text of the voice signal is obtained by the terminal device based on the voice signal;
    obtaining, by the server, a second instruction based on the voice signal;
    sending, by the server, the second instruction to the terminal device;
    receiving, by the server, a target instruction from the terminal device, wherein the target instruction is obtained by the terminal device based on a first instruction and the second instruction, the target instruction is the first instruction or the second instruction, and the first instruction is obtained by the terminal device based on the voice signal; and
    executing, by the server, the target instruction.
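The server-side flow of claim 13 mirrors the device side: derive a second instruction from the voice signal (or its pre-transcribed text), send it back, and execute the target instruction only if the terminal returns one. The in-process stub below is a sketch under stated assumptions: `transcribe` and `parse_intent` stand in for real ASR and NLU components, and every name is illustrative.

```python
# In-process sketch of the server-side flow in claim 13. The ASR/NLU
# stand-ins and all names are assumptions for illustration.
def transcribe(audio: bytes) -> str:
    return "open the window"          # stand-in for a real ASR service


def parse_intent(text: str) -> str:
    return "open_window" if "window" in text else "unknown"   # stand-in NLU


def handle_voice(payload: dict) -> dict:
    """Build the second instruction from the voice signal or its text."""
    text = payload.get("text") or transcribe(payload["audio"])
    return {"source": "server", "intent": parse_intent(text)}


def execute_target(instruction: dict) -> str:
    """Runs only when the terminal device returns the target instruction."""
    return f"executing {instruction['intent']}"


second = handle_voice({"text": "please open the window"})
print(execute_target(second))  # executing open_window
```

Note that `handle_voice` prefers the terminal-supplied text and only falls back to transcription, matching the "voice signal or text of the voice signal" alternative in the claim.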
  14. An electronic device, comprising one or more processors and one or more memories, wherein the one or more memories are coupled to the one or more processors and are configured to store computer program code, the computer program code comprising computer instructions; and when the one or more processors execute the computer instructions, the electronic device is caused to perform the method according to any one of claims 1-13.
  15. A computer program product containing instructions, wherein when the computer program product runs on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-13.
  16. A computer-readable storage medium comprising instructions, wherein when the instructions run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-13.
  17. A human-machine dialogue system, comprising a terminal device and a server, wherein the terminal device is configured to perform the method according to any one of claims 1-12, and the server is configured to perform the method according to claim 13.
PCT/CN2023/099440 2022-06-13 2023-06-09 Human-machine dialogue method, device, and system WO2023241482A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210663567.1A CN117275470A (zh) 2022-06-13 2022-06-13 Human-machine dialogue method, device, and system
CN202210663567.1 2022-06-13

Publications (1)

Publication Number Publication Date
WO2023241482A1 true WO2023241482A1 (zh) 2023-12-21

Family

ID=89192198

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/099440 WO2023241482A1 (zh) 2022-06-13 2023-06-09 Human-machine dialogue method, device, and system

Country Status (2)

Country Link
CN (1) CN117275470A (zh)
WO (1) WO2023241482A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013064777A (ja) Terminal device, speech recognition program, speech recognition method, and speech recognition system
CN104134442A (zh) Method and apparatus for starting a voice service
CN104681026A (zh) Speech recognition terminal and system, server and control method thereof, and non-volatile storage medium
CN105931645A (zh) Control method and apparatus for a virtual reality device, and virtual reality device and system
CN106847291A (zh) Speech recognition system and method combining local and cloud processing
CN111312253A (zh) Voice control method, cloud server, and terminal device
CN113053369A (zh) Voice control method and apparatus for a smart home appliance, and smart home appliance
CN114550719A (zh) Method and apparatus for recognizing voice control instructions, and storage medium


Also Published As

Publication number Publication date
CN117275470A (zh) 2023-12-22

Similar Documents

Publication Publication Date Title
WO2021027267A1 (zh) Voice interaction method and apparatus, terminal, and storage medium
CN110910872B (zh) Voice interaction method and apparatus
WO2020168929A1 (zh) Method for recognizing a specific position on a specific route, and electronic device
US20220310095A1 Speech Detection Method, Prediction Model Training Method, Apparatus, Device, and Medium
CN110138959B (zh) Method for displaying prompts of human-machine interaction instructions, and electronic device
WO2020177619A1 (zh) Terminal charging reminder method, apparatus, device, and storage medium
CN110825469A (zh) Voice assistant display method and apparatus
WO2022242699A1 (zh) Information recommendation method and related device
CN110716776A (zh) Method for displaying a user interface, and in-vehicle terminal
KR20170061489A (ko) Electronic device and method for controlling a transport device thereof
WO2021088393A1 (zh) Pose determination method, apparatus, and system
US20220116758A1 Service invoking method and apparatus
US20230410809A1 Method for interaction between mobile terminal and in-vehicle terminal, terminal, and system
WO2022037398A1 (zh) Audio control method, device, and system
CN111835904A (zh) Method for launching an application based on context awareness and user profiling, and electronic device
CN111222836A (zh) Arrival reminder method and related apparatus
US20230169467A1 Reminding Method and Related Apparatus
WO2021238371A1 (zh) Method and apparatus for generating a virtual character
WO2023169448A1 (zh) Method and apparatus for sensing a target
CN112269939B (zh) Scenario search method for autonomous driving, apparatus, terminal, server, and medium
WO2024001940A1 (zh) Vehicle locating method, apparatus, and electronic device
WO2023071940A1 (zh) Method, apparatus, device, and storage medium for synchronizing a navigation task across devices
WO2023241482A1 (zh) Human-machine dialogue method, device, and system
CN113380240A (zh) Voice interaction method and electronic device
WO2023104075A1 (zh) Method, electronic device, and system for sharing navigation information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23823050

Country of ref document: EP

Kind code of ref document: A1