WO2022252946A1

WO2022252946A1 - Voice control method, voice control device, server, and storage medium

Info

Publication number: WO2022252946A1
Application number: PCT/CN2022/092246
Authority: WO
Inventors: 赵耀; 易晖; 翁志伟
Original assignee: 广州小鹏汽车科技有限公司
Priority date: 2021-06-03
Filing date: 2022-05-11
Publication date: 2022-12-08
Also published as: CN113421561A; CN113421561B

Abstract

A voice control method, a voice control device (100), a server (500), and a storage medium (800). The voice control method comprises: receiving a voice instruction of a current round, receiving graphical user interface information, and fusing the graphical user interface information and voice dialogue information of a historical round to generate a dynamic scene (01); generating a scene semantic document according to the dynamic scene (02); determining, according to the scene semantic document, a semantic understanding corresponding to the voice instruction of the current round (03); determining a semantic understanding result according to the semantic understanding corresponding to the voice instruction of the current round or a global semantic understanding (04); and controlling, according to the semantic understanding result, a vehicle (1000) to perform a corresponding operation (05).

Description

Voice control method, voice control device, server and storage medium

Priority Information

This application claims priority to and the benefit of patent application No. 202110619459.X, filed with the State Intellectual Property Office of China on June 3, 2021, which is hereby incorporated by reference in its entirety.

Technical Field

The present application relates to the technical field of voice recognition, in particular to a voice control method, a voice control device, a server and a storage medium.

Background

In the related art, when handling a complex task, a voice control device asks the user for as many task details as possible, and can understand the user's specific intention only after multiple rounds of voice dialogue with the user. Such multi-round voice dialogue completes the description of the user's intention by fusing the semantic understanding result of each single round with the information from multiple rounds. However, in scenarios involving multiple vertical domains, it is difficult for the voice control device to extend this kind of multi-round voice dialogue to every vertical domain. Moreover, as the number of vertical domains increases, the accuracy of semantic understanding decreases, which ultimately leads to a poor user experience.

Summary of the Invention

Embodiments of the present application provide a voice control method, a voice control device, a server, and a storage medium.

The voice control method of the embodiments of the present application includes: receiving a voice command of the current round, receiving graphical user interface (GUI) information, and fusing the GUI information with the voice dialogue information of historical rounds to generate a dynamic scene; generating a scene semantic document according to the dynamic scene; determining, according to the scene semantic document, the semantic understanding corresponding to the voice command of the current round; determining a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding; and controlling the vehicle to perform a corresponding operation according to the semantic understanding result.

In some implementations, receiving the voice command of the current round, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene includes: when the voice command of the current round is received, determining a semantic space according to the voice dialogue information of the historical rounds, the semantic space being used to represent the semantic understanding direction of the voice command of the current round; and determining the dynamic scene according to the semantic space and the graphical user interface information.

In some implementations, receiving the voice command of the current round, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene includes: when the voice command of the current round is received, loading and analyzing the dynamic scene elements included in the voice dialogue information of the historical rounds; and generating the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.

In some implementations, the similarity between the document data of the scene semantic document and the dynamic scene element is greater than a similarity threshold.

In some implementations, determining the semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or the global semantic understanding includes: searching a database using the semantic understanding corresponding to the voice command of the current round; when there is a search result that matches the semantic understanding corresponding to the voice command of the current round, using that semantic understanding as the semantic understanding result; and when there is no search result that matches the semantic understanding corresponding to the voice command of the current round, using the global semantic understanding as the semantic understanding result.

In some implementations, controlling the vehicle to perform the corresponding operation according to the semantic understanding result includes: when the semantic understanding corresponding to the voice command of the current round is used as the semantic understanding result, updating the voice dialogue information of the historical rounds and sending an operation instruction so that the vehicle performs the corresponding operation; and when the global semantic understanding is used as the semantic understanding result, controlling the vehicle to initiate a new round of dialogue tasks.

In some implementations, updating the voice dialogue information of the historical rounds includes: querying, according to the voice dialogue information of the historical rounds, the dialogue action information output by the user and the dialogue action information output by the system, to obtain user slot parameters and system slot parameters; and using the user slot parameters and the system slot parameters to execute slot actions, update trusted slot parameters, and update dialogue state information.

In some implementations, the slot execution action includes at least one of a continuation action, a delete action, an update action, and an invalidation action.

In some implementations, updating the voice dialogue information of the historical rounds includes: determining the priority order of multiple scene pages in the dynamic scene; pushing, according to the priority order of the multiple scene pages, the node of the high-priority scene page onto the stack of the low-priority scene page; and controlling the vehicle to perform the operation corresponding to the high-priority scene page.

The voice control device of the embodiments of the present application includes a first generation module, a second generation module, a first determination module, a second determination module, and a control module. The first generation module is used to receive the voice command of the current round, receive the graphical user interface information, and fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene. The second generation module is used to generate a scene semantic document according to the dynamic scene. The first determination module is used to determine, according to the scene semantic document, the semantic understanding corresponding to the voice command of the current round. The second determination module is used to determine a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding. The control module is used to control the vehicle to perform a corresponding operation according to the semantic understanding result.

The server in the embodiments of the present application includes a memory and a processor. A computer program is stored in the memory, and when the computer program is executed by the processor, the voice control method in any one of the above-mentioned embodiments is implemented.

The non-volatile computer-readable storage medium of the embodiments of the present application stores a computer program; when the computer program is executed by one or more processors, the voice control method of any of the above embodiments is implemented.

The voice control method, voice control device, server, and storage medium of the embodiments of the present application can fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene, generate a scene semantic document according to the dynamic scene, and, according to the scene semantic document, confine the semantic understanding process to the task and manage the dialogue across multiple rounds, thereby improving the accuracy of semantic understanding for multi-round dialogue in the vertical domain.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, wherein:

FIG. 1 is a schematic flow diagram of a voice control method according to an embodiment of the present application;

FIG. 2 is a block diagram of a voice control device according to an embodiment of the present application;

FIG. 3 is a schematic diagram of modules of a server according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a vehicle according to an embodiment of the present application;

FIG. 5 is an interactive schematic diagram of a voice control method according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a voice control method according to an embodiment of the present application;

FIG. 7 to FIG. 9 are schematic diagrams of scenarios of voice control methods according to embodiments of the present application;

FIG. 10 is a schematic flowchart of a voice control method according to an embodiment of the present application;

FIG. 11 and FIG. 12 are schematic diagrams of scenarios of voice control methods according to embodiments of the present application;

FIG. 13 to FIG. 15 are schematic flowcharts of voice control methods according to embodiments of the present application;

FIG. 16 is a schematic diagram of a scene of a voice control method according to an embodiment of the present application;

FIG. 17 is a schematic flowchart of a voice control method according to an embodiment of the present application;

FIG. 18 is a schematic diagram of a scene of a voice control method according to an embodiment of the present application;

FIG. 19 is a schematic diagram of connection between a processor and a computer-readable storage medium according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary, and are intended to explain the present application, and should not be construed as limiting the present application.

In the description of the embodiments of the present application, the terms "first" and "second" are used for description purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of said features. In the description of the embodiments of the present application, "plurality" means two or more, unless otherwise specifically defined.

Referring to FIG. 1, the present application provides a voice control method, which includes:

Step 01: Receive the voice command of the current round, receive the GUI information, and fuse the GUI information with the voice dialogue information of historical rounds to generate a dynamic scene;

Step 02: Generate scene semantic documents according to dynamic scenes;

Step 03: Determine the semantic understanding corresponding to the voice command of the current round according to the scene semantic document;

Step 04: Determine the semantic understanding result according to the semantic understanding corresponding to the current round of voice commands or the global semantic understanding;

Step 05: Control the vehicle 1000 to perform corresponding operations according to the semantic understanding result.
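Steps 01 to 05 can be read as a simple pipeline. The following Python sketch is illustrative only: the function names, the dictionary layout of the dynamic scene, and the exact-match lookup are assumptions made for this example and are not specified by the present application.

```python
from typing import Optional


def build_dynamic_scene(gui_info: dict, history: list) -> dict:
    """Step 01: fuse GUI information with historical dialogue (assumed layout)."""
    return {"gui": gui_info, "history": list(history)}


def build_scene_document(scene: dict) -> dict:
    """Step 02: map phrases to semantic understandings (assumed layout)."""
    document = {}
    for element in scene["gui"].get("elements", []):
        for phrase in element["phrases"]:
            document[phrase.lower()] = element["intent"]
    return document


def understand_in_scene(command: str, document: dict) -> Optional[str]:
    """Step 03: look the command up in the scene semantic document."""
    return document.get(command.lower())


def understand_globally(command: str) -> str:
    """Hypothetical placeholder for the global semantic understanding."""
    return "global:" + command.lower()


def handle_round(command: str, gui_info: dict, history: list) -> str:
    scene = build_dynamic_scene(gui_info, history)         # step 01
    document = build_scene_document(scene)                 # step 02
    scene_result = understand_in_scene(command, document)  # step 03
    # Step 04: prefer the scene-level understanding, else fall back.
    if scene_result is not None:
        return scene_result
    return understand_globally(command)  # step 05 would act on the result
```

In this sketch, a command that matches a phrase of the current scene is resolved inside the scene, and any other command is handed to the global placeholder.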

Referring to FIG. 2, the voice control device 100 according to the embodiments of the present application includes a first generation module 10, a second generation module 20, a first determination module 30, a second determination module 40, and a control module 50. The voice control method of the present application can be implemented by the voice control device 100, where step 01 can be implemented by the first generation module 10, step 02 by the second generation module 20, step 03 by the first determination module 30, step 04 by the second determination module 40, and step 05 by the control module 50. That is to say, the first generation module 10 is used to receive the voice command of the current round, receive the graphical user interface information, and fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene. The second generation module 20 is used to generate a scene semantic document according to the dynamic scene. The first determination module 30 is used to determine, according to the scene semantic document, the semantic understanding corresponding to the voice command of the current round. The second determination module 40 is used to determine the semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or the global semantic understanding. The control module 50 is used to control the vehicle 1000 to perform the corresponding operation according to the semantic understanding result.

Referring to FIG. 3 and FIG. 4 together, the server 500 of the embodiments of the present application includes a memory 200 and a processor 300. The server 500 is used to control the vehicle 1000. The voice control method of the embodiments of the present application can be implemented by the server 500. The server 500 may include a system end, and a computer program is stored in the memory 200; when the computer program is executed by the processor 300, the above voice control method is implemented. Steps 01 to 05 can all be implemented by the processor 300. That is to say, the processor 300 can be used to: receive the voice command of the current round, receive the graphical user interface information, and fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene; generate a scene semantic document according to the dynamic scene; determine, according to the scene semantic document, the semantic understanding corresponding to the voice command of the current round; determine the semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or the global semantic understanding; and control the vehicle 1000 to perform the corresponding operation according to the semantic understanding result.

The processor 300 may include a driver board. The driver board may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

Specifically, the voice dialogue information of the historical rounds includes the historical dialogue between the user and the system, the voice command of the current round can be an action of the user, and the graphical user interface (GUI) information includes the graphical user interface that the on-board system running on the vehicle 1000 uses to present displayed content to the user.

In an embodiment, the voice command of the current round that is received may be the command "OK to close" issued by the user. When the voice command "OK to close" is received, the graphical user interface information is received at the same time. The voice dialogue information of the historical rounds includes two entries: the user's command "turn off the low-speed analog tone" and the system's confirmation "The low-speed analog tone can remind pedestrians and reduce safety risks. Are you sure you want to turn it off?" The graphical user interface information and the voice dialogue information of the historical rounds are fused to generate the dynamic scene.

It is worth mentioning that, in some implementations, the voice control device 100 can be used to control the vehicle 1000, and the vehicle 1000 includes a display area, an electro-acoustic element, a communication element, a processor, and the like. The display area of the vehicle 1000 may include the instrument screen, the vehicle-side large screen, and a head-up display that can be realized on the windshield of the vehicle 1000, which is not limited here. Specifically, referring to FIG. 5, the vehicle 1000 includes a dynamic scene generator and a vehicle-side large screen. The vehicle-side large screen can receive user requests and can also present the responses generated by the system to the user; the presentation methods include display and voice, which are not limited here. The vehicle-side large screen can perform natural language understanding on the received user request and, at the same time, pass the graphical user interface information on the vehicle-side large screen to the dynamic scene generator. The dynamic scene generator can combine the graphical user interface information with the voice dialogue information of historical rounds to generate the dynamic scene.

A scene semantic document can be generated from the dynamic scene. The scene semantic document can be understood as a searchable space that includes multiple semantic understandings, so that the semantic understanding corresponding to the voice command of the current round can be queried in the scene semantic document. It is worth mentioning that, in some implementations, if the semantic understanding corresponding to the voice command of the current round cannot be found in the scene semantic document generated from the dynamic scene, a global semantic understanding can be generated by combining global information. That is to say, a semantic understanding result can be obtained from either the semantic understanding corresponding to the voice command of the current round or the global semantic understanding. The difference is that the semantic understanding corresponding to the voice command of the current round is determined by searching the scene semantic document or by other methods, while the global semantic understanding cannot be found in the scene semantic document. Since both can serve as the semantic understanding result, the vehicle 1000 can be controlled to perform the corresponding operation according to the semantic understanding result.

Specifically, when the semantic understanding result is determined by the semantic understanding corresponding to the voice command of the current round, the voice dialogue information of the historical rounds can be updated, and the vehicle 1000 is then controlled to perform the corresponding operation. The operation can be an action such as "open the window", "close navigation", or "open music interface", which is not limited here. When the semantic understanding result is determined by the global semantic understanding, the voice dialogue information of the historical rounds is not updated, and a new round of dialogue tasks can be initiated.

The voice control method, the voice control device 100, and the server 500 of the embodiments of the present application can fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene, generate a scene semantic document according to the dynamic scene, and, according to the scene semantic document, confine the semantic understanding process to the task and manage the dialogue across multiple rounds, thereby improving the accuracy of semantic understanding for multi-round dialogue in the vertical domain.

Referring to Figure 6, in some embodiments, step 01 includes the steps of:

Step 012: In the case of receiving the voice command of the current round, determine the semantic space according to the voice dialogue information of the historical round, and the semantic space is used to represent the semantic understanding direction of the voice command of the current round;

Step 014: Determine the dynamic scene according to the semantic space and GUI information.

In some implementations, the voice control device 100 includes a third determining module, and the third determining module includes a first determining subunit and a second determining subunit. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 012 can be implemented by the first determining subunit and step 014 by the second determining subunit. That is to say, the first determining subunit is used to determine the semantic space according to the voice dialogue information of the historical rounds when the voice command of the current round is received, the semantic space being used to represent the semantic understanding direction of the voice command of the current round. The second determining subunit is used to determine the dynamic scene according to the semantic space and the GUI information.

In some implementations, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where both step 012 and step 014 can be implemented by the processor 300. That is to say, the processor 300 can be used to: when the voice command of the current round is received, determine the semantic space according to the voice dialogue information of the historical rounds, the semantic space being used to represent the semantic understanding direction of the voice command of the current round; and determine the dynamic scene according to the semantic space and the GUI information.

Specifically, the semantic space is determined by the speech dialogue information of the historical round, and the semantic space is used to represent the semantic understanding direction of the current round of speech instructions. Semantic space can be understood as a certain semantic range. Semantic space can include static semantic space and dynamic semantic space.

Please refer to FIG. 7, which shows a system round of a dialogue (that is, of a multi-round dialogue in a vertical domain), namely dialogue one. In FIG. 7, the dialogue system is asking the user whether the user understands the safety risk brought by the operation; if the user confirms, the corresponding action is performed. The latent semantic space for the next round is shown in FIG. 7. Please refer to FIG. 8, which also shows a system round of a dialogue (that is, of a multi-round dialogue in a vertical domain), namely dialogue two. In FIG. 8, the dialogue system is asking the user whether the user understands the consequences of the action; if the user confirms, the corresponding action is performed. The latent semantic space for the next round is shown in FIG. 8. It can be seen that the potential dialogue action information of the two dialogues in the confirmation round (the next round) is the same, but their latent semantic spaces are different. Moreover, the latent semantic space of each dialogue can already be determined in the system confirmation round, so it is a static semantic space.

Please refer to FIG. 9, which shows a system round of a dialogue (that is, of a multi-round dialogue in a vertical domain), namely dialogue three. In FIG. 9, the dialogue system is asking the user to make a selection. The latent semantic space of the user reply round (the next round), which is shown in FIG. 9, cannot be determined in the system inquiry round, so it is a dynamic semantic space.

The static semantic space can be understood as follows: the replies in the semantic space do not depend on factors such as time, scene, space, or user, whereas the dynamic semantic space has many variables. In one embodiment, different locations of the current user result in different content in the semantic space. For example, when a user in Zhongguancun says "navigate to Peking University" and a user in Shenzhen says the same thing, the resulting lists of optional routes are different, so the dynamic semantic space changes according to the region where the user is located.
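The contrast between the two kinds of semantic space can be sketched as follows. This is a minimal illustration, assuming a confirmation round whose reply space is fixed and a navigation round whose reply space depends on the user's region; the route names and the lookup table are hypothetical and do not come from the present application.

```python
# Static semantic space: the same candidate replies regardless of context.
STATIC_CONFIRM_SPACE = ("confirm", "deny")

# Hypothetical route data keyed by the user's region (made up for this sketch).
ROUTES_BY_REGION = {
    "Zhongguancun": ["route A", "route B"],
    "Shenzhen": ["route C"],
}


def static_semantic_space() -> tuple:
    """Return the fixed reply space of a confirmation round."""
    return STATIC_CONFIRM_SPACE


def dynamic_semantic_space(user_region: str) -> list:
    """Return a reply space that depends on where the user currently is."""
    return list(ROUTES_BY_REGION.get(user_region, []))
```

The static space can be prepared before the user replies, while the dynamic space has to be computed once the relevant variables (here, only the region) are known.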

In some implementations, the dynamic scene can be understood as: converting the semantic space into a readable tree structure, and retaining all information in the semantic space. In this way, dynamic scenes can be determined based on semantic space and GUI information.
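One way to picture a dynamic scene as a readable tree is sketched below. The `SceneNode` class, the two top-level branches, and the input layouts are assumptions chosen for illustration; the present application does not prescribe a concrete data structure.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One node of the dynamic scene tree (hypothetical structure)."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build_scene_tree(semantic_space: dict, gui_info: dict) -> SceneNode:
    """Combine a semantic space and GUI information into one tree."""
    root = SceneNode("dynamic_scene")
    space_node = SceneNode("semantic_space")
    for intent, phrases in semantic_space.items():
        # Every entry of the semantic space is kept as a child node.
        space_node.children.append(SceneNode(intent, {"phrases": list(phrases)}))
    gui_node = SceneNode("gui")
    for element in gui_info.get("elements", []):
        gui_node.children.append(SceneNode(element["label"], {"type": element["type"]}))
    root.children = [space_node, gui_node]
    return root


def render(node: SceneNode, depth: int = 0) -> str:
    """Return an indented, human-readable dump of the tree."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)
```

Calling `render` on the result gives an indented outline, which is one plausible reading of "readable tree structure".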

Referring to Figure 10, in some embodiments, step 01 includes the steps of:

Step 016: In the case of receiving the voice command of the current round, load and analyze the dynamic scene elements included in the voice dialogue information of the historical rounds;

Step 018: Generate a dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.

In some implementations, the voice control device 100 includes a first processing module and a second generation module. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 016 can be implemented by the first processing module and step 018 by the second generation module. That is to say, the first processing module is used to load and analyze the dynamic scene elements included in the voice dialogue information of the historical rounds when the voice command of the current round is received. The second generation module is used to generate the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.

In some implementations, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where both step 016 and step 018 can be implemented by the processor 300. That is to say, the processor 300 can be used to: when the voice command of the current round is received, load and analyze the dynamic scene elements included in the voice dialogue information of the historical rounds; and generate the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.

Specifically, dynamic scene elements have different presentation types, such as buttons, sliders, status buttons, text input boxes, checkboxes, radio buttons, group buttons, switch buttons, views, groups, dialog boxes, and other interactive, operable controls. In some implementations, tags can also be obtained, and the tags include dialogue action information and/or slot parameters. In this way, the dynamic scene elements included in the voice dialogue information of the historical rounds can be loaded and analyzed, and the dynamic scene can be generated according to the dynamic scene elements and the voice dialogue information of the historical rounds.

Specifically, the scene semantic document includes multiple pieces of document data, and the similarity between each piece of document data and the dynamic scene elements is greater than a similarity threshold. In this way, whether a given piece of document data belongs to the scene semantic document can be determined by comparing its similarity to the dynamic scene elements against the similarity threshold. If the similarity is less than the similarity threshold, the document data is considered not to belong to the scene semantic document; if the similarity is greater than or equal to the similarity threshold, the document data is considered to belong to the scene semantic document. It is worth mentioning that other methods can also be used to determine the document data of the scene semantic document, such as template matching, sentence similarity calculation, and model reading comprehension, which are not limited here.
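The threshold test can be sketched as a filter over candidate document data. The present application does not name a similarity measure, so word-level Jaccard similarity is used here purely as a stand-in, and comparing each candidate against its best-matching scene element is likewise an assumption of this example.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Word-level Jaccard similarity, used here only as a stand-in metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_document_data(candidates: list, scene_elements: list, threshold: float) -> list:
    """Keep candidates whose best similarity to a scene element meets the threshold."""
    selected = []
    for candidate in candidates:
        best = max(
            (jaccard_similarity(candidate, element) for element in scene_elements),
            default=0.0,
        )
        if best >= threshold:  # below the threshold: not part of the document
            selected.append(candidate)
    return selected
```

With the scene element "close the window" and a threshold of 0.5, the candidate "close window" (similarity 2/3) is kept, while "the weather is nice" (similarity 1/6) is dropped.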

In one embodiment, the generation process of the dynamic scene includes: loading the dialogue state information of the voice dialogue information of the historical rounds, including slot parameters, system dialogue action information, and other information; then inferring the potential user dialogue action information from the system dialogue action information; and finally generalizing the slot parameters, the labels of the dialogue action information, and so on through synonym generalization.
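The three stages of this generation process (load the state, infer potential user actions, generalize with synonyms) can be sketched as below. Both lookup tables and the layout of `dialog_state` are hypothetical; the present application does not list which user actions follow which system action, nor which synonyms are used.

```python
# Hypothetical mapping from a system dialogue action to likely user replies.
POTENTIAL_USER_ACTIONS = {
    "confirm": ["confirm", "deny"],
    "select": ["notify", "cancel"],
}

# Hypothetical synonym table used for generalization.
SYNONYMS = {
    "confirm": ["ok", "yes", "sure"],
    "deny": ["no", "cancel that"],
}


def generate_dynamic_scene(dialog_state: dict) -> dict:
    """Load the state, infer potential user actions, then generalize labels."""
    # Stage 1: load slot parameters and the system dialogue action.
    system_action = dialog_state["system_action"]
    slots = dict(dialog_state.get("slots", {}))
    # Stage 2: infer which user dialogue actions may follow.
    user_actions = POTENTIAL_USER_ACTIONS.get(system_action, [])
    # Stage 3: expand each action label with its synonyms.
    generalized = {
        action: [action] + SYNONYMS.get(action, [])
        for action in user_actions
    }
    return {"slots": slots, "user_actions": generalized}
```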

Please refer to FIG. 11, which lists user actions and system actions. User actions include notify, cancel, confirm, deny, and ask for more; system actions include ask, select, confirm, guide, deny, succeed, and fail. The table in FIG. 11 can be updated according to the dialogue state information. FIG. 11 contains multiple 1s and 0s, where 1 indicates that the system's turn and the user's turn are associated. For example, if the previous round is an inquiry action of the system and the next round is a reply action of the user, the entry is 1 when the system's inquiry is associated with the user's reply and 0 when it is not. In one example, the system asks the user "Do you want to close the window?" and the user answers "OK to close the window." The two sentences are related and have a contextual relationship, so the entry is recorded as 1 in the table. In another example, the system asks the user "Do you want to close the car window?" and the user answers "The weather is so nice." The two sentences are unrelated and have no contextual relationship, so the entry is recorded as 0 in the table. When the entry is 0, the system can regard the user's reply as an erroneous reply and treat it as noise, or it can ask the user again, for example: "What did you say? Let me describe the question again: do you want to close the car window?" In this way, the continuity of the dialogue can be ensured, which facilitates the generation of the dynamic scene.
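The association table can be modeled as a nested mapping from system action to user action. The rows below are illustrative only; the actual entries of FIG. 11 are not reproduced here, the identifier `ask_more` is a hypothetical name for the "ask for more" action, and classifying an utterance into a user action is assumed to happen elsewhere.

```python
# Illustrative rows only; the actual entries of FIG. 11 are not reproduced here.
ASSOCIATION = {
    "ask": {"notify": 1, "cancel": 1, "confirm": 0, "deny": 0, "ask_more": 1},
    "confirm": {"notify": 0, "cancel": 1, "confirm": 1, "deny": 1, "ask_more": 0},
}


def is_associated(system_action: str, user_action: str) -> bool:
    """Return True when the table records 1 for this pair of actions."""
    return ASSOCIATION.get(system_action, {}).get(user_action, 0) == 1


def next_step(system_action: str, user_action: str) -> str:
    """Accept an associated reply; otherwise treat it as noise and ask again."""
    return "accept" if is_associated(system_action, user_action) else "re-ask"
```

Pairs missing from the table default to 0 in this sketch, so an unexpected reply leads to the question being asked again rather than to an action being executed.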

Please refer to FIG. 12. In one embodiment, the user can speak an instruction: "Navigate to Starbucks", wherein the slot parameters include the destination and Starbucks, and the dialogue action information includes the user and notify. The system can reply: "I found multiple Starbucks for you, which one do you want to go to?", wherein the slot parameters include the destination search results (i.e. Starbucks at the North Gate of Peking University, Starbucks Zhongguancun Store, and Starbucks Starbucks Store), and the dialogue action information includes the system and select. The user then replies to the system: "The one on the north side of Peking University". The large screen in the car can display the destination search results (that is, Starbucks at the North Gate of Peking University, Starbucks Zhongguancun Store, and Starbucks Starbucks Store). For this reply, the slot parameters include the destination and Starbucks at the North Gate of Peking University, and the dialogue action information includes the user and notify. It is worth mentioning that the large screen in the car can also display a variety of other operations, such as exit, re-navigate, and the like.
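The dialogue of FIG. 12 may, for illustration, be recorded as a list of rounds, each carrying a speaker, a dialogue action and slot parameters. The `Turn` structure, the slot key names and the helper `current_destination` are hypothetical; only two of the listed search results are included in the candidate list of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # "user" or "system"
    action: str    # dialogue action, e.g. "notify" or "select"
    text: str
    slots: dict    # slot parameters attached to this round


dialogue = [
    Turn("user", "notify", "Navigate to Starbucks",
         {"destination": "Starbucks"}),
    Turn("system", "select",
         "I found multiple Starbucks for you, which one do you want to go to?",
         {"destination_candidates": [
             "Starbucks at the North Gate of Peking University",
             "Starbucks Zhongguancun Store",
         ]}),
    Turn("user", "notify", "The one on the north side of Peking University",
         {"destination": "Starbucks at the North Gate of Peking University"}),
]


def current_destination(turns):
    """Return the destination most recently given by the user, or None."""
    for turn in reversed(turns):
        if turn.speaker == "user" and "destination" in turn.slots:
            return turn.slots["destination"]
    return None


print(current_destination(dialogue))
```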

Referring to Figure 13, in some embodiments, step 04 includes the steps of:

Step 041: Use the semantic understanding corresponding to the current round of voice commands to search the database;

Step 042: When there is a search result that matches the semantic understanding corresponding to the voice command of the current round, take the semantic understanding corresponding to the voice command of the current round as the semantic understanding result;

Step 043: When there is no search result that matches the semantic understanding corresponding to the voice command of the current round, use the global semantic understanding as the semantic understanding result.

In some implementations, the voice control device 100 includes a second processing module, a third processing module, and a fourth processing module. The voice control method of the present application can be realized by the voice control device 100 of the embodiment of the present application, wherein step 041 can be implemented by the second processing module, step 042 can be implemented by the third processing module, and step 043 can be implemented by the fourth processing module. That is to say, the second processing module is used to search the database using the semantic understanding corresponding to the voice command of the current round. The third processing module is configured to take the semantic understanding corresponding to the voice command of the current round as the semantic understanding result when there is a search result matching the semantic understanding corresponding to the voice command of the current round. The fourth processing module is configured to use the global semantic understanding as the semantic understanding result when there is no search result that matches the semantic understanding corresponding to the voice command of the current round.

In some embodiments, the voice control method in the embodiment of the present application can be implemented by the server 500 in the embodiment of the present application, wherein step 041, step 042 and step 043 can all be implemented by the processor 300. That is to say, the processor 300 may be used to: search the database using the semantic understanding corresponding to the voice command of the current round; when there is a search result that matches the semantic understanding corresponding to the voice command of the current round, take the semantic understanding corresponding to the voice command of the current round as the semantic understanding result; and when there is no search result matching the semantic understanding corresponding to the voice command of the current round, take the global semantic understanding as the semantic understanding result.

Specifically, the database records the historical data of multiple rounds of dialogues, for example, including the context content of the historical dialogues, number of dialogue rounds, task tree diagram and other information. In one embodiment, as shown in FIG. 5 , the database includes a context memory, and the GUI information is uploaded to the context memory in real time while the current round of voice commands is uploaded. In some implementations, when performing semantic understanding, natural language understanding can be performed according to correlations in the database.

After the database is searched, if there is a result matching the semantic understanding corresponding to the voice command of the current round, the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result.

After the database is searched, if there is no result matching the semantic understanding corresponding to the current round of voice commands, the global semantic understanding is taken as the semantic understanding result.

That is to say, there are two kinds of semantic understanding results: one is to use the semantic understanding corresponding to the voice command of the current round as the semantic understanding result, and the other is to use the global semantic understanding as the semantic understanding result.
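The selection between the two kinds of semantic understanding results (steps 041 to 043) may be sketched as follows. The record layout of the database and the matching criterion (equality of an `intent` field) are assumptions made for this sketch; the present application does not specify how a match is determined.

```python
from dataclasses import dataclass, field


@dataclass
class Understanding:
    intent: str
    slots: dict = field(default_factory=dict)
    source: str = "scene"   # "scene" (current round, scene-based) or "global"


def search_database(database: list, understanding: Understanding) -> list:
    """Step 041: search the database (hypothetical criterion: same intent)."""
    return [record for record in database if record["intent"] == understanding.intent]


def resolve(database: list, scene_nlu: Understanding, global_nlu: Understanding) -> Understanding:
    """Steps 042 and 043: keep the scene-based understanding on a match, else fall back."""
    if search_database(database, scene_nlu):
        return scene_nlu
    return global_nlu


# Hypothetical context records of an ongoing navigation task.
database = [{"intent": "select_destination", "round": 2}]

hit = resolve(database,
              Understanding("select_destination", {"destination": "Starbucks"}),
              Understanding("navigate", source="global"))
miss = resolve(database,
               Understanding("play_music"),
               Understanding("play_music", source="global"))
print(hit.source, miss.source)
```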

It is worth mentioning that the data in the database will be updated during each round of dialogue. Wherein, the basis for updating may include but not limited to semantic understanding results, historical dialogue state information, etc., which are not limited here.

Referring to Figure 14, in some embodiments, step 05 includes the steps of:

Step 051: When the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result, update the voice dialogue information of the historical rounds, and send an operation instruction to make the vehicle 1000 perform a corresponding operation;

Step 052: When the global semantic understanding is taken as the semantic understanding result, control the vehicle 1000 to initiate a new round of dialogue tasks.

In some embodiments, the voice control device 100 includes a fifth processing module and a sixth processing module. The voice control method of the present application can be realized by the voice control device 100 of the embodiment of the present application, wherein step 051 can be realized by the fifth processing module and step 052 can be realized by the sixth processing module. That is to say, the fifth processing module is used to update the voice dialogue information of the historical rounds when the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result, and to send an operation instruction to make the vehicle 1000 perform a corresponding operation. The sixth processing module is used to control the vehicle 1000 to initiate a new round of dialogue tasks when the global semantic understanding is taken as the semantic understanding result.

In some implementations, the voice control method of the embodiment of the present application can be implemented by the server 500 of the embodiment of the present application, wherein both step 051 and step 052 can be implemented by the processor 300. That is to say, the processor 300 can be used to: when the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result, update the voice dialogue information of the historical rounds and send an operation instruction to make the vehicle 1000 perform the corresponding operation; and when the global semantic understanding is taken as the semantic understanding result, control the vehicle 1000 to initiate a new round of dialogue tasks.

Specifically, when the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result, the voice dialogue information of the historical rounds can be updated, and the updating process can be realized by the dialogue state information update module and the dialogue policy optimization module. In some implementations, the dialogue state information update module and the dialogue policy optimization module can be combined into one module, namely the dialogue management module. Updating the dialogue information includes updating the dialogue action information and the dialogue state information. After the updating and optimization, reply information (an operation instruction) can be generated, so that the operation instruction can be sent to make the vehicle 1000 perform the corresponding operation.
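A minimal sketch of the branching in steps 051 and 052 is given below. The dictionary layout of the semantic understanding result, the `source` field that distinguishes the two kinds of result, and the shape of the returned instruction are all assumptions of this sketch rather than features of the present application.

```python
def control_vehicle(result: dict, history: list) -> dict:
    """Return an instruction for the vehicle based on the semantic understanding result.

    `result` is assumed to look like {"source": ..., "intent": ..., "slots": {...}}.
    """
    if result["source"] == "scene":
        # Step 051: update the dialogue information of the historical rounds,
        # then emit an operation instruction for the vehicle.
        history.append(result)
        return {"type": "operation",
                "command": result["intent"],
                "slots": result["slots"]}
    # Step 052: the global understanding was used, so a new dialogue task is started.
    return {"type": "new_dialogue_task", "intent": result["intent"]}


history = []
op = control_vehicle(
    {"source": "scene", "intent": "navigate",
     "slots": {"destination": "Starbucks at the North Gate of Peking University"}},
    history,
)
new_task = control_vehicle({"source": "global", "intent": "play_music", "slots": {}}, history)
print(op["type"], new_task["type"], len(history))
```

In this sketch only the scene-based branch touches the history, which mirrors the statement that the historical dialogue information is updated in step 051.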

Referring to Figure 15, in some embodiments, step 051 includes the steps of:

Step 0511: Query the dialogue action information output by the user and the dialogue action information output by the system according to the voice dialogue information of the historical rounds, so as to obtain the user slot parameters and the system slot parameters;

Step 0512: Use the user slot parameters and the system slot parameters to perform slot actions, update the trusted slot parameters, and update the dialog state information.

In some implementations, the voice control device 100 includes a seventh processing module and an eighth processing module. The voice control method of the present application can be realized by the voice control device 100 of the embodiment of the present application, wherein step 0511 can be realized by the seventh processing module, Step 0512 can be implemented by the eighth processing module, that is to say, the seventh processing module is used to query the dialogue action information output by the user and the dialogue action information output by the system according to the voice dialogue information of the historical rounds, so as to obtain the user slot parameters and system slot parameters. The eighth processing module is used to execute slot actions by using user slot parameters and system slot parameters, and update trusted slot parameters to update dialog state information.

In some implementations, the voice control method of the embodiment of the present application can be implemented by the server 500 of the embodiment of the present application, wherein both step 0511 and step 0512 can be implemented by the processor 300. That is to say, the processor 300 can be used to: query, according to the voice dialogue information of the historical rounds, the dialogue action information output by the user and the dialogue action information output by the system, so as to obtain the user slot parameters and the system slot parameters; and use the user slot parameters and the system slot parameters to perform slot actions and update the trusted slot parameters, so as to update the dialogue state information.

Specifically, please refer to FIG. 16, which includes a group of multi-round dialogues. User slot parameters and system slot parameters can be obtained according to the user dialogue action information and the system dialogue action information, and task parameters can also be acquired. The user slot parameters and the system slot parameters are used to perform slot actions so as to update the trusted slot parameters, thereby updating the dialogue state information in each task. In some embodiments, the user slot parameter refers to the slot parameter requested by the user in each round, the system slot parameter refers to the slot parameter or candidate slot parameter that the system needs to ask about, select, or confirm, and the trusted slot parameter refers to the slot parameter that is finally output. In some embodiments, the slot action includes at least one of a continuation action, a deletion action, an update action, and an invalidation action.

Specifically, the slot actions include, but are not limited to, continuation actions, deletion actions, update actions, and invalidation actions. The continuation action means that the slot parameters are the same as those of the previous round and are not updated in the current round. The deletion action deletes existing slot parameters. The update action updates existing slot parameters. The invalidation action means that some slot parameters related to the task are no longer of concern in subsequent dialogues.
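The four slot actions may be sketched as operations on a dictionary of trusted slot parameters. The representation of invalidation as a separate set of slot names that are ignored in later rounds is an assumption of this sketch, as are the names `SlotAction` and `apply_slot_action`.

```python
from enum import Enum


class SlotAction(Enum):
    CONTINUE = "continue"       # keep the value of the previous round
    DELETE = "delete"           # delete an existing slot parameter
    UPDATE = "update"           # overwrite an existing slot parameter
    INVALIDATE = "invalidate"   # stop tracking the slot in later rounds


def apply_slot_action(trusted: dict, ignored: set, action: SlotAction,
                      name: str, value=None):
    """Return new (trusted, ignored) collections after applying one slot action."""
    trusted = dict(trusted)
    ignored = set(ignored)
    if action is SlotAction.CONTINUE:
        pass  # nothing changes in the current round
    elif action is SlotAction.DELETE:
        trusted.pop(name, None)
    elif action is SlotAction.UPDATE:
        trusted[name] = value
    elif action is SlotAction.INVALIDATE:
        # Assumption: an invalidated slot is removed and remembered as ignored.
        trusted.pop(name, None)
        ignored.add(name)
    return trusted, ignored


trusted, ignored = {"destination": "Starbucks"}, set()
trusted, ignored = apply_slot_action(
    trusted, ignored, SlotAction.UPDATE, "destination",
    "Starbucks at the North Gate of Peking University")
print(trusted, ignored)
```

Returning new collections rather than mutating the inputs keeps the state of each round available for comparison with the next round.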

Referring to Figure 17, in some embodiments, step 051 includes the steps of:

Step 0513: Determine the priority sequence of multiple scene pages in the dynamic scene;

Step 0514: According to the priority order of multiple scene pages, push the high-priority scene page nodes into the low-priority scene page stack;

Step 0515: Control the vehicle 1000 to perform corresponding operations corresponding to high-priority scene pages.

In some embodiments, the voice control device 100 includes a judging module, a ninth processing module, and a tenth processing module. The voice control method of the present application can be realized by the voice control device 100 in the embodiment of the present application, wherein step 0513 can be realized by the judging module, step 0514 can be implemented by the ninth processing module, and step 0515 can be implemented by the tenth processing module. That is to say, the judging module is used to determine the priority order of multiple scene pages in the dynamic scene. The ninth processing module is used to push the high-priority scene page nodes into the low-priority scene page stack according to the priority order of the multiple scene pages. The tenth processing module is used to control the vehicle 1000 to perform the operation corresponding to the high-priority scene page.

In some implementations, the voice control method of the embodiment of the present application can be implemented by the server 500 of the embodiment of the present application, wherein step 0513, step 0514 and step 0515 can all be implemented by the processor 300. That is to say, the processor 300 can be used to: determine the priority order of multiple scene pages in a dynamic scene; push high-priority scene page nodes into the low-priority scene page stack according to the priority order of the multiple scene pages; and control the vehicle 1000 to perform the operation corresponding to the high-priority scene page.

Specifically, a dynamic scene may include multiple scene pages, the multiple scene pages may be prioritized, and high-priority scene page nodes may be pushed into the low-priority scene page stack. Please refer to FIG. 18, which shows three scene pages: the first scene page A1, the second scene page A2 and the third scene page A3. The three scene pages can be regarded as a stack-like structure, and each scene page corresponds to a dialogue task; that is to say, each stack can be regarded as a dialogue task. In some implementations, the priority includes the page depth and the priority of each element within the task. The page depth is X, the priority of each element in the task is Y, and the higher the priority Y, the higher the page depth. In this way, when relatively complicated multi-round dialogues are handled, the logical relationship of the scene pages can be used to update the dynamic scene. In FIG. 18, if the "Details" button on the first scene page A1 is hit, the third scene page A3 pops up; if the "Details" button on the first scene page A1 is hit while the second scene page A2 is the current page, the third scene page A3 pops up and covers the second scene page A2. It can thus be understood that, among multiple scene pages, the scene page entered first is popped last, and the scene page entered last is popped first.
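The last-in, first-out behaviour of the scene pages may be sketched with an ordinary list used as a stack. The numeric priorities assigned to A1, A2 and A3 below are hypothetical, and the helper `build_page_stack` is an assumption of this sketch.

```python
def build_page_stack(pages):
    """Build a stack from (name, priority) pairs with the highest priority on top.

    Lower-priority pages are pushed first, so higher-priority page nodes end up
    above them, at the end of the list.
    """
    stack = []
    for name, _priority in sorted(pages, key=lambda page: page[1]):
        stack.append(name)  # push
    return stack


# Hypothetical priorities for the three scene pages of FIG. 18.
pages = [("A2", 2), ("A1", 1), ("A3", 3)]
stack = build_page_stack(pages)

current = stack[-1]   # page on top of the stack; operations apply to it
closed = stack.pop()  # the page entered last is popped first
print(current, closed, stack)
```

After A3 is popped, the second scene page A2 is again on top of the stack, which corresponds to A3 having covered A2 while it was displayed.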

Referring to FIG. 19, the embodiment of the present application also provides a non-volatile computer-readable storage medium 800 on which a computer program is stored. When the computer program is executed by one or more processors 300, the processors 300 execute the steps of the voice control method in any of the above-mentioned implementation manners.

For example, when the program is executed by the processor 20, the steps of the following voice control method are realized:

Step 02: Generate scene semantic documents according to dynamic scenes;

In this way, the non-volatile computer-readable storage medium 800 in the embodiment of the present application can fuse the graphical user interface information and the voice dialogue information of historical rounds to generate a dynamic scene, generate a scene semantic document according to the dynamic scene, and, according to the scene semantic document, limit the semantic understanding process to within one task and manage the speech over multiple rounds, thereby improving the accuracy of semantic understanding of multiple rounds of dialogue in the vertical domain.

It can be understood that a computer program includes computer program code. The computer program code may be in source code form, object code form, an executable file or some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), software distribution media, etc.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "exemplary embodiments", "example", "specific examples" or "some examples" etc. means that the specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the described specific features, structures, materials or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

Any process or method descriptions in flowcharts or otherwise described herein may be understood to represent modules, segments or portions of code comprising one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous fashion or in reverse order depending on the functions involved, which should be understood by those skilled in the art to which the embodiments of the present application belong.

The logic and/or steps represented in the flowcharts or otherwise described herein, for example, can be considered as a sequenced listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processing module, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute the instructions). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transmit a program for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection with one or more wires (electronic device), a portable computer disk case (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpretation or other suitable processing if necessary, and then stored in computer memory.

It should be understood that each part of the embodiments of the present application may be realized by hardware, software, firmware or a combination thereof. In the embodiments described above, various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, it can be implemented by any one or combination of the following techniques known in the art: Discrete logic circuits, ASICs with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.

Those of ordinary skill in the art can understand that all or part of the steps carried by the method of the above-mentioned embodiments can be completed by instructing related hardware through a program, and the program can be stored in a computer-readable storage medium; when executed, the program includes one or a combination of the steps of the method embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing module, each unit may exist separately physically, or two or more units may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules. If the integrated modules are realized in the form of software function modules and sold or used as independent products, they can also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disk, and the like.

Although the embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limitations on the present application, and those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present application.

Claims

1. A voice control method, characterized by comprising:

receiving a voice command of a current round, receiving graphical user interface information, and fusing the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene;

generating a scene semantic document according to the dynamic scene;

determining, according to the scene semantic document, a semantic understanding corresponding to the voice command of the current round;

determining a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding; and

controlling a vehicle to perform a corresponding operation according to the semantic understanding result.

2. The voice control method according to claim 1, characterized in that receiving the voice command of the current round, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of the historical rounds to generate the dynamic scene comprises:

in the case of receiving the voice command of the current round, determining a semantic space according to the voice dialogue information of the historical rounds, the semantic space being used to represent the semantic understanding direction of the voice command of the current round; and

determining the dynamic scene according to the semantic space and the graphical user interface information.

3. The voice control method according to claim 1, characterized in that receiving the voice command of the current round, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of the historical rounds to generate the dynamic scene comprises:

in the case of receiving the voice command of the current round, loading and analyzing the dynamic scene elements included in the voice dialogue information of the historical rounds; and

generating the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.

4. The voice control method according to claim 3, wherein the similarity between the document data of the scene semantic document and the dynamic scene elements is greater than a similarity threshold.

5. The voice control method according to claim 1, wherein determining the semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or the global semantic understanding comprises:

searching a database using the semantic understanding corresponding to the voice command of the current round;

when there is a search result that matches the semantic understanding corresponding to the voice command of the current round, taking the semantic understanding corresponding to the voice command of the current round as the semantic understanding result; and

when there is no search result that matches the semantic understanding corresponding to the voice command of the current round, taking the global semantic understanding as the semantic understanding result.

6. The voice control method according to claim 5, wherein controlling the vehicle to perform the corresponding operation according to the semantic understanding result comprises:

when the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result, updating the voice dialogue information of the historical rounds, and sending an operation instruction to make the vehicle perform the corresponding operation; and

when the global semantic understanding is taken as the semantic understanding result, controlling the vehicle to initiate a new round of dialogue tasks.

7. The voice control method according to claim 6, wherein updating the voice dialogue information of the historical rounds comprises:

querying, according to the voice dialogue information of the historical rounds, the dialogue action information output by the user and the dialogue action information output by the system, so as to obtain user slot parameters and system slot parameters; and

performing a slot action using the user slot parameters and the system slot parameters, and updating trusted slot parameters, so as to update the dialogue state information.

8. The voice control method according to claim 7, wherein the slot action includes at least one of a continuation action, a deletion action, an update action and an invalidation action.

9. The voice control method according to claim 6, wherein updating the voice dialogue information of the historical rounds comprises:

determining the priority order of multiple scene pages in the dynamic scene;

pushing, according to the priority order of the multiple scene pages, the high-priority scene page nodes into the low-priority scene page stack; and

controlling the vehicle to perform the operation corresponding to the high-priority scene page.

10. A voice control device, characterized by comprising:

a first generation module, used to receive a voice command of a current round, receive graphical user interface information, and fuse the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene;

a second generation module, used to generate a scene semantic document according to the dynamic scene;

a first determination module, used to determine, according to the scene semantic document, a semantic understanding corresponding to the voice command of the current round;

a second determination module, used to determine a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding; and

a control module, used to control a vehicle to perform a corresponding operation according to the semantic understanding result.

11. A server, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the voice control method according to any one of claims 1-9 is realized.

12. A non-volatile computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by one or more processors, the voice control method according to any one of claims 1-9 is realized.