Disclosure of Invention
In view of the above, embodiments of the present invention are proposed in order to provide a voice interaction method, a vehicle, a server, a system and a storage medium that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a voice interaction method, which is applied to a voice interaction system including a vehicle and a server capable of communicating with the vehicle, and is characterized by comprising:
the vehicle receives a voice request of a user and sends the voice request and the context information of the current vehicle-mounted system graphical user interface to the server;
the server performs natural language understanding of the voice request according to the context information;
the server generates a vehicle-executable instruction from the natural language understanding result and sends the instruction to the vehicle;
the vehicle receives and executes the instruction, and simultaneously feeds back an execution result to the user through voice.
Further, the context information includes the name and type of the operable control in the current vehicle-mounted system graphical user interface, the action supported by the operable control, the value range of the action, and the current state of the operable control.
Further, the server completes natural language understanding processing of the voice request according to the context information, and the processing comprises the following steps:
creating a scene semantic space according to the context information;
performing semantic understanding on the voice request and outputting a semantic understanding result;
in the scene semantic space, retrieving, recalling, sorting and matching operable controls using the semantic understanding result;
and outputting the operation of the operable control responding to the voice request as a natural language understanding processing result.
Further, creating a scene semantic space according to the context information, comprising:
receiving context information sent by a vehicle;
loading and analyzing scene elements included in the context information;
and generating a scene semantic document according to the scene elements.
Further, performing semantic understanding on the voice request and outputting a semantic understanding result, including:
performing text preprocessing and text normalization processing on a text in the voice request, and then extracting a sentence backbone;
and understanding the intention of the voice request of the user according to the sentence backbone and outputting a semantic understanding result.
Further, understanding the intention of the user voice request according to the sentence backbone and outputting a semantic understanding result, comprising:
and determining a preliminary result for understanding the intention of the user voice request according to the sentence backbone, correcting the preliminary result by using a negative word in the sentence backbone, and outputting a corrected semantic understanding result.
Further, in the scene semantic space, the operable controls are retrieved, recalled, sorted and matched by using the semantic understanding result, which includes:
extracting a text in the voice request to retrieve in a scene semantic document;
recalling the retrieval result by using a preset recall strategy, and scoring the matching degree;
sorting the scored retrieval results according to a preset sorting strategy;
outputting a matching result according to the sorting result; wherein the matching result comprises the operation intention of the operable control, the name of the operable control and the execution action of the operable control.
Further, the text in the voice request includes all or part of the text in the voice request, and extracting the text in the voice request to retrieve in the scene semantic document includes any one of the following:
extracting entity words in the voice request to search in the scene semantic document;
extracting texts including entity words and action words in the voice request and searching in the scene semantic documents;
or,
all text in the voice request is extracted and retrieved in the scene semantic document.
Further, recalling the retrieval result by using a preset recall strategy, comprising:
and recalling, according to the retrieval result, using one or more preset recall strategies, including: omitting text based on a preset list of negligible words, requiring core words to be hit, setting a recall threshold, and checking action words or negative intentions in the text.
The embodiment of the invention also discloses a vehicle, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the steps of the voice interaction method described above.
The embodiment of the invention also discloses a server, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the steps of the voice interaction method described above.
The embodiment of the invention also discloses a voice interaction system, which comprises a vehicle and a server capable of communicating with the vehicle, wherein the vehicle is provided with a request receiving module, an information sending module, an instruction receiving module and an execution feedback module, and the server is provided with a natural language understanding module and an instruction sending module;
the request receiving module is used for receiving a voice request of a user;
the information sending module is used for sending the voice request and the context information of the current vehicle-mounted system graphical user interface to the server;
the natural language understanding module is used for finishing natural language understanding processing of the voice request according to the context information;
the instruction sending module is used for generating a vehicle-executable instruction from the natural language understanding result and sending the instruction to the vehicle;
and the instruction receiving module is used for receiving and executing the instruction, and simultaneously feeding back an execution result to the user through voice by the execution feedback module.
The embodiment of the invention also discloses a computer-readable storage medium, which is characterized in that a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to realize the voice interaction method.
The embodiment of the invention has the following advantages:
the context information of the Graphical User Interface (GUI) of the current vehicle-mounted system is sent to the server, so that the server can make full use of the context information to complete natural language understanding during voice interaction. Because information of more dimensions is added, the user can operate, by voice, any content visible on the GUI in the vehicle, which improves the interaction quality of the human-computer interaction system.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a voice interaction method of the present invention is shown, which may specifically include the following steps:
and S1, the vehicle receives the voice request of the user and sends the voice request and the context information of the current vehicle-mounted system graphical user interface to the server.
S2, the server completes the natural language understanding process of the voice request according to the context information.
And S3, the server generates a vehicle-executable instruction from the natural language understanding result and sends the instruction to the vehicle.
And S4, the vehicle receives and executes the instruction, and simultaneously feeds back the execution result to the user through voice.
The voice interaction method is applied to a voice interaction system which comprises a vehicle and a server capable of communicating with the vehicle. Specifically, a communication module is arranged on the vehicle, and can communicate with a server based on an operator network including 3G, 4G or 5G or other communication connection modes to complete data interaction.
In a vehicle, the display area may include an instrument panel, a vehicle-mounted center control screen, and a HUD (Head Up Display) that can be implemented on the windshield of the vehicle. The display area of the vehicle-mounted system running on the vehicle using a Graphical User Interface (GUI) includes a plurality of UI elements, and different display areas may display different UI elements or the same UI element. UI elements may include card objects, application icons or interfaces, folder icons, multimedia file icons, and operable controls for interaction.
In the step S1, the context information includes names and types of the operable controls in the current vehicle-mounted system graphical user interface, actions supported by the operable controls, value ranges of the actions, and current states of the operable controls.
Taking fig. 2 as an example, when viewing fig. 2, the user may directly issue voice requests such as "navigation broadcast volume is set to 18" or "system alert tone is turned off". Fig. 2 contains three operable controls: the first is a Slide type control named "navigation broadcast volume", the second is a SelectTab type control named "vehicle alert tone", and the third is a Switch type control named "system alert tone". Each control has supported actions, a value range for each action, and a current state.
For example, for the control named "navigation broadcast volume", the user may drag to adjust the volume value; that is, the supported action is Set, the value range of the action is 0 to 30, and the current state is a volume of 16.
Continuing with the control named "vehicle alert tone", this control may be set to "small", "medium" or "large"; that is, the supported action is Set, the value range of this action is "small", "medium" and "large", and the current state is that the vehicle alert tone is set to "small".
Taking the control named "system alert tone" as an example, the control can be turned on and off; that is, the supported actions are Turn On and Turn Off, and the current state is that the system alert tone is turned on.
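As an illustration, the context information for the three controls above can be sketched as structured data. The field names used here (label, type, actions, value_range, state) are illustrative assumptions, not the actual schema of the embodiment:

```python
# A minimal sketch of the context information for the three controls in
# Fig. 2, expressed as Python data. Field names are illustrative
# assumptions, not the actual on-vehicle schema.
context_info = [
    {"label": "navigation broadcast volume", "type": "Slide",
     "actions": ["Set"], "value_range": (0, 30), "state": 16},
    {"label": "vehicle alert tone", "type": "SelectTab",
     "actions": ["Set"], "value_range": ["small", "medium", "large"],
     "state": "small"},
    {"label": "system alert tone", "type": "Switch",
     "actions": ["Turn On", "Turn Off"], "value_range": ["on", "off"],
     "state": "on"},
]

def describe(control):
    """Render one control as a human-readable line."""
    return f'{control["label"]} ({control["type"]}): state={control["state"]}'
```

Each entry carries exactly the four kinds of information the method requires: name and type, supported actions, the value range of those actions, and the current state.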
Specifically, as shown in fig. 3, the step of S2 includes:
s20, creating scene semantic space according to the context information;
s21, carrying out semantic understanding on the voice request and outputting a semantic understanding result;
s22, in the scene semantic space, the operable control is searched, recalled, sorted and matched by the semantic understanding result;
s23, outputting an operation of the operable control in response to the voice request as a result of the natural language understanding process.
The scene semantic space is a machine-understandable semantic space created from the context information of the GUI. In step S20, based on fig. 2, the server creates a scene semantic space from the context information, an example of which is shown in table 1 below:
TABLE 1
Specifically, the step of S20 includes:
s201, receiving context information sent by a vehicle;
s202, loading and analyzing scene elements included in the context information;
and S203, generating a scene semantic document according to the scene elements.
In step S201, the vehicle sends the context information to the server in the form of a Json file through a communication network including, but not limited to, an operator network. Fig. 4 is an example of such a Json file; in other embodiments, other file formats may also be used to send the context information, which is not limited herein. In fig. 4, label represents the name of an operable control, and type represents the type of the operable control.
In the step of S202, the server loads and parses the Json file to obtain scene elements recorded in the file, where the scene elements include a plurality of operable controls and other UI elements.
In step S203, the server generates a scene semantic document in which a scene semantic space is described, based on the scene element.
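Steps S201 to S203 can be sketched as follows. The Json keys mirror the "label"/"type" fields of fig. 4; the remaining keys and the dictionary-shaped "scene semantic document" are illustrative assumptions:

```python
import json

# Hedged sketch of S201-S203: receive the context as JSON, parse its
# scene elements, and emit a simple scene semantic document that indexes
# each operable control by name.
raw = '''[
  {"label": "system alert tone", "type": "Switch",
   "actions": ["TurnOn", "TurnOff"], "state": "on"},
  {"label": "navigation broadcast volume", "type": "Slide",
   "actions": ["Set"], "range": [0, 30], "state": 16}
]'''

def build_scene_document(json_text):
    """S202/S203: parse scene elements and build the semantic document."""
    elements = json.loads(json_text)          # S202: load and parse
    return {e["label"]: e for e in elements}  # S203: one entry per control

scene_doc = build_scene_document(raw)
```

The resulting document can then be queried by control name during retrieval in step S22.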
Further, the step of S21 includes:
s211, performing text preprocessing and text normalization processing on the text in the voice request, and then extracting a sentence backbone;
s212, understanding the intention of the user voice request according to the sentence backbone and outputting a semantic understanding result.
In step S211, text preprocessing is performed on the text of the voice request, including Chinese word segmentation and removal of modal particles (filler words). The text normalization process includes normalization of numbers and entities; for example, "one dot five seconds" becomes "1.5 seconds" after normalization, and "large screen brightness" becomes "center control brightness" after normalization. Extracting the sentence backbone means extracting the entity words, action words and numerical values in the sentence; the extracted backbone is mainly used for subsequent retrieval.
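A minimal sketch of step S211 follows. The tiny word lists stand in for real segmentation and entity-recognition models, and the normalization rules shown are only the examples from the text:

```python
import re

# Illustrative sketch of S211: normalize the text, then pull out a crude
# sentence backbone of entity words, action words and numbers. The word
# lists below are assumptions for demonstration, not the real lexicons.
ACTION_WORDS = {"set", "open", "close", "turn"}
ENTITY_WORDS = {"system alert tone", "navigation broadcast volume"}

def normalize(text):
    """Lowercase, drop filler words, rewrite spelled-out decimals."""
    text = text.lower().strip()
    text = text.replace("one dot five", "1.5")    # number normalization
    return re.sub(r"\b(please|ah)\b", "", text).strip()

def extract_backbone(text):
    """Return (entities, actions, numbers) found in the normalized text."""
    entities = [e for e in ENTITY_WORDS if e in text]
    actions = [w for w in text.split() if w in ACTION_WORDS]
    numbers = re.findall(r"\d+(?:\.\d+)?", text)
    return entities, actions, numbers
```

For example, `extract_backbone(normalize("please set navigation broadcast volume to 18"))` yields one entity word, one action word and one numerical value, which is the backbone used for retrieval.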
In the step of S212, the intention of the user can be understood by using the extracted action words in the sentence skeleton, which facilitates subsequent verification of the operable control.
Further, the step of S212 includes: determining a preliminary result for understanding the intention of the user voice request according to the sentence backbone, correcting the preliminary result by using a negative word in the sentence backbone, and outputting the corrected semantic understanding result. For example, if the text corresponding to the user's voice request is "do not open system alert tone", a preliminary result including the action word "open" and the entity word "system alert tone" is obtained. However, if "open system alert tone" were taken as the semantic understanding result, its meaning would be the opposite of what the user intends. Therefore, after the preliminary result is obtained, it is determined whether the sentence backbone contains a negative word; here the text contains "do not", which is extracted to correct the preliminary result, i.e., "do not open" is understood as "close". The corrected semantic understanding result is "turn off system alert tone".
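The negation correction in S212 can be sketched as below. The negative-word list and the action-flip table are illustrative assumptions:

```python
# Sketch of the negation correction in S212: a preliminary intent is
# derived from the sentence backbone, then flipped if a negative word
# appears in the text. A simple substring check stands in for real
# backbone analysis; the word lists are assumptions.
NEGATIVES = {"not", "don't", "do not"}
FLIP = {"open": "close", "close": "open"}

def understand(text, action, entity):
    """Return the corrected (action, entity) intent for the request."""
    negated = any(neg in text for neg in NEGATIVES)
    if negated:
        action = FLIP.get(action, action)
    return action, entity
```

With the example from the text, the preliminary result ("open", "system alert tone") for "do not open system alert tone" is corrected to ("close", "system alert tone").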
In the step of S22, the method specifically includes:
s221, extracting the text in the voice request to retrieve in the scene semantic document;
s222, recalling the retrieval result by using a preset recall strategy, and scoring the matching degree;
s223, sorting the scored retrieval results according to a preset sorting strategy;
s224, outputting a matching result according to the sorting result; wherein the matching result comprises the operation intention of the operable control, the name of the operable control and the execution action of the operable control.
In step S221, a vocabulary of segmented words is created in advance from scenes such as navigation and music, and then a search is performed based on the vocabulary. When searching, various searching strategies can be used according to the utilization modes of different texts. That is, the text in the voice request includes all or part of the text in the voice request, the step S221 includes any one of the following steps:
extracting entity words in the voice request to search in the scene semantic document;
extracting texts including entity words and action words in the voice request and searching in the scene semantic documents;
or,
all text in the voice request is extracted and retrieved in the scene semantic document.
Among the three retrieval strategies listed above (partial text containing entity words, a combination of entity words and action words, or all text in the voice request), which strategy to use can be decided according to specific needs. The retrieval itself can be implemented using, for example, an inverted index together with word- and pinyin-based search; the specific implementation is not limited herein.
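An inverted-index retrieval, as mentioned above, can be sketched as follows. Plain whitespace tokenization is an assumption standing in for real word segmentation, and a production system would also index pinyin:

```python
from collections import defaultdict

# Minimal inverted-index sketch for S221: each scene semantic document
# (here, a control name) is tokenized, and a query recovers every
# document sharing at least one token with it.
DOCS = ["navigation broadcast volume", "system alert tone", "vehicle alert tone"]

def build_index(docs):
    """Map each token to the set of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc.split():
            index[token].add(doc)
    return index

def retrieve(index, query):
    """Return documents sharing any token with the query, sorted."""
    hits = set()
    for token in query.split():
        hits |= index.get(token, set())
    return sorted(hits)

INDEX = build_index(DOCS)
```

For the query "turn off system alert tone", the tokens "system", "alert" and "tone" recover both alert-tone controls; the subsequent recall and sorting steps narrow this down.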
In step S222, a preset recall policy is used to recall the search result, where the preset recall policy includes multiple types, specifically as follows:
Recall strategy 1: Text omission based on a preset list of negligible words
Example 1: Label = "rock", and the text Query of the voice request = "switch to rock mode"; "mode" can be ignored in the current scenario.
Recall strategy 2: The core word must be hit
Example 2: Label = "open map setting", and the text Query of the voice request = "open system setting"; "map" must be hit in the current scene, otherwise a false recall results.
Recall strategy 3: Setting a threshold for recall
Example 3: a threshold X% is set, and a retrieval result reaching the threshold is recalled.
Recall strategy 4: Checking action words or negative intentions in the text
Example 4: Label = "connect first bluetooth", and the text Query of the voice request = "disconnect first bluetooth"; without checking action words or negative intentions, this would be falsely recalled.
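The four recall strategies above can be combined in a single filter, sketched below. The thresholds, word lists and token-overlap score are illustrative assumptions:

```python
# Hedged sketch combining the four recall strategies: negligible-word
# omission, mandatory core words, a recall threshold, and a check for
# negative intentions. All lists and the 0.5 threshold are assumptions.
IGNORABLE = {"mode"}                      # strategy 1
NEGATIVES = {"disconnect", "not"}         # strategy 4

def recall(label, query, core_word=None, threshold=0.5):
    """Return True if the control labelled `label` should be recalled."""
    q_tokens = [t for t in query.split() if t not in IGNORABLE]  # strategy 1
    if core_word and core_word not in q_tokens:                  # strategy 2
        return False
    l_tokens = label.split()
    overlap = len(set(q_tokens) & set(l_tokens)) / len(l_tokens)
    if overlap < threshold:                                      # strategy 3
        return False
    # strategy 4: a negative word in the query but not the label
    if any(t in NEGATIVES for t in q_tokens) and not any(
            t in NEGATIVES for t in l_tokens):
        return False
    return True
```

On the examples above: "switch to rock mode" recalls the "rock" label once "mode" is ignored; "open system setting" is rejected when "map" is a required core word; and "disconnect first bluetooth" is rejected for "connect first bluetooth" by the negative-intention check.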
In the step S222, scoring may adopt multiple scoring modes such as Query matching degree or document matching degree.
The Query matching degree is: matching length / Query backbone length (in words), where the matching length is the length (in words) of the match between the Query and the document.
The document matching degree is: matching length / document length (in words). For specific controls, such as the Point of Interest (POI) lists frequently found in navigation, a specific matching policy such as document matching may be used.
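The two matching degrees can be written out directly. Word-level set intersection is used here as an illustrative stand-in for the real matching procedure:

```python
# Illustrative word-level implementation of the two scores in S222:
#   Query matching degree    = matched length / query length (words)
#   document matching degree = matched length / document length (words)
def match_length(query, doc):
    """Number of words shared by the query and the document."""
    return len(set(query.split()) & set(doc.split()))

def query_match_degree(query, doc):
    return match_length(query, doc) / len(query.split())

def doc_match_degree(query, doc):
    return match_length(query, doc) / len(doc.split())
```

For the query "turn off system alert tone" against the document "system alert tone", three of five query words match, giving a Query matching degree of 0.6 and a document matching degree of 1.0.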
In the step of S223, the preset ordering policy may include:
strategy one: sequencing the scene semantic documents according to the highest scores of all the retrieval strategies;
and (2) strategy two: sequencing the scene semantic documents according to the sum of scores of all retrieval strategies;
strategy three: and sequencing the scene semantic documents according to the sum of the weighted scores of all the retrieval strategies.
The weighted score is calculated as: score = α × document matching degree + (1 − α) × Query matching degree, where α represents a preset score weight parameter.
That is, a sorting strategy is selected as needed, and the corresponding sorting result is obtained.
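The three sorting strategies and the weighted score can be sketched together. The candidate structure (a label mapped to its per-strategy scores) and the value of α are illustrative assumptions:

```python
# Sketch of the three sorting policies in S223 and the weighted score
# score = alpha * doc_match + (1 - alpha) * query_match. For the
# "weighted" policy each candidate is assumed to carry exactly the pair
# [doc_match, query_match].
def weighted_score(doc_match, query_match, alpha=0.7):
    return alpha * doc_match + (1 - alpha) * query_match

def sort_candidates(candidates, policy="max"):
    """candidates: {label: [scores from each retrieval strategy]}."""
    if policy == "max":                     # policy 1: highest score
        key = lambda item: max(item[1])
    elif policy == "sum":                   # policy 2: sum of scores
        key = lambda item: sum(item[1])
    else:                                   # policy 3: weighted sum
        key = lambda item: weighted_score(*item[1])
    return [label for label, _ in
            sorted(candidates.items(), key=key, reverse=True)]
```

Note that the three policies can rank the same candidates differently: a document with one very high strategy score wins under policy one, while a document with uniformly good scores wins under policies two and three.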
In step S224, matching includes exact matching and fuzzy matching. Exact matching means completely matching a scene semantic document; if the voice request Query contains an action word and that action word conforms to the control operation, it is regarded as a complete match. Fuzzy matching means selecting the document with the highest score (if several results in the sorted list share the same score, all of them are selected), and, when action words are present, using them to judge whether the selected control is correct. The matching result includes the operation intention of the operable control, the name of the operable control, and the execution action of the operable control. For example, if the user issues the voice request Query "set the gesture direction to inward" for an operable control named "gesture touch rotation direction" in the displayed GUI, then after step S22 the matching result is: the operation intention is "set the gesture direction to inward", the name of the operable control is "gesture touch rotation direction", and the execution action is "set to inward".
In step S23, the operation of the operable control in response to the voice request is: the operable control named "gesture touch rotation direction" is executed with an action of "set to inward", that is, this operation may be output as a result of the natural language understanding process.
In step S3, the server generates and transmits instructions executable by the vehicle to the vehicle using the natural language understanding processing result output in step S23.
In step S4, the vehicle receives and executes the command, and after the command is executed, the current state of the control named "gesture touch rotation direction" is "inward", and the execution result can be fed back To the user by voice in a TTS (Text-To-Speech) manner.
As can be seen from the above, the user achieves "what is visible can be spoken" operation of the graphical user interface of the vehicle-mounted system, without any physical operation such as touching the screen or pressing keys. Full voice operation while driving allows the user's sight and attention to remain completely on driving, which fully ensures driving safety. Moreover, because the context information of the current vehicle-mounted system graphical user interface is sent to the server, the server can make full use of it to complete natural language understanding during voice interaction; with information of more dimensions added, the user can operate any content on the graphical user interface by voice in the vehicle, improving the interaction quality of the human-computer interaction system.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 5, a block diagram illustrating a structure of an embodiment of a voice interaction system of the present invention may specifically include: the system comprises a vehicle and a server capable of communicating with the vehicle, wherein the vehicle is provided with a request receiving module, an information sending module, an instruction receiving module and an execution feedback module, and the server is provided with a natural language understanding module and an instruction sending module.
The request receiving module is used for receiving a voice request of a user;
the information sending module is used for sending the voice request and the context information of the current vehicle-mounted system graphical user interface to the server;
the natural language understanding module is used for finishing natural language understanding processing of the voice request according to the context information;
the instruction sending module is used for generating a vehicle-executable instruction from the natural language understanding result and sending the instruction to the vehicle;
and the instruction receiving module is used for receiving and executing the instruction, and simultaneously feeding back an execution result to the user through voice by the execution feedback module.
In the voice interaction system, the context information comprises the name and the type of an operable control in the graphical user interface of the current vehicle-mounted system, an action supported by the operable control, a value range of the action and the current state of the operable control.
Specifically, the natural language understanding module includes:
the creating submodule is used for creating a scene semantic space according to the context information;
the understanding submodule is used for carrying out semantic understanding on the voice request and outputting a semantic understanding result;
the processing submodule is used for retrieving, recalling, sequencing and matching the operable control by using the semantic understanding result in the scene semantic space;
and the output submodule is used for outputting the operation of the operable control responding to the voice request as a natural language understanding processing result.
Wherein creating the sub-module comprises:
the receiving unit is used for receiving the context information sent by the vehicle;
a loading unit for loading and analyzing scene elements included in the context information;
and the generating unit is used for generating a scene semantic document according to the scene elements.
Wherein the understanding submodule comprises:
the processing unit is used for performing text preprocessing and text normalization processing on the text in the voice request and then extracting a sentence backbone;
and the output unit is used for understanding the intention of the voice request of the user according to the sentence backbone and outputting a semantic understanding result.
Further, the output unit is further configured to determine a preliminary result of understanding the intention of the user voice request according to the sentence backbone, correct the preliminary result by using a negative word in the sentence backbone, and output the corrected semantic understanding result.
Wherein, the processing submodule includes:
the retrieval unit is used for extracting the text in the voice request to retrieve in the scene semantic document;
the recall unit is used for recalling the retrieval result by using a preset recall strategy and then scoring the matching degree;
the sorting unit is used for sorting the scored retrieval results according to a preset sorting strategy;
the matching unit is used for outputting a matching result according to the sorting result; wherein the matching result comprises the operation intention of the operable control, the name of the operable control and the execution action of the operable control.
Further, the text in the voice request includes all or part of the text in the voice request, and the retrieval unit is specifically configured to perform any one of the following:
extracting entity words in the voice request to search in the scene semantic document;
extracting texts including entity words and action words in the voice request and searching in the scene semantic documents;
or,
all text in the voice request is extracted and retrieved in the scene semantic document.
Further, the recall unit is specifically configured to recall the retrieval result using one or more preset recall strategies, including: omitting text based on a preset list of negligible words, requiring core words to be hit, setting a recall threshold, and checking action words or negative intentions in the text.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
An embodiment of the present invention further provides a vehicle, including:
the voice interaction method comprises a processor, a memory and a computer program which is stored on the memory and can run on the processor, wherein when the computer program is executed by the processor, each process of the voice interaction method embodiment is realized, the same technical effect can be achieved, and the details are not repeated here to avoid repetition.
An embodiment of the present invention further provides a server, including:
the voice interaction method comprises a processor, a memory and a computer program which is stored on the memory and can run on the processor, wherein when the computer program is executed by the processor, each process of the voice interaction method embodiment is realized, the same technical effect can be achieved, and the details are not repeated here to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the voice interaction method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The voice interaction method, the vehicle, the server and the storage medium provided by the invention are described in detail, and the principle and the implementation of the invention are explained by applying specific examples, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.