CN115376513A - Voice interaction method, server and computer readable storage medium - Google Patents

Voice interaction method, server and computer readable storage medium

Info

Publication number
CN115376513A
CN115376513A (application CN202211276398.2A)
Authority
CN
China
Prior art keywords: information, state, state machine, template, matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211276398.2A
Other languages
Chinese (zh)
Other versions
CN115376513B (en)
Inventor
韩传宇
易晖
翁志伟
Current Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202211276398.2A
Publication of CN115376513A
Application granted
Publication of CN115376513B
Priority to PCT/CN2023/125013 (WO2024083128A1)
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60R: VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R16/00: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R16/02: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements
    • B60R16/037: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R16/0373: Voice control
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command

Abstract

The invention discloses a voice interaction method comprising the following steps: receiving a user voice request forwarded by a vehicle after the vehicle's voice function is awakened; loading a state machine configuration template according to the user voice request and parsing it to obtain a parser; performing logic calculation with the parser to obtain a matching state; and updating the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction. In the invention, the vehicle cabin is divided into a plurality of sound zones, and a state machine configuration template is loaded for each voice request forwarded by the vehicle, so that the template can be parsed to obtain a parser. The parser judges whether the current state matches the rules of the state machine configuration template, and the state machine is switched or kept according to the matching result. The configurable template makes it easy for a user to set or change rejection rules according to specific requirements, offering strong flexibility and a better user experience.

Description

Voice interaction method, server and computer readable storage medium
Technical Field
The present invention relates to the field of voice technologies, and in particular, to a voice interaction method, a server, and a computer-readable storage medium.
Background
With the development of autonomous driving technology, vehicles can support voice-controlled services, such as opening a window by voice. In a real in-car scenario, users may speak from multiple sound zones in the vehicle, and not all of that speech is a request to the in-vehicle system. The in-vehicle voice processor must therefore perform rejection: identifying and discarding irrelevant speech, extracting the voice requests actually directed at it, and responding to them.
In the related art, rejection of voice requests is performed only for single-sound-zone scenarios, combining current text information, automatic speech recognition, confidence-indicating speech features, and the like to reject irrelevant voice requests. This cannot meet the requirements of multi-sound-zone voice interaction in a vehicle.
Disclosure of Invention
The invention provides a voice interaction method, a server and a computer readable storage medium.
The voice interaction method comprises the following steps:
receiving a user voice request forwarded by the vehicle after the vehicle's voice function is awakened;
loading a state machine configuration template according to the user voice request and parsing it to obtain a parser;
performing logic calculation with the parser to obtain a matching state;
and updating the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction.
In this way, the vehicle cabin is divided into a plurality of sound zones, and a state machine configuration template is loaded for each voice request forwarded by the vehicle, so that the template can be parsed to obtain a parser. The parser judges whether the current state matches the rules of the state machine configuration template, and the state machine is switched or kept according to the matching result. The configurable template makes it easy for a user to set or change rejection rules according to specific requirements, offering strong flexibility and a better user experience.
Loading a state machine configuration template according to the user voice request and parsing it to obtain a parser includes:
determining a target state machine configuration template among pre-written state machine configuration templates according to the user voice request;
and loading the target state machine template through a template parsing class and parsing it to obtain the parser.
In this way, the template parsing class fills the specific information of the voice request into the state machine configuration template and defines the loading and processing methods to obtain a parser for the corresponding state and logic configuration, facilitating subsequent logic calculation or the introduction of further templates.
Determining a target state machine configuration template among the pre-written state machine configuration templates according to the user voice request includes:
determining matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information of the state machine corresponding to the user voice request;
and matching against the pre-written state machine configuration templates according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine the target state machine configuration template.
In this way, this information of the state machine corresponding to the user's voice request is determined and matched against the pre-written state machine configuration templates to determine the state machine configuration template that conforms to the current state information.
Matching against the pre-written state machine configuration templates according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine the target state machine configuration template includes:
matching against a pre-written state description template according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine a target state description template.
In this way, the information of the specific voice request is matched against a pre-written state description template to determine the state description template that conforms to the current state information.
Matching against the pre-written state machine configuration templates according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine the target state machine configuration template further includes:
matching against a pre-written logic description template according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine a target logic description template.
In this way, the information of the specific voice request is matched against a pre-written logic description template to determine the logic description template that conforms to the current state information.
Performing logic calculation with the parser to obtain the matching state includes:
mapping and computing, through a logic calculation class, the state description template and the logic description template produced by the parser to obtain the matching state.
In this way, the logic calculation class compares the current actual state description template produced by the parser with the constructed logic description template to obtain a matching state, facilitating the subsequent state machine transition.
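As a rough sketch of this step (the function and field names below are illustrative assumptions, not taken from the patent), the logic calculation can be pictured as comparing the state description produced by the parser against the conditions of the logic description, with null entries acting as "no rule set":

```python
def compute_match(state: dict, logic_conditions: dict) -> bool:
    """Return True when every non-None condition equals the
    corresponding value in the current state description."""
    for key, expected in logic_conditions.items():
        if expected is None:          # None means "no rule set" (wildcard)
            continue
        if state.get(key) != expected:
            return False
    return True

# Current state parsed from a voice request vs. one logic rule.
state = {"soundLocation": 1, "soundArea": "LR", "turns": 2}
rule = {"soundLocation": 1, "soundArea": None, "turns": None}
print(compute_match(state, rule))  # True: only soundLocation is constrained
```

A failed comparison would leave the state machine (and hence each zone's rejection processing) unchanged, as described below.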
Updating the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction includes:
updating, through a state machine action class, the rejection processing of each sound zone in the vehicle cabin when the matching state indicates a successful match, to complete the voice interaction.
In this way, when the state machine action class determines from the output of the logic calculation class that the current state information matches the logic rule, the state machine transitions, the rejection processing of each sound zone in the vehicle cabin is updated, and the voice interaction flow is completed.
Updating the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction further includes:
keeping, through the state machine action class, the rejection processing of each sound zone in the vehicle cabin unchanged when the matching state indicates no successful match, to complete the voice interaction.
In this way, when the state machine action class determines from the output of the logic calculation class that the current state information does not match the logic rule, the state machine does not transition, the rejection processing of each sound zone in the vehicle cabin is kept unchanged, and the voice interaction flow is completed.
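Both branches (transition on a successful match, keep the current mode on a failed match) can be sketched together; the class name, mode names, and zone labels below are illustrative assumptions:

```python
class RejectionStateMachine:
    """Illustrative action class: transitions only on a successful match.

    The mode names ("second_rejection", "first_rejection") and the
    per-zone layout are assumptions for illustration.
    """

    def __init__(self, zones):
        # Every zone starts in the more lenient second rejection mode.
        self.modes = {zone: "second_rejection" for zone in zones}

    def apply(self, zone: str, matched: bool, dest: str) -> str:
        if matched:
            self.modes[zone] = dest   # successful match: update rejection mode
        # on no match the current mode is simply kept
        return self.modes[zone]

sm = RejectionStateMachine(["LF", "RF", "LR", "MR", "RR"])
sm.apply("LR", matched=True, dest="first_rejection")
sm.apply("MR", matched=False, dest="first_rejection")
print(sm.modes["LR"], sm.modes["MR"])  # first_rejection second_rejection
```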
The server of the present invention comprises a memory and a processor; the memory stores a computer program that, when executed by the processor, implements the above method.
The computer-readable storage medium of the present invention stores a computer program that, when executed by one or more processors, implements the above-described method.
Additional aspects and advantages of embodiments of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of embodiments of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a voice interaction method of the present invention;
FIG. 2 is a schematic view of a vehicle cabin of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for the purpose of illustrating the embodiments of the present invention and are not to be construed as limiting the embodiments of the present invention.
Referring to fig. 1, the present invention provides a voice interaction method, including:
01: receiving a user voice request forwarded by the vehicle after the vehicle's voice function is awakened;
02: loading a state machine configuration template according to the user voice request and parsing it to obtain a parser;
03: performing logic calculation with the parser to obtain a matching state;
04: and updating the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction.
The invention also provides a server comprising a memory and a processor; the voice interaction method of the invention can be implemented by this server. Specifically, the memory stores a computer program, and the processor is configured to receive a user voice request forwarded by the vehicle after the vehicle's voice function is awakened, load a state machine configuration template according to the user voice request and parse it to obtain a parser, perform logic calculation with the parser to obtain a matching state, and update the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction.
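An end-to-end sketch of steps 01 to 04 might look as follows; every function body is a stand-in for the components the patent describes, and all names and data shapes are assumptions for illustration:

```python
def load_and_parse_template(request: dict) -> dict:
    """Step 02 (stub): a fixed template standing in for the pre-written set.
    Here the condition is 'wake-up from the driver zone (1)'."""
    return {"conditions": {"soundLocation": 1}, "dest": "first_rejection"}

def logic_match(request: dict, template: dict) -> bool:
    """Step 03: logic calculation comparing request info with the conditions."""
    return all(request.get(k) == v for k, v in template["conditions"].items())

def handle_request(request: dict, zone_modes: dict) -> dict:
    """Steps 01-04: take a forwarded request and update zone rejection modes."""
    template = load_and_parse_template(request)
    if logic_match(request, template):
        # Successful match: the requesting zone switches to the target mode.
        zone_modes[request["soundArea"]] = template["dest"]
    return zone_modes

modes = {"LF": "second", "RF": "second", "LR": "second"}
print(handle_request({"soundLocation": 1, "soundArea": "LR"}, modes))
```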
In particular, the voice assistant of the in-vehicle system provides great convenience to users in the cockpit, who can control software or vehicle components through voice interaction. For convenience, the voice assistant may support continuous dialogue, and since the in-vehicle space is a shared environment, the assistant may receive speech from dialogues between a user and the assistant as well as between different users. Semantic rejection rules ensure that the voice assistant gives the same feedback to the same repeated voice request; at the same time, it is desirable that feedback rules for specific voice requests can be generated and modified according to user requirements as conveniently as possible, so that voice interaction serves the user better and the user experience is improved.
It will be appreciated that in a multi-sound-zone continuous dialogue scenario, users at different positions in the cockpit jointly hold multiple rounds of dialogue with the voice assistant after it is awakened. Multiple users may interact freely around the same topic, which is more complicated than the single-sound-zone case, so the semantic rejection rules need to be more fine-grained.
Waking up the vehicle's voice function means waking up the vehicle's voice assistant; the wake-up request may be a wake-up word set by the manufacturer or customized by the user. After the voice assistant is awakened, users in the cockpit can hold several consecutive rounds of dialogue with it. The dialogue ends when a set round threshold is reached or no user voice request is received within a preset time.
The wake-up sound zone is the zone where the user who issued the wake-up request is located. For example, if the driver wakes up the voice assistant, the wake-up sound zone is the driver's zone. The wake-up sound zone information is the zone position information corresponding to the wake-up sound zone.
The dialogue sound zone is the zone of a user whose voice interaction is being picked up by the voice assistant; a zone in which a dialogue is in progress is a dialogue sound zone. For example, if, after the voice assistant is awakened, the driver and the front passenger interact with it in turn, the voice requests from both users are picked up in turn, and both their zones belong to the dialogue sound zones. The dialogue sound zone and the wake-up sound zone may be the same or different.
Rejection processing screens which of the user's voice requests are addressed to the voice assistant, recalls and executes those, and filters out, as noise, speech not addressed to the assistant. The invention provides two rejection processings of different strictness: the stricter one, which recalls only highly relevant voice requests, is the first rejection processing; the more lenient one is the second rejection processing.
The invention introduces a state machine that records the rejection mode of each sound zone during voice interaction and is continuously updated according to the received sound zone information and user voice requests. In an actual vehicle usage scenario, the rejection rules of the voice assistant are not necessarily fixed: once the assistant is awakened, the rejection processing of each sound zone needs to be updated as the interaction progresses. Users can modify the rejection rules as their own requirements change, and the modular state machine configuration template makes it convenient to add, delete, or modify specific rejection rules.
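Such a configurable rejection-mode state machine can be pictured as a small transition table that a user edits independently of the interaction code; the trigger and mode names below are illustrative assumptions:

```python
# Each transition: (trigger, source_mode) -> target_mode. Rules can be
# added, deleted, or modified here without touching the interaction code.
TRANSITIONS = {
    ("front_row_wakeup", "second_rejection"): "first_rejection",
    ("dialogue_timeout", "first_rejection"): "second_rejection",
}

def step(mode: str, trigger: str) -> str:
    """Return the next rejection mode, keeping the current one when no
    transition rule matches (the 'no successful match' branch)."""
    return TRANSITIONS.get((trigger, mode), mode)

mode = "second_rejection"
mode = step(mode, "front_row_wakeup")   # rule matches: switch mode
mode = step(mode, "unknown_event")      # no rule: mode is kept
print(mode)  # first_rejection
```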
In summary, in the present invention, the vehicle cabin is divided into a plurality of sound zones, and a state machine configuration template is loaded for each voice request forwarded by the vehicle, so that the template can be parsed to obtain a parser. The parser judges whether the current state matches the rules of the state machine configuration template, and the state machine is switched or kept according to the matching result. The configurable template makes it easy for a user to set or change rejection rules according to specific requirements, offering strong flexibility and a better user experience.
Step 02 comprises:
021: determining a target state machine configuration template among the pre-written state machine configuration templates according to the user voice request;
022: and loading the target state machine template through the template parsing class and parsing it to obtain a parser.
The processor is configured to determine a target state machine configuration template among the pre-written state machine configuration templates according to the user voice request, and to load the target state machine template through the template parsing class and parse it to obtain the parser.
Specifically, the invention provides a state machine configuration template for the user to configure, comprising a state description template and a logic description template. After the state machine configuration template is filled in, the template parsing class loads the target state machine template and parses it to obtain the parser; the corresponding logic module and state machine transition module can then be loaded from memory so that the state machine can complete the subsequent logic judgment.
Taking the first rejection processing as an example, the configuration item "light_state_template" is a condition set of key-value dictionary (dict) type, i.e. a state template, into which various kinds of label information about a voice request can be filled, including the service rule, the number of response rounds, the rejection sub-label and its confidence, and so on. The configuration item "light_logic_template" is likewise a condition set, i.e. a logic template, into which conditional judgment rule statements over part of that information can be filled. The filled state template and logic template are placed into the self.state_template and self.logic_template members that the template parsing class can parse; the template parsing class defines the functions "load_state_template" and "load_logic_template" to load the user-supplied state template "light_state_template" and logic template "light_logic_template", and finally outputs a parser that handles the correspondence between state and logic and defines the processing function "process_logic_template".
It is to be understood that the parser formed by the template parsing class facilitates further computation by the logic processing class. Besides the state template and the logic template, the parser can encapsulate two or even more further types of template, making the parsing of state and logic even more convenient.
In this way, the template parsing class fills each item of specific information of the voice request into the state machine configuration template and defines the loading and processing methods to obtain a parser for the corresponding state and logic configuration, facilitating subsequent logic calculation or the introduction of further templates.
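Under the naming used above, a minimal Python sketch of the template parsing class might look as follows; only the function names come from the description, while the class layout and data shapes are assumptions:

```python
class TemplateParser:
    """Loads a state template and a logic template and acts as the parser."""

    def __init__(self):
        self.state_template = None   # label information about the request
        self.logic_template = None   # conditional judgment rules

    def load_state_template(self, light_state_template: dict):
        # Fill in rounds, rejection sub-label, confidence, etc.
        self.state_template = dict(light_state_template)

    def load_logic_template(self, light_logic_template: dict):
        # Rules over part of the state information; None means unconstrained.
        self.logic_template = dict(light_logic_template)

    def process_logic_template(self) -> bool:
        """Relate state to logic: every non-None rule must hold."""
        return all(
            self.state_template.get(key) == expected
            for key, expected in self.logic_template.items()
            if expected is not None
        )

parser = TemplateParser()
parser.load_state_template({"turns": 1, "rejSublabel": "valid"})
parser.load_logic_template({"turns": 1, "rejSublabel": None})
print(parser.process_logic_template())  # True
```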
Step 021 includes:
0211: determining matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information of the state machine corresponding to the user voice request;
0212: and matching against the pre-written state machine configuration templates according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine the target state machine configuration template.
The processor is configured to determine the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information of the state machine corresponding to the user voice request, and to match against the pre-written state machine configuration templates according to this information to determine the target state machine configuration template.
Referring to fig. 2 and taking a five-seat vehicle as an example, the vehicle cabin can be divided into five sound zones: the driver zone, the front passenger zone, the rear-left zone, the rear-middle zone, and the rear-right zone. When configuring the state machine template, one or more sound zones can be selected as state condition content, and multiple voice pickup devices can be arranged in the cabin so that the sound zone of the user issuing a voice request can be determined from the acquired state information of the request.
In particular, condition variables must exist in the state machine configuration template to be filled with the static description of specific variables, forming a state trigger. The state trigger name may be set to "triggerName", of string (str) type. A key-value dictionary (dict) type condition set can be established, named "triggerDetail", into which unordered, parallel state variable information can be filled.
The matching-round information represents the number of voice requests issued by the user in the corresponding sound zone after the voice assistant wakes up. The variable name may be set to "turns", of integer (int) type.
The wake-up sound zone information is the zone position information corresponding to the wake-up sound zone, i.e. the zone of the user who issued the wake-up request. The variable name may be set to "soundLocation", of integer (int) type.
The dialogue sound zone is the zone of a user whose voice interaction is being picked up by the voice assistant; a zone in which a dialogue is in progress is a dialogue sound zone. The variable name may be set to "soundArea", of string (str) type.
The rejection sub-label information distinguishes valid voice requests from invalid ones; whether a voice request is valid is determined by the rejection mode of the state machine. The variable name may be set to "rejSublabel", of string (str) type.
The rejection sub-label confidence information characterizes the confidence of the rejection sub-label. The variable name may be set to "rejConf", of floating point (float) type.
The rejection mode state information indicates the rejection processing state of the state machine for a voice request and includes the current state and the target state. The variable names may be set to "source" and "dest" respectively, of string (str) type.
The acquired voice request is matched against the pre-written state machine configuration templates to determine the target state machine configuration template corresponding to the current voice request.
In one example, the user requirement is "wake up from the front row; the rear rows enter the first rejection processing". The variables that need to be specifically configured in the state machine template are the wake-up sound zone information "soundLocation", the dialogue sound zone information "soundArea", and the target rejection processing state "dest"; the rejection sub-label, its confidence, the matching-round information, and the current rejection processing may be left in any state. Specifically: { "source": "*", "triggerDetail": { "turns": null, "rejSublabel": null, "rejConf": null } }, where "source": "*" represents no constraint on the current rejection mode, and "turns": null, "rejSublabel": null, "rejConf": null represent that no rules are set for the matching rounds, the rejection sub-label, or its confidence.
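A small matcher can exercise a template of this shape; the concrete field values and the wildcard handling below are illustrative assumptions consistent with the example:

```python
TEMPLATE = {
    "source": "*",                       # no constraint on current mode
    "soundLocation": [1, 2],             # wake-up from the front row
    "dest": "first_rejection",           # target mode for the rear rows
    "triggerDetail": {"turns": None, "rejSublabel": None, "rejConf": None},
}

def matches(request: dict) -> bool:
    """Check a request's info against the template; None/'*' mean 'any'."""
    if TEMPLATE["source"] != "*" and request.get("source") != TEMPLATE["source"]:
        return False
    if request.get("soundLocation") not in TEMPLATE["soundLocation"]:
        return False
    detail = TEMPLATE["triggerDetail"]
    return all(v is None or request.get(k) == v for k, v in detail.items())

print(matches({"soundLocation": 1, "source": "second_rejection"}))  # True
print(matches({"soundLocation": 3, "source": "second_rejection"}))  # False
```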
In this way, the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information of the state machine corresponding to the user's voice request are determined, and matching is performed against the pre-written state machine configuration templates according to this information to determine the state machine configuration template that conforms to the current state information.
Step 0212 includes:
02121: matching against a pre-written state description template according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine a target state description template.
The processor is configured to match against a pre-written state description template according to the matching-round information, wake-up sound zone information, dialogue sound zone information, rejection sub-label confidence information, and current rejection mode state information to determine a target state description template.
Specifically, the state machine configuration template needs to be filled with the specific static description of each state variable in the current scene to form a state trigger; that is, once the state trigger is filled with the specific static description conditions of the state variables, it can be determined whether the current scene state satisfies the state machine transition condition. The state trigger name may be set to "triggerName", of string (str) type. A key-value dictionary (dict) type data set can also be established, named "triggerDetail", into which unordered, parallel state variable information can be filled.
The matching-round information represents the number of voice requests issued by users in a sound zone after the voice assistant wakes up. The variable name may be set to "turns", of integer (int) type; that is, the variable may take any natural number.
In particular, in order to distinguish the wake-up sound zone from the dialogue sound zone, different identification methods can be used for each. For example, in the present invention, for the five sound zones of the main driver, the front passenger, the left rear, the middle and the right rear: if a zone is the wake-up sound zone, it is represented by the integers (int) 1, 2, 3, 4 and 5 respectively; if a zone is the current dialogue sound zone, it is represented by the strings (str) LF, RF, LR, MR and RR respectively.
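The dual encoding above can be sketched as two lookup tables; the zone names and codes follow the text, while the table layout and helper function are illustrative assumptions:

```python
# Wake-up zones are encoded as integers and dialogue zones as strings,
# so the two roles can be told apart within one configuration.
WAKEUP_ZONE = {"main_driver": 1, "copilot": 2,
               "left_rear": 3, "middle": 4, "right_rear": 5}
DIALOG_ZONE = {"main_driver": "LF", "copilot": "RF",
               "left_rear": "LR", "middle": "MR", "right_rear": "RR"}

def zone_code(name: str, waking: bool):
    """Return the integer code for a wake-up zone, or the string code
    for a dialogue zone (helper name is an assumption)."""
    return WAKEUP_ZONE[name] if waking else DIALOG_ZONE[name]
```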
The wake-up sound zone information is the sound zone position information corresponding to the wake-up sound zone, where the wake-up sound zone is the sound zone in which the user who sent the wake-up voice request is located. The variable name may be set to "soundLocation", and, as noted above, the type is integer (int). Specifically, if the main driver wakes the voice assistant, the wake-up zone is the main-driver zone and may be represented in the state machine as "soundLocation": 1. It can be understood that, when configuring the same state machine template, multiple sound zones can also be selected as the wake-up sound zone condition; for example, if the condition needs to be either the main driver or the front passenger as the wake-up sound zone, i.e., any front-row wake-up meets the user requirement, it can be represented in the state machine as "soundLocation": "1/2".
The dialogue sound zone is the sound zone position of the user currently performing voice interaction as acquired by the voice assistant; the sound zone in which a dialogue is taking place is the dialogue sound zone. The variable name may be set to "soundArea", and, as noted above, the type is string (str). Specifically, if the left rear, middle and right rear sound zones are in dialogue simultaneously, the dialogue sound zone is all rear-row sound zones, which can be represented in the state machine as "soundArea": "LR/MR/RR".
The rejection sub-tag information comprises valid voice requests and invalid voice requests; whether a voice request is valid or invalid is determined by the rejection processing of the state machine. The variable name may be set to "rejSublabel", with type string (str). In the present invention, there are two values: "clear" for a valid voice request and "noise" for an invalid voice request.
The rejection sub-tag confidence information characterizes the confidence level of the rejection sub-tag. The variable name may be set to "rejConf", and the type is floating point (float). In the present invention, it may take floating-point values from 0.00 to 1.00.
The rejection mode state information is information indicating the rejection processing state of the state machine for any voice request, and includes a current state and a target state. The variable names may be set to "source" and "dest" respectively, with type string (str). In the present invention, for example, the first rejection processing "light" and the second rejection processing "lose" are such states.
The acquired voice request is matched against the pre-written state machine configuration templates to determine the target state machine configuration template corresponding to the current voice request.
In one example, the user requirement "if the front row wakes up, the back row enters the first rejection processing" is specifically configured as {"triggerName": "front_wakeup", "source": "*", "triggerDetail": {"soundLocation": "1/2", "soundArea": "LR/RR/MR", "turns": "null", "rejSublabel": "null", "rejConf": "null"}, "dest": "light"}. Here, "soundLocation": "1/2" represents front-row wake-up; "soundArea": "LR/RR/MR" represents that the current speaker is in the back row; "turns": "null", "rejSublabel": "null" and "rejConf": "null" represent that these rules are not set; "source": "*" represents that the current state can be any state; "dest": "light" represents that the target state is the first rejection processing.
In the interaction process, according to the acquired voice request woken up from the front row, the template written for the requirement "if the front row wakes up, the back row enters the first rejection processing" can be matched as the target state machine configuration template.
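A minimal sketch of this template matching step, under the assumptions that a condition such as "1/2" means "any of the '/'-separated values" and that "null" leaves a variable unconstrained:

```python
def zone_matches(condition, value) -> bool:
    """True if the condition is unset ("null") or the value appears in the
    '/'-separated list of allowed values (assumed semantics)."""
    if condition == "null":
        return True
    return str(value) in condition.split("/")

def matches_template(template: dict, request: dict) -> bool:
    """Check whether a voice request's state variables satisfy every
    condition in the template's triggerDetail table."""
    return all(
        zone_matches(cond, request.get(key))
        for key, cond in template["triggerDetail"].items()
    )

front_wakeup = {
    "triggerName": "front_wakeup",
    "triggerDetail": {"soundLocation": "1/2", "soundArea": "LR/RR/MR",
                      "turns": "null", "rejSublabel": "null", "rejConf": "null"},
    "source": "*", "dest": "light",
}

# A request woken up from the main-driver zone (1) with a back-row speaker (LR)
request = {"soundLocation": 1, "soundArea": "LR"}
assert matches_template(front_wakeup, request)
```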
In this way, the information related to a specific voice request is matched with a pre-written state description template to determine a state description template that matches the current state information.
Step 02121 includes:
021211: and matching in a pre-written logic description template according to the matching round information, the wake-up sound zone information, the dialogue sound zone information, the rejection sub-tag confidence information and the current rejection mode state information, so as to determine a target logic description template.
The processor is used for matching in a pre-written logic description template according to the matching round information, the wake-up sound zone information, the dialogue sound zone information, the rejection sub-tag confidence information and the current rejection mode state information, so as to determine a target logic description template.
Specifically, the logic description template needs to be filled with static descriptions of the specific logic rule variables, and these rule-variable descriptions should correspond one-to-one with the state variable items to form a state trigger; that is, after the state trigger is filled with the specific static description conditions, it can be determined whether the current scene state meets the state machine jump condition. The state trigger name may be set to "triggerName", with type string (str). A key-value dictionary (dict) data set can also be established, named "triggerDetail", which can be filled with unordered, parallel logic rule information.
For the variable names of the matching round information, the wakeup tone zone information, the dialogue tone zone information, the rejection sub-tag confidence information, and the current rejection mode state information, reference is made to the disclosure of step 02121, which is not described herein again.
In particular, all rules contained in the key-value dictionary (dict) of the logic description template are logic judgment statements, so all types of logic rule judgment results for the voice request are set as string (str) variables. Four logic judgment results can be set: presence ("exist"), less than ("less_than"), not less than ("more_than") and equal to ("equal"). Of these, "less_than" and "more_than" support only numeric-type judgments, including integer (int) and floating point (float), while "exist" and "equal" support both numeric-type and string-type judgments.
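The four judgment types can be sketched as small Python functions; the names "exist", "less_than", "more_than" and "equal" come from the text, and the comparison semantics ("more_than" as "not less than", i.e. >=) follow the description above; the dispatch table is an illustrative assumption:

```python
def exist(value) -> bool:
    """'exist': the state variable must be present (numeric or string)."""
    return value is not None and value != "null"

def less_than(value, limit) -> bool:
    """'less_than': numeric-only comparison (int or float)."""
    return float(value) < float(limit)

def more_than(value, limit) -> bool:
    """'more_than': not less than; numeric-only (int or float)."""
    return float(value) >= float(limit)

def equal(value, expected) -> bool:
    """'equal': exact match; supports numeric and string types."""
    return value == expected

# Dispatch table mapping rule names to judgment functions (assumed layout).
JUDGMENTS = {"exist": exist, "less_than": less_than,
             "more_than": more_than, "equal": equal}
```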
The acquired voice request is matched against the pre-written logic description templates to determine the target logic description template corresponding to the current voice request.
In one example, the user requirement "if the front row wakes up, the back row enters the first rejection processing" is specifically configured as {"triggerName": "front_wakeup", "source": "null", "triggerDetail": {"soundLocation": "exist", "soundArea": "exist", "turns": "null", "rejSublabel": "null", "rejConf": "null"}, "dest": "light"}. Here, "soundLocation": "exist" represents that the "soundLocation": "1/2" of the wake-up zone state template must be present; "soundArea": "exist" represents that the "soundArea": "LR/RR/MR" of the current speaker zone state template must be present; "turns": "null", "rejSublabel": "null" and "rejConf": "null" represent that these rules are not set; "source": "null" represents that the rule is not set, i.e., the current state can be any state; "dest": "light" represents that the current target state is the first rejection processing.
In the interaction process, according to the acquired voice request woken up from the front row, the template written for the requirement "if the front row wakes up, the back row enters the first rejection processing" can be matched as the target logic description template.
In this way, the information related to the specific voice request is matched against the pre-written logic description templates to determine the logic description template that conforms to the current state information.
Step 03 comprises:
031: and mapping the state description template and the logic description template analyzed by the analyzer through a logic calculation class and calculating to obtain a matching state.
And the processor is used for mapping the state description template and the logic description template analyzed by the analyzer through the logic calculation class and calculating to obtain a matching state.
Specifically, in the present invention, the logic calculation class is mapped, according to a one-to-one correspondence rule, to the state description template and the logic description template analyzed by the parser, and the logic calculation is performed to obtain the matching state.
Taking as an example the jump for the requirement "front-row wake-up, back row enters the first rejection processing": first, the rules in the "triggerDetail" table of the logic description template "light_local_template" that are not "null" are obtained, namely the wake-up sound zone "soundLocation" variable and the dialogue sound zone "soundArea" variable. The logic calculation class can define the functions "exist", "less_than", "more_than" and "equal" to carry out the logic judgments, and map the state template and the logic template according to the one-to-one correspondence principle. In this example, it is determined whether the values of the actual "soundLocation" and "soundArea" variables of the current system fall within the range of limit values of the logic description template "light_local_template". If both are satisfied, the output result "match" can be stored in the string (str) data "self.result"; if not, another result can be output, or no result is output and the processing flow is exited directly.
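A sketch of such a logic calculation class is shown below. The "exist" judgment, the one-to-one mapping of state and logic templates, and the "self.result" attribute follow the text; the class name and the remaining plumbing are illustrative assumptions:

```python
class LogicCalculator:
    """Maps each logic-template rule onto the corresponding state-template
    value and evaluates it; a successful evaluation stores "match" in
    self.result, otherwise self.result stays empty and the flow exits."""

    def __init__(self, state_template: dict, logic_template: dict):
        self.state = state_template["triggerDetail"]
        self.logic = logic_template["triggerDetail"]
        self.result = ""

    @staticmethod
    def exist(actual, allowed) -> bool:
        # The actual system value must appear in the '/'-separated allowed
        # list taken from the state template (e.g. "1/2" or "LR/RR/MR").
        return str(actual) in str(allowed).split("/")

    def compute(self, actual_state: dict) -> str:
        for key, rule in self.logic.items():
            if rule == "null":  # rule not set: no constraint on this variable
                continue
            if rule == "exist" and not self.exist(actual_state.get(key), self.state[key]):
                return self.result  # no match: result stays "", exit the flow
        self.result = "match"
        return self.result

# Templates for the "front-row wake-up, back row enters first rejection" example.
state_tpl = {"triggerDetail": {"soundLocation": "1/2", "soundArea": "LR/RR/MR"}}
logic_tpl = {"triggerDetail": {"soundLocation": "exist", "soundArea": "exist",
                               "turns": "null", "rejSublabel": "null", "rejConf": "null"}}
```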
Further, the computational methods of the logical compute classes may increase as the processing items increase.
Therefore, the logic calculation module can compare and calculate the current actual state description template analyzed by the analyzer with the constructed logic description template to obtain a matching state so as to facilitate the jump of a subsequent state machine.
Step 04 comprises:
041: and updating rejection processing of each sound zone in the vehicle cabin to finish voice interaction under the condition that the matching state is successful through the state machine action class.
And the processor updates the rejection processing of each sound zone in the vehicle cabin to finish voice interaction under the condition that the matching state is successful through the state machine action class.
Specifically, when the matching state is successful, the state machine action class updates the rejection processing of each sound zone in the vehicle cabin to complete the voice interaction.
Taking as an example the jump for the requirement "front-row wake-up, back row enters the first rejection processing": the state machine action class can define the functions "get_parser", "get_transition" and "get_trigger" to respectively obtain the parser, the current jump action and the jump state. When the matching state is a successful match, i.e., when the logic calculation class outputs the result "match" in "self.result", the state machine executes the jump, the rejection processing of each sound zone in the vehicle cabin is updated, and the voice interaction is completed.
Further, the state machine jump performed through the "get_transition" function can be implemented using the Machine class of Python's transitions toolkit.
Therefore, when the state machine action class determines from the output of the logic calculation class that the current state information matches the logic rules, the state of the state machine can be converted, the rejection processing of each sound zone in the vehicle cabin is updated, and the voice interaction process is completed.
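A minimal sketch of the state machine action class. The method names "get_parser", "get_transition" and "get_trigger" come from the text; the internal bookkeeping is an assumption (in production, as the text notes, the jump itself could be delegated to the Machine class of Python's transitions package):

```python
class StateMachineAction:
    """Executes (or skips) the rejection-processing jump depending on the
    result produced by the logic calculation class."""

    def __init__(self, parser, transition):
        self._parser = parser          # parser from the template parsing step
        self._transition = transition  # e.g. {"source": "*", "dest": "light"}
        self.state = "none"            # current rejection mode state (assumed initial value)
        self.result = ""               # filled in by the logic calculation class

    def get_parser(self):
        return self._parser

    def get_transition(self):
        return self._transition

    def get_trigger(self):
        # The jump state is the transition's target rejection state.
        return self._transition["dest"]

    def apply(self) -> str:
        # Jump only on a successful match; otherwise keep the current state.
        if self.result == "match":
            self.state = self._transition["dest"]
        return self.state
```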
Step 04 comprises:
042: and keeping, through the state machine action class, the rejection processing of each sound zone in the vehicle cabin when the matching state is an unsuccessful match, so as to complete the voice interaction.
The processor is used for keeping, through the state machine action class, the rejection processing of each sound zone in the vehicle cabin when the matching state is an unsuccessful match, so as to complete the voice interaction.
Specifically, when the matching state is an unsuccessful match, the state machine action class does not update the rejection processing of the sound zones, and the state machine maintains the current state to complete the voice interaction.
Taking as an example the jump for the requirement "front-row wake-up, back row enters the first rejection processing": the state machine action class can define the functions "get_parser", "get_transition" and "get_trigger" to respectively obtain the parser, the current jump action and the jump state. When the matching state is an unsuccessful match, i.e., when the output result in "self.result" is "not match" or no result is output, the state machine does not execute the jump and keeps its current state.
Further, when the matching state is an unsuccessful match and the logic calculation output result "self.result" is "not match", the jump flow of the state machine can be exited directly without outputting any other matching result, thereby completing the voice interaction.
Therefore, when the state machine action class determines from the output of the logic calculation class that the current state information does not match the logic rules, the state of the state machine is not converted, the rejection processing of each sound zone in the vehicle cabin is kept, and the voice interaction process is completed.
The computer-readable storage medium of the present invention stores a computer program that, when executed by one or more processors, implements the above-described method.
The following illustrates the configuration of the state and logic templates with two scenario examples:
Example one: the user requirement and specific configuration are shown in table 1. In the state template setting, "soundLocation": "1/2" represents front-row wake-up; "soundArea": "LR/RR/MR" represents that the current speaker is in the back row, i.e., the dialogue zone is left rear, middle or right rear; "turns": 2 represents 2 matching rounds, i.e., the number of voice requests sent from the back-row sound zone is 2; "rejSublabel": "clear" represents that only valid voice requests are counted by the voice assistant; "source": "*" represents that the current voice assistant can be in any rejection mode state; "dest": "lose" represents that the current voice assistant target state is the second rejection processing, i.e., no matter which rejection processing the current voice assistant is under, if the current state meets the template requirements, it needs to keep or jump to the second rejection processing. In the logic template configuration, "soundLocation": "exist" represents that the "soundLocation": "1/2" of the wake-up zone state template must be present, i.e., the current wake-up zone is in the front row; "soundArea": "exist" represents that the "soundArea": "LR/RR/MR" of the current speaker zone state template must be present, i.e., the current dialogue zone is in the back row; "turns": "more_than" represents a not-less-than match, i.e., the number of back-row dialogue rounds in the current state needs to reach the 2 or more set in the state template; "rejSublabel": "equal" represents a complete match, i.e., only valid voice requests are counted; "source": "null" represents that the rule is not set, i.e., the existing rejection processing is not limited; "dest": "lose" represents that the current target state is the second rejection processing.
TABLE 1 (state template and logic template configuration for example one)
Example two: the user requirements and specific configurations are shown in table 2. In the state template setting, 1/2 represents the front row awakening; "soundArea" means "LR/RR/MR" that the speaker is currently in the back row, i.e. the dialogue area is left back, middle and right back; turn 3 represents 3 rounds of matching, namely the number of times of sending voice requests to the voice area in the back row is 3; "rejSublel": noise "represents that the invalid voice request is recognized and counted by the voice assistant; "source" represents that the current voice assistant can be in any rejection processing state; "dest": light "represents that the current voice assistant target state is the first rejection process, i.e. no matter what rejection process the current voice assistant is under, the current state needs to be maintained or jump to the first rejection process if the current state meets the template requirements. In the logical template configuration, "soundLocation": exist "represents the" soundLocation ": 1/2" of the current wake-up zone existing state template, i.e. the current wake-up zone is in the front row; "sounda": exist "represents the" sounda ": LR/RR/MR" of the current speaker zone presence status template, i.e. the current speaker zone is in the back row; "turn" more _ than means not less than match, i.e. the number of sessions performed in the back row under the current state needs to reach 2 or more times set in the state template; "rejSublabel" equal represents a complete match, i.e., only invalid voice requests are counted; null represents that the rule is not set, namely the existing rejection processing is not limited; "dest": light "represents that the current target state is the first rejection processing, i.e. no matter what rejection processing the current voice assistant is under, the current state needs to be maintained or jump to the first rejection processing if the current state meets the template requirements.
TABLE 2 (state template and logic template configuration for example two)
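The two example configurations above can be written out side by side as Python dictionaries; the values follow the descriptions of table 1 and table 2, while the flat dictionary layout is an illustrative assumption:

```python
# Example one: valid back-row requests; after 2 or more rounds,
# keep/jump to the second rejection processing ("lose").
example_one_state = {
    "soundLocation": "1/2", "soundArea": "LR/RR/MR",
    "turns": 2, "rejSublabel": "clear", "source": "*", "dest": "lose",
}
example_one_logic = {
    "soundLocation": "exist", "soundArea": "exist",
    "turns": "more_than", "rejSublabel": "equal", "source": "null", "dest": "lose",
}

# Example two: invalid back-row requests ("noise"); after 3 or more rounds,
# keep/jump to the first rejection processing ("light").
example_two_state = {
    "soundLocation": "1/2", "soundArea": "LR/RR/MR",
    "turns": 3, "rejSublabel": "noise", "source": "*", "dest": "light",
}
example_two_logic = {
    "soundLocation": "exist", "soundArea": "exist",
    "turns": "more_than", "rejSublabel": "equal", "source": "null", "dest": "light",
}

# The two scenarios differ only in the round threshold, the counted
# sub-label, and the target rejection state.
diff = {k for k in example_one_state if example_one_state[k] != example_two_state[k]}
assert diff == {"turns", "rejSublabel", "dest"}
```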
In the description of the present specification, references to the description of the terms "above," "specifically," "understandably," "further," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and those skilled in the art can make changes, modifications, substitutions and alterations to the above embodiments within the scope of the present invention.

Claims (10)

1. A method of voice interaction, comprising:
receiving a user voice request forwarded by a vehicle after a vehicle voice function is woken up;
loading a state machine configuration template according to the user voice request to analyze the state machine configuration template to obtain an analyzer;
performing logic calculation according to the analyzer to obtain a matching state;
and updating the rejection processing of each sound zone in the vehicle cabin according to the matching state so as to finish voice interaction.
2. The method of claim 1, wherein loading a state machine configuration template according to the user voice request to parse the state machine configuration template to obtain a parser comprises:
determining a target state machine configuration template in a pre-compiled state machine configuration template according to the user voice request;
and loading the target state machine template through a template analysis class and analyzing the target state machine template to obtain the analyzer.
3. The voice interaction method of claim 2, wherein determining a target state machine configuration template from the user voice request among pre-compiled state machine configuration templates comprises:
determining matching round information, wake-up sound zone information, dialogue sound zone information, rejection sub-tag confidence information and current rejection mode state information of a state machine corresponding to the user voice request;
and matching in the pre-written state machine configuration template according to the matching round information, the wake-up sound zone information, the dialogue sound zone information, the rejection sub-tag confidence information and the current rejection mode state information to determine the target state machine configuration template.
4. The voice interaction method according to claim 3, wherein the matching in the pre-written state machine configuration template according to the matching round information, the wake-up sound zone information, the dialogue sound zone information, the rejection sub-tag confidence information and the current rejection mode state information to determine the target state machine configuration template comprises:
matching in a pre-written state description template according to the matching round information, the wake-up sound zone information, the dialogue sound zone information, the rejection sub-tag confidence information and the current rejection mode state information to determine a target state description template.
5. The voice interaction method according to claim 4, wherein the matching in the pre-written state description template to determine the target state description template comprises:
matching in a pre-written logic description template according to the matching round information, the wake-up sound zone information, the dialogue sound zone information, the rejection sub-tag confidence information and the current rejection mode state information to determine a target logic description template.
6. The voice interaction method of claim 5, wherein the performing a logical calculation according to the parser to obtain a matching state comprises:
and mapping the state description template and the logic description template analyzed by the analyzer through a logic calculation class and calculating to obtain the matching state.
7. The voice interaction method of claim 6, wherein the updating rejection processing of each zone in the vehicle cabin according to the matching state to complete voice interaction comprises:
and updating rejection processing of each sound zone in the vehicle cabin through the action class of the state machine under the condition that the matching state is successful so as to finish voice interaction.
8. The voice interaction method of claim 6, wherein the updating rejection processing of each zone in the vehicle cabin according to the matching state to complete voice interaction comprises:
and keeping rejection processing of each sound zone in the vehicle cabin through a state machine action class under the condition that the matching state is not matched successfully so as to finish voice interaction.
9. A server, characterized in that the server comprises a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, carries out the method of any one of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by one or more processors, implements the method of any one of claims 1-8.
CN202211276398.2A 2022-10-19 2022-10-19 Voice interaction method, server and computer readable storage medium Active CN115376513B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211276398.2A CN115376513B (en) 2022-10-19 2022-10-19 Voice interaction method, server and computer readable storage medium
PCT/CN2023/125013 WO2024083128A1 (en) 2022-10-19 2023-10-17 Voice interaction method, server, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211276398.2A CN115376513B (en) 2022-10-19 2022-10-19 Voice interaction method, server and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN115376513A true CN115376513A (en) 2022-11-22
CN115376513B CN115376513B (en) 2023-05-12

Family

ID=84072707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211276398.2A Active CN115376513B (en) 2022-10-19 2022-10-19 Voice interaction method, server and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115376513B (en)
WO (1) WO2024083128A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024083128A1 (en) * 2022-10-19 2024-04-25 广州小鹏汽车科技有限公司 Voice interaction method, server, and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1301922A1 (en) * 2000-07-18 2003-04-16 QUALCOMM Incorporated System and method for voice recognition with a plurality of voice recognition engines
CN107665708A (en) * 2016-07-29 2018-02-06 科大讯飞股份有限公司 Intelligent sound exchange method and system
CN113330513A (en) * 2021-04-20 2021-08-31 华为技术有限公司 Voice information processing method and device
CN113990300A (en) * 2021-12-27 2022-01-28 广州小鹏汽车科技有限公司 Voice interaction method, vehicle, server and computer-readable storage medium
CN114155853A (en) * 2021-12-08 2022-03-08 斑马网络技术有限公司 Rejection method, device, equipment and storage medium
CN114267347A (en) * 2021-11-01 2022-04-01 惠州市德赛西威汽车电子股份有限公司 Multi-mode rejection method and system based on intelligent voice interaction

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6480823B1 (en) * 1998-03-24 2002-11-12 Matsushita Electric Industrial Co., Ltd. Speech detection for noisy conditions
CN103186416B (en) * 2011-12-29 2016-06-22 比亚迪股份有限公司 Build the method for multi-task multi-branch process, state machine and the method for execution
US10462619B2 (en) * 2016-06-08 2019-10-29 Google Llc Providing a personal assistant module with a selectively-traversable state machine
CN107316643B (en) * 2017-07-04 2021-08-17 科大讯飞股份有限公司 Voice interaction method and device
CN111008532B (en) * 2019-12-12 2023-09-12 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer readable storage medium
CN111063350B (en) * 2019-12-17 2022-10-21 思必驰科技股份有限公司 Voice interaction state machine based on task stack and implementation method thereof
CN112164401B (en) * 2020-09-18 2022-03-18 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN112927692B (en) * 2021-02-24 2023-06-16 福建升腾资讯有限公司 Automatic language interaction method, device, equipment and medium
CN115376513B (en) * 2022-10-19 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1301922A1 (en) * 2000-07-18 2003-04-16 QUALCOMM Incorporated System and method for voice recognition with a plurality of voice recognition engines
CN107665708A (en) * 2016-07-29 2018-02-06 科大讯飞股份有限公司 Intelligent sound exchange method and system
CN113330513A (en) * 2021-04-20 2021-08-31 华为技术有限公司 Voice information processing method and device
CN114267347A (en) * 2021-11-01 2022-04-01 惠州市德赛西威汽车电子股份有限公司 Multi-mode rejection method and system based on intelligent voice interaction
CN114155853A (en) * 2021-12-08 2022-03-08 斑马网络技术有限公司 Rejection method, device, equipment and storage medium
CN113990300A (en) * 2021-12-27 2022-01-28 广州小鹏汽车科技有限公司 Voice interaction method, vehicle, server and computer-readable storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024083128A1 (en) * 2022-10-19 2024-04-25 广州小鹏汽车科技有限公司 Voice interaction method, server, and computer readable storage medium

Also Published As

Publication number Publication date
WO2024083128A1 (en) 2024-04-25
CN115376513B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN109145204B (en) Portrait label generation and use method and system
CN110472030A (en) Man-machine interaction method, device and electronic equipment
US20060155546A1 (en) Method and system for controlling input modalities in a multimodal dialog system
CN115376513A (en) Voice interaction method, server and computer readable storage medium
WO2021012649A1 (en) Method and device for expanding question and answer sample
EP4086893A1 (en) Natural language understanding method and device, vehicle and medium
CN109003611A (en) Method, apparatus, equipment and medium for vehicle audio control
WO2023134380A1 (en) Interaction method, server, and storage medium
CN114822533B (en) Voice interaction method, model training method, electronic device and storage medium
CN112837683B (en) Voice service method and device
CN116978368B (en) Wake-up word detection method and related device
CN115509572A (en) Method for dynamically configuring business logic, cloud platform, vehicle and storage medium
CN116486815A (en) Vehicle-mounted voice signal processing method and device
CN115662400A (en) Processing method, device and equipment for voice interaction data of vehicle machine and storage medium
CN115905293A (en) Switching method and device of job execution engine
CN115220922A (en) Vehicle application program running method and device and vehicle
CN113012687A (en) Information interaction method and device and electronic equipment
Lecœuche Learning optimal dialogue management rules by using reinforcement learning and inductive logic programming
US20050165601A1 (en) Method and apparatus for determining when a user has ceased inputting data
Kreutel et al. Context-dependent interpretation and implicit dialogue acts
CN116679758B (en) Unmanned aerial vehicle scheduling method, unmanned aerial vehicle scheduling system, computer and readable storage medium
US11893996B1 (en) Supplemental content output
US11790898B1 (en) Resource selection for processing user inputs
CN117496972B (en) Audio identification method, audio identification device, vehicle and computer equipment
CN117725185B (en) Intelligent dialogue generation method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant