WO2024083128A1 - 语音交互方法、服务器及计算机可读存储介质 - Google Patents

语音交互方法、服务器及计算机可读存储介质

Info

Publication number
WO2024083128A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
state machine
rejection
voice
state
Prior art date
Application number
PCT/CN2023/125013
Other languages
English (en)
French (fr)
Inventor
韩传宇
易晖
翁志伟
Original Assignee
广州小鹏汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司 filed Critical 广州小鹏汽车科技有限公司
Publication of WO2024083128A1 publication Critical patent/WO2024083128A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60RVEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R16/00Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R16/02Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements
    • B60R16/037Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for electric constitutive elements for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R16/0373Voice control
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • the present application relates to the field of voice technology, and in particular to a voice interaction method, a server and a computer-readable storage medium.
  • vehicles can support voice control services, such as voice-controlled window opening.
  • in actual use, users may speak from multiple sound zones in the car, and not all of that speech consists of requests to the vehicle system. This requires the in-vehicle voice processor to reject useless information in all of the captured speech, extract the voice requests addressed to it, and respond.
  • in the related art, rejection processing of voice requests can usually only be applied to single-sound-zone scenarios and therefore cannot meet the needs of multi-zone voice interaction in a vehicle.
  • the present application provides a voice interaction method, a server and a computer-readable storage medium, which can meet the needs of multi-zone voice interaction in a vehicle.
  • the voice interaction method of the present application includes:
  • the rejection processing of each sound zone in the vehicle cabin is updated according to the matching status to complete the voice interaction.
  • the vehicle cabin is divided into multiple audio zones, and the state machine configuration template is loaded for the voice request forwarded by the vehicle, so that the state machine configuration template can be parsed to obtain a parser.
  • the parser can determine the match between the current state and the rules of the state machine configuration template, and then confirm the switching or change of the state machine state according to the match.
  • the configurable template in the state machine is convenient for users to set or change according to specific needs, has strong scalability, and provides a better user experience.
  • the step of loading the state machine configuration template according to the user voice request to parse the state machine configuration template to obtain a parser includes:
  • the target state machine configuration template is loaded through a template parsing class and the target state machine configuration template is parsed to obtain the parser.
  • the template loading class can fill in the specific information of the voice request into the state machine configuration template, and define the loading and processing methods to obtain the parser under the corresponding state and logical configuration, so as to facilitate subsequent logical calculations or the introduction of more templates.
  • the step of determining a target state machine configuration template from a pre-written state machine configuration template according to the user voice request includes:
  • according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information and the current rejection mode state information, a match is performed in the pre-written state machine configuration template to determine the target state machine configuration template.
  • the matching round information, wake-up voice zone information, conversation voice zone information, rejection sub-tag information, rejection sub-tag confidence information and current rejection mode state information of the state machine corresponding to the user's voice request are determined, and the above information of the voice request is matched in the pre-written state machine configuration template to determine the state machine configuration template that matches the current state information.
  • the step of matching the pre-written state machine configuration template to determine the target state machine configuration template according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information, and the current rejection mode state information includes:
  • a match is performed on a pre-written state description template to determine a target state description template.
  • the relevant information of the specific voice request is matched with the pre-written state description template to determine the state description template that matches the current state information.
  • the method further comprises: performing a match in the pre-written state machine configuration template according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information and the current rejection mode state information to determine the target state machine configuration template, including:
  • matching is performed on a pre-written logic description template to determine a target logic description template.
  • the relevant information of the specific voice request is matched with the pre-written logic description template to determine the logic description template that matches the current state information.
  • the performing logical calculation according to the parser to obtain the matching status includes:
  • the state description template and the logic description template parsed by the parser are mapped through a logic calculation class and the matching state is calculated.
  • the logic calculation module can compare and calculate the current actual state description template parsed by the parser with the constructed logic description template to obtain a matching state for subsequent state machine jumps.
  • the updating of the rejection processing of each sound zone in the vehicle cabin according to the matching status to complete the voice interaction includes:
  • the state machine action class updates the rejection processing of each sound zone in the vehicle cabin to complete the voice interaction.
  • the state machine action class determines the current state information and the matching of logic rules according to the output of the logic calculation class, and can convert the state of the state machine, update the rejection processing of each sound zone in the vehicle cabin, and complete the voice interaction process.
  • the updating of the rejection processing of each sound zone in the vehicle cabin according to the matching status to complete the voice interaction includes:
  • the state machine action class maintains the rejection processing of each sound zone in the vehicle cabin to complete the voice interaction.
  • the state machine action class determines that the current state information does not match the logic rule based on the output of the logic calculation class, and the state machine state can be kept unchanged, maintaining the rejection processing of each sound zone in the vehicle cabin to complete the voice interaction process.
  • the state machine configuration template includes various types of label information about the voice request that can be filled in, including business rules, response rounds, rejection sub-labels and confidence information thereof.
  • the state machine configuration template includes conditional judgment rule statements that can be filled in with part of the relevant information about the voice request.
  • the rejection subtag information includes a valid voice request and an invalid voice request, wherein the validity or invalidity of the voice request is determined by the rejection mode of the state machine.
  • the server of the present application includes a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the above method is implemented.
  • the computer-readable storage medium of the present application stores a computer program, and when the computer program is executed by one or more processors, the above method is implemented. Additional aspects and advantages will be given in part in the following description and in part will be obvious from the following description or learned through practice of the embodiments of the present application.
  • FIG1 is a flow chart of the voice interaction method of the present application.
  • FIG. 2 is a schematic diagram of the vehicle cabin of the present application.
  • the present application provides a voice interaction method, including:
  • the present application also provides a server, which includes a memory and a processor.
  • the voice interaction method of the present application can be implemented by the server of the present application.
  • the memory stores a computer program
  • the processor is used to receive a user voice request forwarded by the vehicle after the vehicle voice function is awakened, load a state machine configuration template according to the user voice request to parse the state machine configuration template and obtain a parser, perform logical calculation according to the parser to obtain a matching state, and update the rejection processing of each audio zone in the vehicle cabin according to the matching state to complete the voice interaction.
  • the voice assistant of the in-vehicle system provides many conveniences for users in the cockpit, who can control software or vehicle components in the cockpit through voice interaction. To make interaction convenient, the voice assistant can support continuous dialogue. Since the space in the car is a shared environment, the voice assistant may receive dialogue between different users and the voice assistant, dialogue among different users, and so on. By setting semantic rejection rules, the voice assistant can give the same feedback to the same voice request when it appears again. At the same time, it is desirable that the feedback rules the voice assistant applies to specific voice requests can be modified as conveniently as possible according to user needs, so as to better serve users and enhance the user experience of voice interaction.
  • the vehicle voice wake-up function is to wake up the vehicle's voice assistant.
  • the wake-up voice request can be a wake-up word set by the manufacturer or customized by the user.
  • after the voice assistant is woken up, users in the cabin can have multiple consecutive rounds of conversation with the voice assistant. The conversation ends when it reaches the set round threshold or when no voice request from the user is received within a predetermined time.
  • the wake-up audio zone is the audio zone where the user who issued the wake-up voice request is located. For example, if the driver wakes up the voice assistant, then the wake-up audio zone is the driver's audio zone.
  • the wake-up audio zone information is the audio zone location information corresponding to the wake-up audio zone.
  • the conversation audio zone is the audio zone where the voice assistant obtains the location of the user who is performing voice interaction.
  • the audio zone where the conversation is in progress is the conversation audio zone.
  • the main driver user and the co-driver user interact with the voice assistant successively.
  • the voice requests issued by the main driver user and the co-driver user are successively obtained by the voice assistant, and the audio zones where the main driver user and the co-driver user are located belong to the conversation audio zone.
  • the conversation audio zone and the awakening audio zone can be the same or different.
  • the rejection processing is used to identify, during the interaction, which of the user's voice requests are addressed to the voice assistant, so that they are recalled and executed, and which are not, so that they are filtered out as noise.
  • two rejection processing modes with different rejection degrees are provided: the mode with a high rejection degree, which recalls only highly relevant voice requests, is the first rejection processing, and the mode with a low rejection degree is the second rejection processing.
  • a state machine is introduced.
  • the state machine is used to record the rejection mode of each audio zone during the voice interaction process, and continuously update the state machine according to the corresponding audio zone information received and the user's voice request during the voice interaction process of this application.
  • the user's rejection rule requirements for the voice assistant are not necessarily fixed.
  • after the voice assistant is awakened, the rejection processing of each audio zone needs to be updated as the voice interaction progresses, and users may modify the rejection rules of the voice assistant as their needs change.
  • the modular state machine configuration template ensures that users can easily add, delete or modify the specific rejection rules of the voice assistant.
  • the vehicle cabin is divided into multiple audio zones, and the state machine configuration template is loaded in response to the voice request forwarded by the vehicle, so that the state machine configuration template can be parsed to obtain a parser.
  • the parser can determine the match between the current state and the rules of the state machine configuration template, and thus confirm the switching or change of the state machine state based on the match.
  • the configurable template in the state machine is convenient for users to set or change according to specific needs, has strong scalability, and provides a better user experience.
  • Step 02 includes:
  • 021 Determine a target state machine configuration template from the pre-written state machine configuration templates according to the user voice request;
  • 022 Load the target state machine configuration template through the template parsing class and parse the target state machine configuration template to obtain the parser.
  • the processor is used to determine the target state machine configuration template in the pre-written state machine configuration template according to the user voice request, and to load the target state machine configuration template through the template parsing class and parse the target state machine configuration template to obtain the parser.
  • a state machine configuration template for users to configure, including a state description template (which can be referred to as a state template) and a logic description template (which can be referred to as a logic template).
  • the configuration item "tight_state_template" is a key-value pair list (dict) type condition set, that is, a state template, in which various types of label information about the voice request can be filled in, including business rules, response rounds, rejection sub-labels and their confidence.
  • the configuration item "tight_logical_template" is a key-value pair list (dict) type condition set, that is, a logic template, in which conditional judgment rule statements about part of the relevant information of the voice request can be filled in.
  • the filled-in state template and logic template are placed into the "self.state_template" and "self.logical_template" modules that can be parsed by the template parsing class.
  • the template parsing class defines functions "load_state_template” and “load_logical_template” to load the state template "tight_state_template” and logical template “tight_logical_template” input by the user, and finally processes the correspondence between the state and the logic and defines the processing function "process_logical_template” as the output parser.
  • the parser formed by the template parsing class can facilitate the logic processing class to perform further calculations.
  • in addition to the state template and the logic template, the parser can package and process two or even more types of templates, which makes state and logic parsing more convenient.
  • the template loading class can fill in the specific information of the voice request into the state machine configuration template, and define the loading and processing methods to obtain the parser under the corresponding state and logical configuration, so as to facilitate subsequent logical calculations or the introduction of more templates.
  • Step 021 includes:
  • 0211 Determine the matching round information, wake-up voice zone information, conversation voice zone information, rejection sub-tag information, rejection sub-tag confidence information and current rejection mode state information of the state machine corresponding to the user voice request;
  • 0212 Match the pre-written state machine configuration template according to the above information to determine the target state machine configuration template.
  • the processor is used to determine the matching round information, wake-up voice zone information, conversation voice zone information, rejection sub-tag information, rejection sub-tag confidence information and current rejection mode state information of the state machine corresponding to the user voice request, and match the pre-written state machine configuration template according to the matching round information, wake-up voice zone information, conversation voice zone information, rejection sub-tag information, rejection sub-tag confidence information and current rejection mode state information to determine the target state machine configuration template.
  • the cockpit is divided into different sound zones according to the areas where the user may make sounds. Please refer to Figure 2. Taking a five-seat vehicle as an example, the vehicle cockpit can be divided into five sound zones, including the main driver's sound zone, the co-driver's sound zone, the left side of the rear row, the middle of the rear row, and the right side of the rear row.
  • when configuring the state machine template, one or more sound zones can be selected as state condition configuration content.
  • Multiple voice pickup devices can be set in the cockpit, so as to determine the sound zone position information of the user who issued the voice request based on the acquired state information of the voice request.
  • conditional variables need to exist in the state machine configuration template so that the static description of specific variables can be filled in to form a state trigger.
  • the state trigger name can be set to "triggerName” and the type is string (str).
  • the matching turn information represents the number of times the user in the voice zone issues a voice request after the voice assistant is awakened.
  • the variable name can be set to "turns" and the data type is integer (int).
  • the wake-up sound zone information is the sound zone position information corresponding to the wake-up sound zone, and the wake-up sound zone is also the sound zone position of the user who issued the wake-up voice request.
  • the variable name can be set to "soundLocation", as described above, and the type is an integer (int) class.
  • the conversation sound area is the sound area where the voice assistant obtains the location of the user who is performing voice interaction.
  • the sound area where the conversation is ongoing is the conversation sound area.
  • the variable name can be set to "soundArea", as mentioned above, and the type is a string (str) class.
  • the rejection sub-tag information includes valid voice requests and invalid voice requests; whether a voice request is valid or invalid is determined by the rejection mode of the state machine.
  • the variable name can be set to "rejSublabel" and the type is string (str).
  • the confidence information of the rejected sublabel represents the credibility of the rejected sublabel.
  • the variable name can be set to "rejSublabel" and the type is a float class.
  • the rejection mode state information is information used to indicate the rejection processing state of the state machine for any voice request, including the current state and the target state.
  • the variable names can be set to "source” and “dest” respectively, and the type is string (str) class.
  • the acquired voice request is matched with a pre-written state machine configuration template to determine a target state machine configuration template corresponding to the current voice request.
  • the user requirement is "wake up the front row, and enter the first rejection process in the back row”.
  • the variables that need to be specifically configured in the state machine setting template are the wake-up sound zone information "soundLocation", the conversation sound zone information "soundArea", and the state information "dest" of the target rejection processing, while the rejection sub-label, the rejection sub-label confidence, the matching round information and the current rejection processing can be left unset or set to any state, for example {"source":"*","triggerDetail":{"turns":null,"rejSublabel":null,"rejConf":null}}.
  • "source":"*" means that the state of the current rejection mode is not limited;
  • "turns":null,"rejSublabel":null,"rejConf":null means that the matching round, rejection sub-label and rejection sub-label confidence rules are not set.
  • the matching round information, wake-up voice zone information, conversation voice zone information, rejection sub-tag information, rejection sub-tag confidence information and current rejection mode state information of the state machine corresponding to the user's voice request are determined, and the above information of the voice request is matched in the pre-written state machine configuration template to determine the state machine configuration template that matches the current state information.
  • Step 0212 includes:
  • 02121 Match a pre-written state description template according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information and the current rejection mode state information to determine the target state description template.
  • the processor is used to match the pre-written state description template to determine the target state description template according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information and the current rejection mode state information.
  • the state machine configuration template needs to fill in the specific static description of each state variable in the current scene to form a state trigger. That is, when the static description conditions of the specific state variables are filled in the state trigger, it can be judged whether the current scene state meets the state machine jump condition.
  • the name of the state trigger can be set to "triggerName” and the type is string (str).
  • the matching round information represents the number of voice requests issued by users in the voice zone after the voice assistant is awakened.
  • the variable name can be set to "turns" and the data type is integer (int), that is, the variable can take all natural numbers.
  • in particular, to distinguish the wake-up sound zone from the conversation sound zone, different representations can be used for the five sound zones (driver, co-driver, rear-left, rear-middle and rear-right): if a sound zone is the wake-up sound zone, it can be represented by the integers (int) 1, 2, 3, 4 and 5 respectively; if a sound zone is a current conversation sound zone, it can be represented by the strings (str) LF, RF, LR, MR and RR respectively.
  • the wake-up sound zone information is the sound zone position information corresponding to the wake-up sound zone, and the wake-up sound zone is also the sound zone position of the user who issued the wake-up voice request.
  • the variable name can be set to "soundLocation", as mentioned above, and the type is an integer (int) class. Among them, if the driver wakes up the voice assistant, then the wake-up sound zone is the driver's sound zone, which can be represented as "soundLocation":"1" in the state machine. It can be understood that when configuring the same state machine template, multiple sound zones are also selected as wake-up sound zone conditions.
  • the condition needs to be set for the driver or co-driver as the wake-up sound zone that is, as long as the front row wakes up, the user's needs can be met, then it can be represented as "soundLocation":"1/2" in the state machine.
  • the dialogue sound zone is the sound zone where the voice assistant obtains the location of the user who is performing voice interaction.
  • the sound zone where the conversation is in progress is the dialogue sound zone.
  • the variable name can be set to "soundArea", as mentioned above, the type is a string (str) class. Among them, if the left rear, middle, and right rear sound zones are having conversations at the same time, the dialogue sound zone is all the sound zones in the back row, which can be represented as "soundArea":"LR/MR/RR" in the state machine.
  • the rejection sublabel information includes valid voice requests and invalid voice requests.
  • the validity or invalidity of the voice request is determined by the rejection processing of the state machine.
  • the variable name can be set to "rejSublabel", and the type is a string (str) class. For example, in this application, there are two types of valid voice requests “clear” and invalid voice requests “noise”.
  • the confidence information of the rejected sublabel represents the credibility of the rejected sublabel.
  • the variable name can be set to "rejSublabel", and the type is a floating point class. In this application, it can be a floating point number between 0.00 and 1.00.
  • the rejection mode state information is information used to indicate the rejection processing state of the state machine for any voice request, including the current state and the target state.
  • the variable names can be set to "source” and “dest” respectively, and the type is a string (str) class. For example, in this application, there are two types of valid voice requests “clear” and invalid voice requests “noise”.
  • the acquired voice request is matched with a pre-written state machine configuration template to determine a target state machine configuration template corresponding to the current voice request.
  • the user requirement is "If the front row wakes up, the back row enters the first rejection process" can be configured as
  • a template written according to the requirement of "if the front row is woken up, the back row enters the first rejection processing" can be matched as the target state machine configuration template.
  • the relevant information of the specific voice request is matched with the pre-written state description template to determine the state description template that matches the current state information.
  • Step 02121 includes:
  • 021211 Match a pre-written logic description template according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information and the current rejection mode state information to determine the target logic description template.
  • the processor is used to match the pre-written logic description template to determine the target logic description template according to the matching round information, the wake-up sound zone information, the conversation sound zone information, the rejection sub-tag information, the rejection sub-tag confidence information and the current rejection mode state information.
  • the static description of the specific logic rule variables needs to be filled in the logic description template.
  • the static description of the rule variables should correspond to the state variable items one by one to form a state trigger. That is, when the static description conditions of the specific state variables are filled in the state trigger, it can be judged whether the current scene state meets the state machine jump condition.
  • the state trigger name can be set to "triggerName” and the type is string (str).
  • the variable names of the matching round information, wake-up sound zone information, conversation sound zone information, rejection sub-tag information, rejection sub-tag confidence information and current rejection mode state information are the same as those given in step 02121 above and are not repeated here.
  • the acquired voice request is matched with a pre-written state machine configuration template to determine a target state machine configuration template corresponding to the current voice request.
  • the user requirement is "If the front row wakes up, the back row enters the first rejection process" can be configured as
  • a template written according to the requirement of "if the front row is woken up, the back row enters the first rejection processing" can be matched as the target state machine configuration template.
  • the relevant information of the specific voice request is matched with the pre-written logic description template to determine the logic description template that matches the current state information.
  • Step 03 includes:
  • 031 Map the state description template and the logic description template parsed by the parser through the logic calculation class and calculate the matching state.
  • the processor is used to map the state description template and the logic description template parsed by the parser through the logic calculation class and calculate the matching state.
  • the logic calculation class maps the state description template and the logic description template parsed by the parser to each other according to the one-to-one correspondence principle and performs logical calculation to obtain the matching state.
  • the logical calculation class can define functions "exist”, “less_than”, “more_than”, and “equal” to perform logical judgments, and map the state template and the logical template according to the principle of one-to-one correspondence.
  • the logic calculation module can compare the current actual state description template parsed by the parser with the constructed logic description template and calculate the matching state for subsequent state machine jumps.
  • Step 04 includes:
  • 041 When the matching state is a successful match, update the rejection processing of each audio zone in the vehicle cabin through the state machine action class to complete the voice interaction.
  • the processor updates the rejection processing of each audio zone in the vehicle cabin through the state machine action class to complete the voice interaction when the matching state is successful.
  • the state machine action class updates the rejection processing of each sound zone in the vehicle cabin to complete the voice interaction.
  • the state machine action class can define functions "get_parser”, “get_transition” and “get_trigger” to obtain the parser, current jump action and jump status respectively.
  • when the matching state is a successful match, that is, when the output result "self.result" of the logic calculation class is "match", the state machine action class can update the rejection processing of each sound zone in the vehicle cabin through the "get_transition" function to complete the voice interaction.
  • the state machine jump performed by the "get_transition" function can be implemented using the Machine class of the third-party Python transitions toolkit.
  • the state machine action class determines the current state information and the matching of logic rules according to the output of the logic calculation class, and can convert the state of the state machine, update the rejection processing of each sound zone in the vehicle cabin, and complete the voice interaction process.
  • Step 04 further includes:
  • 042 When the matching state is not a successful match, maintain the rejection processing of each sound zone in the vehicle cabin through the state machine action class to complete the voice interaction.
  • the processor is used to maintain the rejection processing of each sound zone in the vehicle cabin through the state machine action class to complete the voice interaction when the matching state is unsuccessful.
  • when the matching state is unsuccessful, the state machine action class does not update the rejection processing of each audio zone, and the state machine maintains its current state to complete the voice interaction.
  • the state machine action class can define functions "get_parser”, “get_transition” and “get_trigger” to obtain the parser, current jump action and jump status respectively.
  • when the matching state is unsuccessful, that is, when the output result "self.result" of the logic calculation class is not "match", the rejection processing of each audio zone is not updated, the state machine maintains the status quo, and the voice interaction is completed.
  • the state machine action class determines the current state information and logic rules according to the output of the logic calculation class. If there is no match, the state machine state may not be changed, and the rejection processing of each sound zone in the vehicle cabin may be maintained to complete the voice interaction process.
  • the computer-readable storage medium of the present application stores a computer program, and when the computer program is executed by one or more processors, the above method is implemented.
  • Example 1 User requirements and specific configurations are shown in Table 1.
  • “soundLocation”:"1/2" represents front row wake-up
  • “soundArea”:”LR/RR/MR” represents that the current speaker is in the back row, that is, the dialogue area is the left rear, middle and right rear areas
  • “turns”:2 represents 2 rounds of matching, that is, the number of voice requests issued by the back row in the dialogue area is 2 times
  • "rejSublabel”:”clear” represents that only valid voice requests are included in the voice assistant count
  • “source”:”*” represents that the current voice assistant can be in any rejection mode state
  • "dest”:”loose” represents that the current voice assistant target state is the second rejection processing, that is, no matter what rejection processing the current voice assistant is in, if the current state meets the template requirements, it needs to be maintained or jumped to the second rejection processing.
  • Example 2 User requirements and specific configurations are shown in Table 2.
  • “soundLocation”:"1/2" represents front row wake-up
  • “soundArea”:”LR/RR/MR” represents that the current speaker is in the back row, that is, the dialogue area is the left rear, middle and right rear areas
  • “turns”:3 represents 3 rounds of matching, that is, the number of voice requests issued by the back row to the dialogue area is 3 times
  • "rejSublabel”:”noise” represents the count of invalid voice requests recognized by the voice assistant
  • “source”:”*” represents that the current voice assistant can be in any rejection processing state
  • "dest”:"tight” represents that the current voice assistant target state is the first rejection processing, that is, no matter what rejection processing the current voice assistant is in, if the current state meets the template requirements, it needs to be maintained or jumped to the first rejection processing.
  • Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, fragment or portion of executable requested code including one or more steps for implementing a specific logical function or process, and the scope of the preferred embodiments of the present application includes alternative implementations in which functions may not be performed in the order shown or discussed, including performing functions in a substantially simultaneous manner or in the reverse order depending on the functions involved, which should be understood by technicians in the technical field to which the embodiments of the present application belong.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mechanical Engineering (AREA)
  • Navigation (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application discloses a voice interaction method, comprising: receiving a user voice request forwarded by a vehicle after the vehicle voice function is awakened; loading a state machine configuration template according to the user voice request so as to parse the state machine configuration template and obtain a parser; performing logical calculation according to the parser to obtain a matching state; and updating the rejection processing of each sound zone in the vehicle cabin according to the matching state to complete the voice interaction. In the present application, the vehicle cabin is divided into multiple sound zones, and the state machine configuration template is loaded for a voice request forwarded by the vehicle, so that the state machine configuration template can be parsed to obtain a parser. The parser can determine how the current state matches the rules of the state machine configuration template, and the switching or change of the state machine state is confirmed according to the match. The configurable template in the state machine is convenient for users to set or change according to specific needs, has strong scalability, and provides a better user experience.

Description

语音交互方法、服务器及计算机可读存储介质
本申请要求于2022年10月19日提交国家知识产权局、申请号为2022112763982、申请名称为“语音交互方法、服务器及计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音技术领域,特别涉及一种语音交互方法、服务器及计算机可读存储介质。
背景技术
随着自动驾驶技术的发展,车辆可以支持语音控制服务,如语音控制车窗开启等。在实际用车场景中,用户可能从车内多个音区发出语音,且发出的语音并不都是对车载系统的请求,这就要求车载语音处理器能够在所有语音中拒绝识别无用信息,提取针对自己的语音请求并做出响应。
相关技术中,对于语音请求的拒识处理通常仅能够针对单音区场景,通过结合当前文本信息、自动语音识别技术、置信度表征语音特征等实现在单音区场景下对无关语音请求的拒识,无法满足对于车辆内多音区语音交互的需求。
发明内容
为解决或部分解决相关技术中存在的问题,本申请提供一种语音交互方法、服务器及计算机可读存储介质,能满足对于车辆内多音区语音交互的需求。
本申请的语音交互方法,包括:
接收所述车辆转发的在车辆语音功能被唤醒后的用户语音请求;
根据所述用户语音请求加载状态机配置模板以解析所述状态机配置模板得到解析器;
根据所述解析器进行逻辑计算得到匹配状态;
根据所述匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互。
如此,本申请中,将车辆座舱划分为多个音区,针对接收到车辆转发的语音请求,来加载状态机配置模板,从而能够解析状态机配置模板得到解析器。解析器能够判断当前所处的状态与状态机配置模板的规则的匹配情况,从而根据匹配情况,确认状态机状态的切换或改变。状态机中的可配置模板方便用户根据具体需求进行设置或改变,具有较强的伸缩性,用 户体验较佳。
所述根据所述用户语音请求加载状态机配置模板以解析所述状态机配置模板得到解析器,包括:
根据所述用户语音请求在预先编写的状态机配置模板中确定目标状态机配置模板;
通过模板解析类加载所述目标状态机配置模板并解析目标状态机配置模板得到所述解析器。
如此,模板加载类可将语音请求的各项具体信息填写进状态机配置模板中,并定义加载和处理的方法得到相应状态和逻辑配置下的解析器,以便后续的逻辑计算或更多模板的引入。
所述根据所述用户语音请求在预先编写的状态机配置模板中确定目标状态机配置模板,包括:
确定所述用户语音请求对应的匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和状态机的当前拒识模式状态信息;
根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的所述状态机配置模板中进行匹配以确定所述目标状态机配置模板。
如此,确定用户语音请求对应的匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和状态机的当前拒识模式状态信息,并根据语音请求的上述信息在预先编写的状态机配置模板中进行匹配,以确定与当前状态信息符合的状态机配置模板。
所述根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的所述状态机配置模板中进行匹配以确定所述目标状态机配置模板,包括:
根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的状态描述模板进行匹配以确定目标状态描述模板。
如此,将具体语音请求的相关信息与预先编写的状态描述模板中进行匹配,以确定与当前状态信息符合的状态描述模板。
所述根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的所述状态机配置模板中进行匹配以确定所述目标状 态机配置模板,包括:
根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的逻辑描述模板进行匹配以确定目标逻辑描述模板。
如此,将具体语音请求的相关信息与预先编写的状态描述模板中进行匹配,以确定与当前状态信息符合的逻辑描述模板。
所述根据所述解析器进行逻辑计算得到匹配状态,包括:
通过逻辑计算类对所述解析器解析的所述状态描述模板和所述逻辑描述模板进行映射处理并计算得到所述匹配状态。
如此,逻辑计算类模块可对解析器解析的当前实际状态描述模板与已构建的逻辑描述模板进行对比计算,得到匹配状态,以便后续状态机跳转。
所述根据所述匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互,包括:
通过状态机动作类在所述匹配状态为匹配成功的情况下,更新所述车辆座舱内各个音区的拒识处理以完成语音交互。
如此,状态机动作类根据逻辑计算类输出确定当前状态信息和逻辑规则匹配,可以转换状态机状态,更新车辆座舱内各个音区的拒识处理,完成语音交互过程。
所述根据所述匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互,包括:
通过状态机动作类在所述匹配状态为未匹配成功的情况下,保持所述车辆座舱内各个音区的拒识处理以完成语音交互。
如此,状态机动作类根据逻辑计算类输出确定当前状态信息和逻辑规则不匹配,可以不转换状态机状态,保持车辆座舱内各个音区的拒识处理,完成语音交互过程。
所述状态机配置模板包括可填入的包括业务规则、响应轮数、拒识子标签及其置信度信息的关于语音请求的各类标签信息。
所述状态机配置模板包括可填入的关于语音请求的部分相关信息的条件判断规则语句。
所述拒识子标签信息包括有效语音请求和无效语音请求,其中判断语音请求的有效或无效由状态机的拒识模式确定。
本申请的服务器,包括存储器和处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,实现上述的方法。
本申请的计算机可读存储介质,存储有计算机程序,当所述计算机程序被一个或多个处理器执行时,实现上述的方法。本申请的实施方式的附 加方面和优点将在下面的描述中部分给出,部分将从下面的描述中变得明显,或通过本申请的实施方式的实践了解到。
附图说明
通过结合附图对本申请示例性实施方式进行更详细的描述,本申请的上述以及其它目的、特征和优势将变得更加明显,其中,在本申请示例性实施方式中,相同的参考标号通常代表相同部件。
图1是本申请语音交互方法的流程示意图;
图2是本申请车辆座舱的示意图。
具体实施方式
下面将参照附图更详细地描述本申请的实施方式。虽然附图中显示了本申请的实施方式,然而应该理解,可以以各种形式实现本申请而不应被这里阐述的实施方式所限制。相反,提供这些实施方式是为了使本申请更加透彻和完整,并且能够将本申请的范围完整地传达给本领域的技术人员。
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。
下面详细描述本申请,本申请的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,旨在用于解释本申请,而不能理解为对本申请的限制。
请参阅图1,本申请提供一种语音交互方法,包括:
01:接收车辆转发的在车辆语音功能被唤醒后的用户语音请求;
02:根据用户语音请求加载状态机配置模板以解析状态机配置模板得到解析器;
03:根据解析器进行逻辑计算得到匹配状态;
04:根据匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互。
本申请还提供了一种服务器,服务器包括存储器和处理器。本申请的语音交互方法可以由本申请的服务器实现。其中,存储器中存储有计算机程序,处理器用于接收车辆转发的在车辆语音功能被唤醒后的用户语音请求,以及根据用户语音请求加载状态机配置模板以解析状态机配置模板得到解析器,以及根据解析器进行逻辑计算得到匹配状态,以及根据匹配状 态更新车辆座舱内各个音区的拒识处理以完成语音交互。
其中,车载系统的语音助手为座舱内的用户提供诸多便利,用户可以通过语音交互实现对软件或座舱内车辆零部件的控制。为了交互便利,语音助手可支持连续对话,由于车内空间属于共享环境,语音助手可能会面临接收到来自不同用户与语音助手之间的对话、不同用户之间的对话等。通过设置语义拒识规则,可满足语音助手对再次出现的相同语音请求给出相同的反馈,同时希望语音助手对具体语音请求产生反馈规则能够尽可能方便地按用户需求进行修改,从而能够更好地为用户服务,提升用户进行语音交互的使用体验。
可以理解,在多音区连续对话的场景中,也即是,在语音助手被唤醒后,支持座舱内不同位置处的用户共同与语音助手进行多轮对话的场景。多个用户可能围绕同一主题进行自由度较高的交互,相较于单一音区的情况更为复杂,语义拒识规则的设置需更为细致。
唤醒车辆语音功能也即是唤醒车辆的语音助手,唤醒语音请求可以是由厂商设定或用户自定义的唤醒词。在语音助手被唤醒后,座舱内用户可与语音助手进行连续多轮对话。在对话达到设定的轮次阈值,或在预定时间内没有接收到用户的语音请求等情况后,对话结束。
唤醒音区也即是发出唤醒语音请求的用户所在的音区位置。如,主驾唤醒语音助手,那么唤醒音区就是主驾音区。唤醒音区信息也即是唤醒音区对应的音区位置信息。
对话音区也即是语音助手获取到的正在进行语音交互的用户所在的音区位置,正在进行对话的音区即为对话音区。如,在某一场景中,在语音助手被唤醒后,主驾用户与副驾用户先后与语音助手进行交互,则在该场景中,主驾用户和副驾用户发出的语音请求先后被语音助手获取,主驾用户和副驾用户所在音区都属于对话音区。对话音区与唤醒音区可以相同或不同。
拒识处理用于在交互过程中甄别出用户的语音请求哪些是对语音助手说的,将其进行召回并执行,哪些不是对语音助手说的,将其作为噪声过滤。本申请中,提供两种拒识程度不同的拒识处理,其中,拒识程度高,仅召回相关度高的语音请求的拒识处理为第一拒识处理,拒识程度低的拒识处理为第二拒识处理。
本申请中,引入状态机,状态机用于记录在语音交互过程中各个音区的拒识模式,并不断地在本申请的语音交互过程中根据接收到的对应音区信息和用户的语音请求进行状态机的更新。实际用车场景中,用户对语音助手的拒识规则要求不一定是一成不变的。当语音助手被唤醒后,各音区 的拒识处理需要跟随语音交互的进程更新。用户会根据自己需求的改变从而修改语音助手的拒识规则,模块化的状态机配置模板保证用户方便添加、删减或修改语音助手的具体拒识规则。
综上所述,本申请中,将车辆座舱划分为多个音区,针对接收到车辆转发的语音请求,来加载状态机配置模板,从而能够解析状态机配置模板得到解析器。解析器能够判断当前所处的状态与状态机配置模板的规则的匹配情况,从而根据匹配情况,确认状态机状态的切换或改变。状态机中的可配置模板方便用户根据具体需求进行设置或改变,具有较强的伸缩性,用户体验较佳。
步骤02包括:
021:根据用户语音请求在预先编写的状态机配置模板中确定目标状态机配置模板;
022:通过模板解析类加载目标状态机配置模板并解析目标状态机配置模板得到解析器。
处理器用于根据用户语音请求在预先编写的状态机配置模板中确定目标状态机配置模板,以及用于通过模板解析类加载目标状态机配置模板并解析目标状态机配置模板得到解析器。
其中,本申请中,提供状态机配置模板供用户进行配置,包括状态描述模板(可简称为状态模板)和逻辑描述模板(可简称为逻辑模板)。在状态机配置模板完成后,模板解析类变量加载目标状态机配置模板并解析目标状态机配置模板得到解析器。本申请能够从计算机存储器中加载对应逻辑模块和状态机跳转模块,使状态机得以完成后续逻辑判断。
以第一拒识处理为例,配置项"tight_state_template"为键值队列表(dict)类型条件集合,即状态模板,其中存在可填入的关于语音请求的各类标签信息,包括业务规则、响应轮数、拒识子标签及其置信度等信息。配置项"tight_logical_template"为键值队列表(dict)类型条件集合,即逻辑模板,其中存在可填入的关于语音请求的部分相关信息的条件判断规则语句。填写好的状态模板和逻辑模板会归入可被模板解析类解析的"self.state_template"和"self.state_template"模块中,模板解析类定义函数"load_state_template"和"load_logical_template"加载用户输入的状态模板"tight_state_template"和逻辑模板"tight_logical_template",最后处理状态和逻辑的对应关系并定义处理函数"process_logical_template"作为输出的解析器。
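The configuration items "tight_state_template" and "tight_logical_template" and the functions "load_state_template", "load_logical_template" and "process_logical_template" described above can be pictured with a minimal Python sketch. The template contents below are copied from the "front-row wake-up" example given later in this description; the class layout, the paired return structure and any names not quoted in the text are illustrative assumptions rather than the patent's actual implementation.

    # Illustrative sketch only; names not quoted in the description are assumptions.
    tight_state_template = {
        "triggerName": "front_wakeup",
        "source": "*",                      # current rejection mode is unrestricted
        "triggerDetail": {
            "soundLocation": "1/2",         # wake-up zone: driver or co-driver
            "soundArea": "LR/MR/RR",        # conversation zones: rear row
            "turns": None, "rejSublabel": None, "rejConf": None,
        },
        "dest": "tight",                    # target state: first (strict) rejection processing
    }

    tight_logical_template = {
        "triggerName": "front_wakeup",
        "source": None,
        "triggerDetail": {
            "soundLocation": "exist",       # actual zone must exist in the state template's set
            "soundArea": "exist",
            "turns": None, "rejSublabel": None, "rejConf": None,
        },
        "dest": None,
    }

    class TemplateParser:
        """Loads one state template and one logic template and pairs their entries."""

        def __init__(self):
            self.state_template = {}
            self.logical_template = {}

        def load_state_template(self, template: dict) -> None:
            self.state_template = template

        def load_logical_template(self, template: dict) -> None:
            self.logical_template = template

        def process_logical_template(self) -> dict:
            # Pair each state value with the logic rule that constrains it (one-to-one).
            detail = self.state_template.get("triggerDetail", {})
            rules = self.logical_template.get("triggerDetail", {})
            return {
                "trigger": self.state_template.get("triggerName"),
                "source": self.state_template.get("source"),
                "dest": self.state_template.get("dest"),
                "pairs": {k: (detail.get(k), rules.get(k)) for k in detail},
            }

    parser = TemplateParser()
    parser.load_state_template(tight_state_template)
    parser.load_logical_template(tight_logical_template)
    parsed = parser.process_logical_template()   # consumed by the logic calculation class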
可以理解地,模板解析类形成的解析器可以方便逻辑处理类进行进一步计算。并且除状态模板和逻辑模板外,解析器可以打包处理两个以上甚 至更多的类型模板,更方便状态和逻辑的解析。
如此,模板加载类可将语音请求的各项具体信息填写进状态机配置模板中,并定义加载和处理的方法得到相应状态和逻辑配置下的解析器,以便后续的逻辑计算或更多模板的引入。
步骤021包括:
0211:确定用户语音请求对应的匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和状态机的当前拒识模式状态信息;
0212:根据匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息,在预先编写的状态机配置模板中进行匹配以确定目标状态机配置模板。
处理器用于确定用户语音请求对应的匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和状态机的当前拒识模式状态信息,并根据匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息,在预先编写的状态机配置模板中进行匹配以确定目标状态机配置模板。
座舱内根据用户可能发声的区域划分为不同的音区,请参阅图2,以五座车辆为例,车辆座舱内可划分为包括主驾音区、副驾音区、后排左侧即左后音区、后排中间即中间音区以及后排右侧即右后音区等在内的5个音区。在配置状态机模板时,可选择一个或多个音区作为状态条件配置内容,座舱内可设置有多个语音拾取装置,从而根据获取到的语音请求的状态信息判断发出语音请求的用户所在的音区位置信息。
其中,状态机配置模板中需要存在条件变量以便将具体变量的静态描述填入,形成状态触发器。状态触发器名称可设为"triggerName",类型为字符串(str)类。还可建立键值队列表(dict)类型条件集合,名称可设置为"triggerDetail",可在表内填入无序并列的状态变量信息。
其中,匹配轮次信息表征语音助手唤醒后音区内用户发出语音请求的次数。变量名称可设置为"turns",数据类型为整型(int)。
唤醒音区信息即是唤醒音区对应的音区位置信息,唤醒音区也即是发出唤醒语音请求的用户所在的音区位置。变量名称可设置为"soundLocation",如上所述,类型为整型(int)类。
对话音区也即是语音助手获取到的正在进行语音交互的用户所在的音区位置,正在进行对话的音区即为对话音区。变量名称可设置为"soundArea",如上所述,类型为字符串(str)类。
拒识子标签信息包括有效语音请求和无效语音请求,判断语音请求的 有效或无效由状态机的拒识模式确定。变量名称可设置为"rejSublabel",类型为字符串(str)类。
拒识子标签置信度信息表征拒识子标签的可信程度。变量名称可设置为"rejSublabel",类型为浮点(float)类。
拒识模式状态信息即是用于表示状态机对于任意语音请求的拒识处理状态的信息,包括当前状态和目标状态。变量名称可分别设置为"source"和"dest",类型为字符串(str)类。
将获取的语音请求与预先编写的状态机配置模板进行匹配,确定对应当前语音请求的目标状态机配置模板。
在一个示例中,用户需求为“前排唤醒,后排进入第一拒识处理”,状态机设置模板需要具体配置的变量有唤醒音区信息"soundLocation",对话音区信息"soundArea",及目标拒识处理的状态信息"dest",而拒识子标签,拒识子标签置信度,匹配轮次信息及当前拒识处理可不设置或设置为任意状态。举例而言,即{"source":"*","triggerDetail":{"turns":null,"rejSublabel":null,"rejConf":null}}。其中"source":"*"代表不限定当前拒识模式的状态;"turns":null,"rejSublabel":null,"rejConf":null代表匹配轮次、拒识子标签和拒识子标签置信度信息规则均不设置。
如此,确定用户语音请求对应的匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和状态机的当前拒识模式状态信息,并根据语音请求的上述信息在预先编写的状态机配置模板中进行匹配,以确定与当前状态信息符合的状态机配置模板。
步骤0212包括:
02121:根据匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息,在预先编写的状态描述模板进行匹配以确定目标状态描述模板。
处理器用于根据匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息,在预先编写的状态描述模板进行匹配以确定目标状态描述模板。
其中,状态机配置模板中需要将当前场景下各状态变量具体的静态描述填入,形成状态触发器,即当状态触发器中填入具体的状态变量的静态描述条件后,可以判断当前场景状态是否满足状态机跳转条件。状态触发器的名称可设为"triggerName",类型为字符串(str)类。还可建立键值对列表(dict)类型数据集合,名称可设置为"triggerDetail",可在表内填入无序并列的状态变量信息。
其中,匹配轮次信息表征语音助手唤醒后音区内用户发出语音请求的 次数。变量名称可设置为"turns",数据类型为整型(int),即变量可取所有自然数。
特别地,为了区分唤醒音区和对话音区,对于唤醒音区和对话音区可用不同标识方法,如本申请中,对于主驾、副驾、左后、中间、右后五个音区,若音区为唤醒音区,则可分别用整型(int)1、2、3、4、5表示;若音区为当前对话音区,则可分别用字符串(str)LF、RF、LR、MR、RR来表示。
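Read literally, the encoding above pairs each of the five zones with an integer for the wake-up role and a string for the conversation role. A small lookup table, assuming the "respectively" pairing follows the order driver, co-driver, rear-left, rear-middle, rear-right, could look like the sketch below; the English zone names are only labels for illustration.

    # Wake-up zones are encoded as int, conversation zones as str, per the description.
    SOUND_ZONES = {
        "driver":      {"wakeup": 1, "conversation": "LF"},
        "co_driver":   {"wakeup": 2, "conversation": "RF"},
        "rear_left":   {"wakeup": 3, "conversation": "LR"},
        "rear_middle": {"wakeup": 4, "conversation": "MR"},
        "rear_right":  {"wakeup": 5, "conversation": "RR"},
    }

    def conversation_code(zone: str) -> str:
        """Code used when the zone is a current conversation zone, e.g. "LR"."""
        return SOUND_ZONES[zone]["conversation"]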
唤醒音区信息即是唤醒音区对应的音区位置信息,唤醒音区也即是发出唤醒语音请求的用户所在的音区位置。变量名称可设置为"soundLocation",如上所述,类型为整型(int)类。其中,如主驾唤醒语音助手,那么唤醒音区就是主驾音区,在状态机中可表示为"soundLocation":"1"。可以理解地,在配置同一个状态机模板时,还选择多个音区作为唤醒音区条件,例如,若需设置条件为主驾或副驾作为唤醒音区,即只要是前排唤醒都能满足用户需求,则状态机中可表示为"soundLocation":"1/2"。
对话音区也即是语音助手获取到的正在进行语音交互的用户所在的音区位置,正在进行对话的音区即为对话音区。变量名称可设置为"soundArea",如上所述,类型为字符串(str)类。其中,如左后、中间、右后音区同时进行对话,则对话音区为后排所有音区,在状态机中可表示为"soundArea":"LR/MR/RR"。
拒识子标签信息包括有效语音请求和无效语音请求,判断语音请求的有效或无效由状态机的拒识处理确定。变量名称可设置为"rejSublabel",类型为字符串(str)类。如在本申请中,存在有效语音请求"clear"和无效语音请求"noise"两种。
拒识子标签置信度信息表征拒识子标签的可信程度。变量名称可设置为"rejSublabel",类型为浮点(float)类。在本申请中,可取0.00至1.00的浮点数。
拒识模式状态信息即是用于表示状态机对于任意语音请求的拒识处理状态的信息,包括当前状态和目标状态。变量名称可分别设置为"source"和"dest",类型为字符串(str)类。如在本申请中,存在有效语音请求"clear"和无效语音请求"noise"两种。
将获取的语音请求与预先编写的状态机配置模板进行匹配,确定对应当前语音请求的目标状态机配置模板。
在一个示例中,用户需求为“如前排唤醒,则后排进入第一拒识处理”可以配置为
{"triggerName":"front_wakeup","source":"*","triggerDetail":{"soundLocation":"1/2","soundArea":"LR/RR/MR","turns":null,"rejSublabel":null,"rejConf":null},"dest":"tight"}。其中"soundLocation":"1/2"代表前排唤醒;"soundArea":"LR/RR/MR"代表当前说话人是后排;"turns":null,"rejSublabel":null,"rejConf":null代表规则不设置;"source":"*"代表当前可以是任意状态;"dest":"tight"代表目标状态是第一拒识处理。
在交互过程中,根据获取的前排唤醒的语音请求,就可以匹配到根据“如前排唤醒,则后排进入第一拒识处理”的需求所编写的模板作为目标状态机配置模板。
如此,将具体语音请求的相关信息与预先编写的状态描述模板中进行匹配,以确定与当前状态信息符合的状态描述模板。
步骤02121包括:
021211:根据匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息,在预先编写的逻辑描述模板进行匹配以确定目标逻辑描述模板。
处理器用于根据匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息,在预先编写的逻辑描述模板进行匹配以确定目标逻辑描述模板。
其中,逻辑描述模板中需要将具体逻辑规则变量的静态描述填入,规则变量的静态描述应与状态变量项一一对应,形成状态触发器,即当状态触发器中填入具体的状态变量的静态描述条件后,可以判断当前场景状态是否满足状态机跳转条件。状态触发器名称可设为"triggerName",类型为字符串(str)类。还可建立键值对列表(dict)类型数据集合,名称可设置为"triggerDetail",可在表内填入无序并列的逻辑规则信息。
其中匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和当前拒识模式状态信息的变量名称参见步骤02121公开内容,在此不作赘述。
特别地,逻辑描述模板中键值对列表(dict)所含所有规则,应均为逻辑判断语句,故将语音请求的各类逻辑规则判断结果均设置为字符串(str)类型变量,并且可设置存在"exist"、少于"less_than"、不少于"more_than"和等于"exist"四种逻辑判断结果,其中,少于"less_than"和不少于"more_than"仅支持数值类型判断,包括整型(int)和浮点型(float),存在"exist"和等于"exist"则同时支持数值类型和字符串类型的判断。
将获取的语音请求与预先编写的状态机配置模板进行匹配,确定对应当前语音请求的目标状态机配置模板。
在一个示例中,用户需求为“如前排唤醒,则后排进入第一拒识处理”可以配置为
{"triggerName":"front_wakeup","source":null,"triggerDetail":{"soundLocation":"exist","soundArea":"exist","turns":null,"rejSublabel":null,"rejConf":null},"dest":null}。其中"soundLocation":"exist"代表当前唤醒音区存在状态模板的"soundLocation":"1/2"中;"soundArea":"exist"代表当前说话人音区存在状态模板的"soundArea":"LR/RR/MR"中;"turns":null,"rejSublabel":null,"rejConf":null代表规则不设置;"source":null代表规则不设置,即当前可以是任意状态;"dest":"tight"代表当前目标状态是第一拒识处理。
在交互过程中,根据获取的前排唤醒的语音请求,就可以匹配到根据“如前排唤醒,则后排进入第一拒识处理”的需求所编写的模板作为目标状态机配置模板。
如此,将具体语音请求的相关信息与预先编写的状态描述模板中进行匹配,以确定与当前状态信息符合的逻辑描述模板。
步骤03包括:
031:通过逻辑计算类对解析器解析的状态描述模板和逻辑描述模板进行映射处理并计算得到匹配状态。
处理器用于通过逻辑计算类对解析器解析的状态描述模板和逻辑描述模板进行映射处理并计算得到匹配状态。
其中,本申请中,逻辑计算类存在与逻辑描述模板中将解析器解析的状态描述模板和逻辑描述模板根据一一对应的原则进行映射处理,并进行逻辑计算,得到匹配状态。
以需求为“前排唤醒,后排进入第一拒识处理”的第一拒识处理跳转为例,首先获取逻辑描述模板"tight_logical_template"的"triggerDetail"表中不为"null"的规则,即为唤醒音区"soundLocation"变量和对话音区"soundArea"变量。逻辑计算类可定义函数"exist"、"less_than"、"more_than"、"equal"进行逻辑判断,并将状态模板和逻辑模板按一一对应的原则进行映射处理。此例中,即判断当前系统实际的"soundLocation"和"soundArea"变量的值是否存在于逻辑描述模板"tight_logical_template"的限定值范围之内。若都满足,则可将输出结果"match"存入字符串(str)类型数据"self.result"中;若不满足,则输出其他结果或不输出任何结果直接跳出处理进程。
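The four judgment functions "exist", "less_than", "more_than" and "equal" and the "match" result stored in "self.result" can be sketched in Python as follows. The method signatures and the "pairs" input format (each variable mapped to its state-template value and its logic rule) are assumptions made for illustration, and "more_than" is treated as "not less than" as the description states.

    class LogicCalculator:
        """Applies the logic-template rules to the current scene values."""

        def __init__(self):
            self.result = ""

        @staticmethod
        def exist(actual, allowed) -> bool:
            # "exist": the actual value appears in the state template's "/"-separated set.
            return str(actual) in str(allowed).split("/")

        @staticmethod
        def less_than(actual, bound) -> bool:
            return float(actual) < float(bound)

        @staticmethod
        def more_than(actual, bound) -> bool:
            # "not less than" in the description, hence >= rather than >.
            return float(actual) >= float(bound)

        @staticmethod
        def equal(actual, expected) -> bool:
            return actual == expected

        def compute(self, pairs: dict, current: dict) -> str:
            # pairs: {variable: (state-template value, logic rule name or None)}
            checks = {"exist": self.exist, "less_than": self.less_than,
                      "more_than": self.more_than, "equal": self.equal}
            for name, (allowed, rule) in pairs.items():
                if rule is None:
                    continue                 # rule set to null: not checked
                if not checks[rule](current.get(name), allowed):
                    return self.result       # failed: "match" is never stored
            self.result = "match"
            return self.result

    calc = LogicCalculator()
    pairs = {"soundLocation": ("1/2", "exist"), "soundArea": ("LR/MR/RR", "exist")}
    calc.compute(pairs, {"soundLocation": 1, "soundArea": "LR"})   # -> "match"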
进一步地,逻辑计算类的计算方法会随着处理项目增加而增加。
如此,逻辑计算类模块可对解析器解析的当前实际状态描述模板与已 构建的逻辑描述模板进行对比计算,得到匹配状态,以便后续状态机跳转。
步骤04包括:
041:通过状态机动作类在匹配状态为匹配成功的情况下,更新车辆座舱内各个音区的拒识处理以完成语音交互。
处理器通过状态机动作类在匹配状态为匹配成功的情况下,更新车辆座舱内各个音区的拒识处理以完成语音交互。
其中,状态机动作类在匹配状态为匹配成功的情况下,更新车辆座舱内各个音区的拒识处理以完成语音交互。
以需求为“前排唤醒,后排进入第一拒识处理模式”的第一拒识处理跳转为例,状态机动作类可定义函数"get_parser","get_transition"和"get_trigger"分别得到解析器、当前跳转动作和跳转状态,在匹配状态为匹配成功,即逻辑运算类输出结果"self.result"为"match"的情况下,状态机动作类可通过"get_transition"函数更新车辆座舱内各个音区的拒识处理以完成语音交互。
进一步地,"get_transition"函数所进行的状态机跳转可以使用Python自带的transition工具包Machine类实现。
如此,状态机动作类根据逻辑计算类输出确定当前状态信息和逻辑规则匹配,可以转换状态机状态,更新车辆座舱内各个音区的拒识处理,完成语音交互过程。
步骤04包括:
042:通过状态机动作类在匹配状态为未匹配成功的情况下,保持车辆座舱内各个音区的拒识处理以完成语音交互。
处理器用于通过状态机动作类在匹配状态为未匹配成功的情况下,保持车辆座舱内各个音区的拒识处理以完成语音交互。
其中,状态机动作类在匹配状态为未匹配成功的情况下,不进行各个音区的拒识处理更新,状态机保持现状,完成语音交互。
以需求为“前排唤醒,后排进入第一拒识处理”的第一拒识处理跳转为例,状态机动作类可定义函数"get_parser","get_transition"和"get_trigger"分别得到解析器、当前跳转动作和跳转状态,在匹配状态为未匹配成功,即逻辑运算类输出结果"self.result"不为"match"的情况下,不进行各个音区的拒识处理更新,状态机保持现状,完成语音交互。
进一步地,在匹配状态为未匹配成功,逻辑运算类输出结果"self.result"不为"match"的情况下,可以通过输出其他匹配结果以达到状态机不发生跳转目标,也可以不输出结果直接跳出跳转流程,从而完成语音交互。
如此,状态机动作类根据逻辑计算类输出确定当前状态信息和逻辑规 则不匹配,可以不转换状态机状态,保持车辆座舱内各个音区的拒识处理,完成语音交互过程。
本申请的计算机可读存储介质,存储有计算机程序,当计算机程序被一个或多个处理器执行时,实现上述的方法。
以下通过两个场景示例对状态模板和逻辑模板的配置进行图示辅助说明:
示例一:用户需求和具体配置如表1。状态模板设置中,"soundLocation":"1/2"代表前排唤醒;"soundArea":"LR/RR/MR"代表当前说话人在后排,即对话音区为左后、中间和右后音区;"turns":2代表2轮匹配,即后排对话音区发出语音请求的次数为2次;"rejSublabel":"clear"代表仅有效语音请求被纳入语音助手计数;"source":"*"代表当前语音助手可以处在任意拒识模式状态;"dest":"loose"代表当前语音助手目标状态是第二拒识处理,即无论当前语音助手处在什么拒识处理下,如果当前状态符合模板要求都需要保持或跳转至第二拒识处理。逻辑模板配置中,"soundLocation":"exist"代表当前唤醒音区存在状态模板的"soundLocation":"1/2"中,即当前唤醒音区在前排;"soundArea":"exist"代表当前说话人音区存在状态模板的"soundArea":"LR/RR/MR"中,即当前对话音区在后排;"turns":more_than代表不少于匹配,即当前状态下后排进行对话轮次需要达到状态模板中设置的2次及以上;"rejSublabel":equal代表完全匹配,即仅识别有效语音请求;"source":null代表规则不设置,即不对现有拒识处理做任何限定;"dest":"loose"代表当前目标状态是第二拒识处理,即无论当前语音助手处在什么拒识处理下,如果当前状态符合模板要求都需要保持或跳转至第一拒识处理。

表1
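Table 1 itself is not reproduced in this text, so the following Python dicts reconstruct Example 1's state template and logic template from the prose above; the trigger name and the unset "rejConf" entries are assumptions, while the other values mirror those stated in the description. Example 2 (Table 2) differs only in "turns":3, "rejSublabel":"noise" and "dest":"tight".

    # Reconstruction of Example 1; trigger name and unset "rejConf" are assumptions.
    loose_state_template = {
        "triggerName": "rear_valid_two_turns",   # illustrative name, not given in the text
        "source": "*",                           # any current rejection mode
        "triggerDetail": {
            "soundLocation": "1/2",              # front-row wake-up
            "soundArea": "LR/RR/MR",             # conversation in the rear row
            "turns": 2,                          # two matching rounds
            "rejSublabel": "clear",              # only valid requests are counted
            "rejConf": None,
        },
        "dest": "loose",                         # target: second (lenient) rejection processing
    }

    loose_logical_template = {
        "triggerName": "rear_valid_two_turns",
        "source": None,                          # no constraint on the current mode
        "triggerDetail": {
            "soundLocation": "exist",
            "soundArea": "exist",
            "turns": "more_than",                # at least the configured number of rounds
            "rejSublabel": "equal",              # exact match on "clear"
            "rejConf": None,
        },
        "dest": "loose",
    }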
示例二:用户需求和具体配置如表2。状态模板设置中,"soundLocation":"1/2"代表前排唤醒;"soundArea":"LR/RR/MR"代表当前说话人在后排,即对话音区为左后、中间和右后音区;"turns":3代表3轮匹配,即后排对话音区发出语音请求的次数为3次;"rejSublabel":"noise"代表无效语音请求被语音助手识别计数;"source":"*"代表当前语音助手可以处在任意拒识处理状态;"dest":"tight"代表当前语音助手目标状态是第一拒识处理,即无论当前语音助手处在什么拒识处理下,如果当前状态符合模板要求都需要保持或跳转至第一拒识处理。逻辑模板配置中,"soundLocation":"exist"代表当前唤醒音区存在状态模板的"soundLocation":"1/2"中,即当前唤醒音区在前排;"soundArea":"exist"代表当前说话人音区存在状态模板的"soundArea":"LR/RR/MR"中,即当前对话音区在后排;"turns":more_than代表不少于匹配,即当前状态下后排进行对话轮次需要达到状态模板中设置的2次及以上;"rejSublabel":equal代表完全匹配,即仅对无效语音请求进行计数;"source":null代表规则不设置,即不对现有拒识处理做任何限定;"dest":"tight"代表当前目标状态是第一拒识处理,即无论当前语音助手处在什么拒识处理下,如果当前状态符合模板要求都需要保持或跳转至第一拒识处理。
表2
在本说明书的描述中,参考术语“上述”、“其中”、“可以理解地”、“进一步地”等的描述意指结合实施方式或示例描述的具体特征、结构、材料或者特点包含于本申请的至少一个实施方式或示例中。在本说明书中,对上述术语的示意性表述不一定指的是相同的实施方式或示例。而且,描述的具体特征、结构、材料或者特点可以在任何的一个或多个实施方式或示例中以合适的方式结合。此外,在不相互矛盾的情况下,本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。
流程图中或在此以其他方式描述的任何过程或方法描述可以被理解为,表示包括一个或更多个用于实现特定逻辑功能或过程的步骤的可执行请求的代码的模块、片段或部分,并且本申请的优选实施方式的范围包括另外的实现,其中可以不按所示出或讨论的顺序,包括根据所涉及的功能按基本同时的方式或按相反的顺序,来执行功能,这应被本申请的实施例所属技术领域的技术人员所理解。
以上已经描述了本申请的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。

Claims (13)

  1. 一种语音交互方法,其特征在于,包括:
    接收车辆转发的在车辆语音功能被唤醒后的用户语音请求;
    根据所述用户语音请求加载状态机配置模板以解析所述状态机配置模板得到解析器;
    根据所述解析器进行逻辑计算得到匹配状态;
    根据所述匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互。
  2. 根据权利要求1所述的语音交互方法,其特征在于,所述根据所述用户语音请求加载状态机配置模板以解析所述状态机配置模板得到解析器,包括:
    根据所述用户语音请求在预先编写的状态机配置模板中确定目标状态机配置模板;
    通过模板解析类加载所述目标状态机配置模板并解析目标状态机配置模板得到所述解析器。
  3. 根据权利要求2所述的语音交互方法,其特征在于,所述根据所述用户语音请求在预先编写的状态机配置模板中确定目标状态机配置模板,包括:
    确定所述用户语音请求对应的匹配轮次信息、唤醒音区信息、对话音区信息、拒识子标签信息、拒识子标签置信度信息和状态机的当前拒识模式状态信息;
    根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的所述状态机配置模板中进行匹配以确定所述目标状态机配置模板。
  4. 根据权利要求3所述的语音交互方法,其特征在于,所述根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的所述状态机配置模板中进行匹配以确定所述目标状态机配置模板,包括:
    根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的状态描述模板进行匹配以确定目标状态描述模板。
  5. 根据权利要求4所述的语音交互方法,其特征在于,所述根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签 信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的所述状态机配置模板中进行匹配以确定所述目标状态机配置模板,包括:
    根据所述匹配轮次信息、所述唤醒音区信息、所述对话音区信息、所述拒识子标签信息、所述拒识子标签置信度信息和所述当前拒识模式状态信息,在预先编写的逻辑描述模板进行匹配以确定目标逻辑描述模板。
  6. 根据权利要求5所述的语音交互方法,其特征在于,所述根据所述解析器进行逻辑计算得到匹配状态,包括:
    通过逻辑计算类对所述解析器解析的所述状态描述模板和所述逻辑描述模板进行映射处理并计算得到所述匹配状态。
  7. 根据权利要求6所述的语音交互方法,其特征在于,所述根据所述匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互,包括:
    通过状态机动作类在所述匹配状态为匹配成功的情况下,更新所述车辆座舱内各个音区的拒识处理以完成语音交互。
  8. 根据权利要求6所述的语音交互方法,其特征在于,所述根据所述匹配状态更新车辆座舱内各个音区的拒识处理以完成语音交互,包括:
    通过状态机动作类在所述匹配状态为未匹配成功的情况下,保持所述车辆座舱内各个音区的拒识处理以完成语音交互。
  9. 根据权利要求1至8任一项所述的语音交互方法,其特征在于:
    所述状态机配置模板包括可填入的包括业务规则、响应轮数、拒识子标签及其置信度信息的关于语音请求的各类标签信息。
  10. 根据权利要求1至8任一项所述的语音交互方法,其特征在于:
    所述状态机配置模板包括可填入的关于语音请求的部分相关信息的条件判断规则语句。
  11. 根据权利要求3所述的语音交互方法,其特征在于:
    所述拒识子标签信息包括有效语音请求和无效语音请求,其中判断语音请求的有效或无效由状态机的拒识模式确定。
  12. 一种服务器,其特征在于,所述服务器包括存储器和处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,实现权利要求1-11任一项所述的方法。
  13. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,当所述计算机程序被一个或多个处理器执行时,实现如权利要求1-11任意一项所述的方法。
PCT/CN2023/125013 2022-10-19 2023-10-17 语音交互方法、服务器及计算机可读存储介质 WO2024083128A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211276398.2 2022-10-19
CN202211276398.2A CN115376513B (zh) 2022-10-19 2022-10-19 语音交互方法、服务器及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2024083128A1 true WO2024083128A1 (zh) 2024-04-25

Family

ID=84072707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/125013 WO2024083128A1 (zh) 2022-10-19 2023-10-17 语音交互方法、服务器及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN115376513B (zh)
WO (1) WO2024083128A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376513B (zh) * 2022-10-19 2023-05-12 广州小鹏汽车科技有限公司 语音交互方法、服务器及计算机可读存储介质

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0945854A2 (en) * 1998-03-24 1999-09-29 Matsushita Electric Industrial Co., Ltd. Speech detection system for noisy conditions
CN103186416A (zh) * 2011-12-29 2013-07-03 比亚迪股份有限公司 构建多任务多分支过程的方法、状态机及执行方法
CN107316643A (zh) * 2017-07-04 2017-11-03 科大讯飞股份有限公司 语音交互方法及装置
US20170359707A1 (en) * 2016-06-08 2017-12-14 Google Inc. Providing a personal assistant module with a selectively-traversable state machine
CN111008532A (zh) * 2019-12-12 2020-04-14 广州小鹏汽车科技有限公司 语音交互方法、车辆和计算机可读存储介质
CN111063350A (zh) * 2019-12-17 2020-04-24 苏州思必驰信息科技有限公司 基于任务栈的语音交互状态机及其实现方法
CN112164401A (zh) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 语音交互方法、服务器和计算机可读存储介质
CN112927692A (zh) * 2021-02-24 2021-06-08 福建升腾资讯有限公司 一种自动语言交互方法、装置、设备和介质
CN113330513A (zh) * 2021-04-20 2021-08-31 华为技术有限公司 语音信息处理方法及设备
CN113990300A (zh) * 2021-12-27 2022-01-28 广州小鹏汽车科技有限公司 语音交互方法、车辆、服务器和计算机可读存储介质
CN115376513A (zh) * 2022-10-19 2022-11-22 广州小鹏汽车科技有限公司 语音交互方法、服务器及计算机可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6671669B1 (en) * 2000-07-18 2003-12-30 Qualcomm Incorporated combined engine system and method for voice recognition
CN107665708B (zh) * 2016-07-29 2021-06-08 科大讯飞股份有限公司 智能语音交互方法及系统
CN114267347A (zh) * 2021-11-01 2022-04-01 惠州市德赛西威汽车电子股份有限公司 一种基于智能语音交互的多模态拒识方法和系统
CN114155853A (zh) * 2021-12-08 2022-03-08 斑马网络技术有限公司 一种拒识方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN115376513A (zh) 2022-11-22
CN115376513B (zh) 2023-05-12

Similar Documents

Publication Publication Date Title
WO2024083128A1 (zh) 语音交互方法、服务器及计算机可读存储介质
US10796100B2 (en) Underspecification of intents in a natural language processing system
US11657797B2 (en) Routing for chatbots
CN108205627B (zh) 交互式助理模块对访问的有条件提供
US9772990B2 (en) Personal assistant context building
US20170229122A1 (en) Hybridized client-server speech recognition
WO2018149209A1 (zh) 语音识别方法、电子设备以及计算机存储介质
CN111989685A (zh) 跨域个性化词汇的学习方法及其电子装置
WO2018067260A1 (en) Task initiation using long-tail voice commands
US20210193108A1 (en) Voice synthesis method, device and apparatus, as well as non-volatile storage medium
US11929073B2 (en) Hybrid arbitration system
US11928430B2 (en) Detecting unrelated utterances in a chatbot system
US20200026977A1 (en) Electronic apparatus and control method thereof
CN112185369B (zh) 一种基于语音控制的音量调节方法、装置、设备和介质
CN112308541B (zh) 处理审批业务流程的方法、计算设备和计算机存储介质
US11074908B2 (en) System and method for aligning ASR model weights with NLU concepts
CN114822533A (zh) 语音交互方法、模型训练方法、电子设备和存储介质
WO2024120450A1 (zh) 语音交互方法、服务器及计算机可读存储介质
WO2024099375A1 (zh) 语音交互方法、服务器和存储介质
CN110287384B (zh) 智能服务方法、装置及设备
CN116486815A (zh) 车载语音信号处理方法及装置
CN115220922A (zh) 车辆应用程序运行方法、装置以及车辆
CN112988738B (zh) 用于区块链的数据分片方法和装置
US11501135B2 (en) Smart engine with dynamic profiles
US11893996B1 (en) Supplemental content output

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23879116

Country of ref document: EP

Kind code of ref document: A1