WO2024088085A1 - Voice interaction method, voice interaction device, vehicle and readable storage medium - Google Patents

Voice interaction method, voice interaction device, vehicle and readable storage medium

Info

Publication number: WO2024088085A1
Application number: PCT/CN2023/124567 (CN2023124567W)
Authority: WO (WIPO, PCT)
Prior art keywords: result, dialogue, local, dialogue result, type
Other languages: English (en), French (fr)
Inventors: 鲍鹏丽, 左佑
Original assignee (applicant): 广州小鹏汽车科技有限公司
Publication of WO2024088085A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60R: VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R16/00: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R16/02: Electric or fluid circuits specially adapted for vehicles, for electric constitutive elements
    • B60R16/037: Electric or fluid circuits specially adapted for vehicles, for occupant comfort, e.g. for automatic adjustment of appliances according to personal settings, e.g. seats, mirrors, steering wheel
    • B60R16/0373: Voice control
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/34: Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Definitions

  • the present application belongs to the field of vehicle-mounted voice interaction technology, and in particular, relates to a voice interaction method, a voice interaction device, a vehicle and a readable storage medium.
  • In-vehicle voice interaction usually includes two types of processing: local vehicle-side processing and cloud server processing.
  • Cloud server processing is highly dependent on the network; in underground garages and other no-network or weak-network environments, it is difficult for it to respond to users' voice requests in a timely and effective manner, and even with high network quality its response speed is inferior to local vehicle-side processing. Local vehicle-side processing, constrained by the limited computing power of the vehicle, yields lower-quality results and supports a narrower range of business. At present, each of the two processing methods is deficient in at least one of quality and response speed, which affects the user experience and leaves room for improvement.
  • the present application aims to solve at least one of the technical problems existing in the prior art. To this end, the present application proposes a voice interaction method, a voice interaction device, a vehicle, a readable storage medium and a computer program product, which can significantly enhance the response sensitivity of voice interaction while ensuring accuracy.
  • the present application provides a voice interaction method, the method comprising: obtaining a dialogue result; updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode in which the vehicle computer is located; wherein the dialogue result comprises a first category of dialogue results, a second category of dialogue results and a third category of dialogue results, the first category being determined by local text recognition and semantic understanding, the second category being determined by cloud-based text recognition and local semantic understanding, and the third category being determined by cloud-based text recognition and semantic understanding; obtaining an arbitration result according to the local dialogue result and the waiting time after receiving a user voice request; and performing voice interaction according to the arbitration result.
  • by fusing LLResult, CLResult and CCResult, the classified results are output in stages and the arbitration result can be output in advance. Combined with the waiting time after receiving the user's voice request, a finer-grained arbitration result can be obtained, which helps to significantly enhance the response sensitivity of voice interaction while ensuring accuracy, thereby achieving ultra-fast dialogue.
  • the local dialogue result is updated or the arbitration result is obtained according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode of the vehicle computer, including: when the dialogue result is a second-category dialogue result and the vehicle computer is not in the ultra-fast dialogue mode, the current dialogue result is used as the local dialogue result.
  • in this way, the current dialogue result (CLResult) is used as the local dialogue result and as the basis for arbitration at a subsequent waiting timeout, which can provide a higher-quality voice interaction result.
  • the updating of the local dialogue result or obtaining of the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode of the vehicle computer includes: when the dialogue result is a second-category dialogue result, the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is the directly executable level, the current dialogue result is used as the arbitration result; when the dialogue result is a second-category dialogue result, the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is not the directly executable level, the current dialogue result is used as the local dialogue result.
  • the updating of the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode of the vehicle computer includes: when the dialogue result is a first-category dialogue result and the vehicle computer is not connected to the network, obtaining the arbitration result according to the local priority level of the dialogue result; when the dialogue result is a first-category dialogue result, the vehicle computer is connected to the network, it is determined that the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is the directly executable level, the current dialogue result is used as the local dialogue result; when the dialogue result is a first-category dialogue result, the vehicle computer is connected to the network, it is determined that the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is not the directly executable level, the current dialogue result is used as the local dialogue result; when the dialogue result is a first-category dialogue result, the vehicle computer is connected to the network, and the vehicle computer is not in the ultra-fast dialogue mode, the current dialogue result is used as the local dialogue result.
  • the updating of the local dialogue result or obtaining of the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode of the vehicle computer includes: when the dialogue result is a third-category dialogue result, taking the current dialogue result as the local dialogue result. In this way, a high-quality voice interaction result can be obtained, and the judgment logic is simple.
  • the method before obtaining the dialogue result, further includes: receiving a user voice request in the vehicle cockpit; sending the user voice request to the server so that the server performs text recognition on the user voice request to obtain cloud recognition text, and the server performs semantic understanding on the cloud recognition text to obtain a third type of dialogue result; performing text recognition on the user voice request to obtain local recognition text, and performing semantic understanding on the local recognition text to obtain a first type of dialogue result; in the case of receiving the cloud recognition text sent by the server, performing semantic understanding on the cloud recognition text to obtain a second type of dialogue result; in the case of receiving the third type of dialogue result sent by the server, obtaining a third type of dialogue result.
  • the arbitration result is obtained according to the local dialogue result and the waiting time after receiving the user voice request, including: when the waiting time exceeds the first duration and is less than the second duration, and it is determined that there is currently a local dialogue result whose local priority level is the directly executable level or the timeout executable level, the current local dialogue result is used as the arbitration result.
  • the arbitration result is obtained based on the local conversation result and the waiting time after receiving the user voice request, including: when the waiting time exceeds the second time, it is determined that there is currently a local conversation result, and the local priority level of the local conversation result is a rejection level, a first arbitration result is obtained, and the first arbitration result has no voice broadcast information.
  • the arbitration result is obtained based on the local conversation result and the waiting time after receiving the user voice request, including: when the waiting time exceeds a second time, it is determined that there is currently a local conversation result, and the local priority level of the local conversation result is an unsupported level or a reserved field level, a second arbitration result is obtained, and the second arbitration result includes voice broadcast information for indicating network abnormalities.
  • the arbitration result is obtained based on the local dialogue result and the waiting time after receiving the user voice request, including: when the waiting time exceeds the second time and it is determined that there is no local dialogue result at present, a third arbitration result is obtained, and the third arbitration result includes voice broadcast information for indicating network abnormalities.
  • the present application provides a voice interaction device, which includes: a first acquisition module, used to obtain a dialogue result; a first processing module, used to update the local dialogue result or obtain an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode of the vehicle computer; wherein the dialogue result includes a first type of dialogue result, a second type of dialogue result and a third type of dialogue result, the first type being determined by local text recognition and semantic understanding, the second type being determined by cloud-based text recognition and local semantic understanding, and the third type being determined by cloud-based text recognition and semantic understanding; a second processing module, used to obtain an arbitration result according to the local dialogue result and the waiting time after receiving the user's voice request; and a third processing module, used to perform voice interaction according to the arbitration result.
  • the voice interaction device of the present application by fusing LLResult, CLResult and CCResult, the classification results are output in stages and the arbitration results are output in advance. Combined with the waiting time after receiving the user's voice request, a finer-grained arbitration result can be obtained, which helps to significantly enhance the response sensitivity of voice interaction while ensuring accuracy, and achieve a faster experience while ensuring accuracy, thereby achieving ultra-fast conversation.
  • the present application provides a vehicle comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice interaction method as described in the first aspect above when executing the computer program.
  • the present application provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the voice interaction method as described in the first aspect above.
  • the present application provides a chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the voice interaction method as described in the first aspect.
  • the present application provides a computer program product, including a computer program, which, when executed by a processor, implements the voice interaction method as described in the first aspect above.
  • FIG. 1 is a first flow chart of the voice interaction method provided by the present application.
  • FIG. 2 is a second flow chart of the voice interaction method provided by the present application.
  • FIG. 3 is a schematic diagram of the structure of the voice interaction device provided by the present application.
  • FIG. 4 is a schematic diagram of the structure of the vehicle provided by the present application.
  • the terms "first", "second", etc. in the specification and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here; the objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited.
  • the first object can be one or more.
  • "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
  • the voice interaction method may be applied to a terminal, and may be specifically executed by hardware or software in the terminal.
  • the terminal may be a vehicle computer, and the terminal may be a device including a microphone, a touch panel or other physical user interfaces.
  • the voice interaction method provided by the present application can be performed by a vehicle computer or a functional module or functional entity in the vehicle computer that can implement the voice interaction method.
  • for a vehicle computer, network conditions are complex: for example, while the vehicle is driving, the network status changes dynamically as the location changes, so the complexity of in-vehicle voice interaction is much higher than that of voice interaction in a home environment.
  • the voice interaction method includes: step 110 , step 120 , step 130 and step 140 .
  • Step 110: Obtain the dialogue result.
  • the dialogue result is the output of the local vehicle computer or the cloud server after performing text recognition (ASR, Automatic Speech Recognition) and semantic understanding (NLU, Natural Language Understanding) on the user's voice request.
  • the dialogue result is used to arbitrate with other dialogue results in subsequent steps to determine the final arbitration result to be output.
  • the arbitration result can be one of the dialogue results obtained earlier.
  • in the related art, the in-vehicle voice interaction system usually chooses one of the two processing methods (local or cloud), or takes both into account.
  • the voice interaction method of the present application designs three processing routes, and correspondingly, the obtained dialogue results include the first type of dialogue results, the second type of dialogue results and the third type of dialogue results:
  • the first type of dialogue results are determined by local text recognition and semantic understanding. This type of dialogue result is referred to as LLResult (Local ASR & Local NLU).
  • the second type of dialogue results is determined by text recognition in the cloud and semantic understanding locally. This type of dialogue result is referred to as CLResult (Cloud ASR & Local NLU).
  • the third type of dialogue results is determined by text recognition and semantic understanding both performed in the cloud. This type of dialogue result is referred to as CCResult (Cloud ASR & Cloud NLU).
  • the user's voice request can be processed through the above three processing routes, and one or more dialogue results can be obtained according to the network conditions.
  • Step 120: Update the local dialogue result or obtain the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result and the interaction mode of the vehicle computer.
  • depending on the executing entities, the resulting dialogue result will contain an identifier indicating which entity executed each stage.
  • identify whether the ASR of the dialogue result obtained in step 110 was executed locally or in the cloud, and whether its NLU was executed locally or in the cloud, and thereby determine the type of the dialogue result, that is, whether it is a first-class dialogue result (LLResult), a second-class dialogue result (CLResult), or a third-class dialogue result (CCResult).
  • the confidence ranking of the first type of dialogue results (LLResult), the second type of dialogue results (CLResult), and the third type of dialogue results (CCResult) is as follows: CCResult>CLResult>LLResult.
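The three result types and their confidence ordering described above can be sketched as follows (the class, field and function names are illustrative, not from the patent):

```python
from enum import IntEnum

class ResultType(IntEnum):
    """Illustrative encoding of the three dialogue-result types; the
    integer order mirrors the ranking CCResult > CLResult > LLResult."""
    LL_RESULT = 1  # Local ASR & Local NLU: fastest, network-independent
    CL_RESULT = 2  # Cloud ASR & Local NLU
    CC_RESULT = 3  # Cloud ASR & Cloud NLU: slowest, highest quality

def highest_confidence(results):
    """Pick the dialogue result whose type has the highest confidence.

    Note this is only the confidence ordering; the actual arbitration
    in steps 120 and 130 also weighs priority, mode and waiting time.
    """
    return max(results, key=lambda r: r["type"])
```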
  • the local priority of the dialogue results can also be determined.
  • the local priority can be obtained by model prediction based on features such as domain or confidence.
  • the local priority levels predicted by the first type of dialogue result (LLResult) and the second type of dialogue result (CLResult) are not necessarily the same.
  • the local priority levels of the dialogue results are divided into the following five levels, as shown in Table 1: level 1, directly executable; level 2, timeout executable; level 3, unsupported; level 4, reserved field; level 5, rejection.
  • the local priority levels may be divided into more or fewer levels according to actual needs.
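A minimal encoding of the five levels, with the number-to-name pairing inferred from how the text later cites Table 1 (an assumption, since the table itself is not reproduced on this page):

```python
from enum import IntEnum

class LocalPriority(IntEnum):
    """Five local priority levels referenced against Table 1."""
    DIRECTLY_EXECUTABLE = 1  # execute immediately
    TIMEOUT_EXECUTABLE = 2   # execute once the waiting time runs out
    UNSUPPORTED = 3          # business not supported
    RESERVED_FIELD = 4       # reserved for future use
    REJECTION = 5            # reject, no voice broadcast
```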
  • the vehicle computer has multiple interaction modes, such as the ultra-fast dialogue mode and the non-ultra-fast dialogue mode.
  • when the vehicle computer is in the ultra-fast dialogue mode, it indicates that the user needs a faster response from the vehicle computer; turning on the ultra-fast dialogue mode means that the user trusts the capabilities of the local algorithms.
  • users can switch the interaction mode of the vehicle computer through voice control or through touch operation on the touch display.
  • for example, when the "Ultra-fast Dialogue" control is half lit or off and the user's input is received, where the input may be an operation of clicking the "Ultra-fast Dialogue" control, the vehicle computer switches to the ultra-fast dialogue mode; when the "Ultra-fast Dialogue" control is lit and such an input is received, the vehicle computer switches to the non-ultra-fast dialogue mode.
  • the dialogue result obtained in step 110 is used to update the local dialogue result or obtain the arbitration result.
  • the local dialogue results are used for subsequent arbitration to obtain the arbitration result.
  • in step 120, since the above three factors are comprehensively considered, a faster response can be given according to user needs while ensuring the quality of interaction.
  • Step 130: Obtain an arbitration result according to the local dialogue result and the waiting time after receiving the user voice request.
  • in general, a target time is preset in advance to ensure that the vehicle computer responds once the maximum waiting time is exceeded.
  • generally, the first type of dialogue result (LLResult) can be obtained within the target time; that is, the local dialogue result is at least the first type of dialogue result (LLResult).
  • the local dialogue result may be updated to the second type of dialogue result (CLResult) or the third type of dialogue result (CCResult).
  • the arbitration result is obtained based on the current local conversation result and the waiting time after receiving the user's voice request.
  • in this way, the dialogue result with the highest confidence (quality) that can be obtained within the allowed waiting time is used as the arbitration result.
  • Step 140: Perform voice interaction according to the arbitration result.
  • the arbitration result obtained in step 130 is the dialogue result with the highest quality currently obtained within the allowed waiting time, and the voice interaction is performed according to the dialogue result.
  • Performing voice interaction can take many forms:
  • for example, if the user's voice request is "open the sunroof", performing the voice interaction may include opening the sunroof.
  • performing the voice interaction may also include only a voice broadcast, for example announcing "It is 30 minutes away from the destination."
  • performing the voice interaction may also combine both, for example opening the sunroof and announcing "the sunroof is open".
  • in this way, by fusing LLResult, CLResult and CCResult, the classified results are output in stages and the arbitration result can be output in advance. Combined with the waiting time after receiving the user's voice request, a finer-grained arbitration result can be obtained, which helps to significantly enhance the response sensitivity of voice interaction while ensuring accuracy, thereby achieving ultra-fast dialogue.
  • the voice interaction method may further include:
  • when the third type of dialogue result sent by the server is received, the third type of dialogue result is obtained.
  • a microphone or other sound pickup is provided in the vehicle cabin to obtain user voice requests in the vehicle cabin. The user voice requests may come from various sound zones in the cabin, including but not limited to the driver's seat sound zone, the front passenger seat sound zone, the second-row left sound zone behind the driver's seat, the second-row middle sound zone, and the second-row right sound zone behind the front passenger seat; some vehicles have more rows.
  • after receiving the user voice request, the vehicle computer keeps the request locally and sends it to the central control SDK for text recognition, and also sends it to the cloud server so that the server performs text recognition on the user voice request.
  • the locally recognized text will be transmitted to the local dialogue system for semantic understanding to obtain the first type of dialogue results.
  • the first type of dialogue results have the fastest response speed and do not rely on the network.
  • the cloud-recognized text will also be transmitted to the local dialogue system for semantic understanding to obtain the second type of dialogue results.
  • the response speed of the second type of dialogue results is slower than that of the first type of dialogue results.
  • because the basis of its semantic understanding is the cloud-recognized text from the cloud ASR, the quality of the second type of dialogue results is higher than that of the first type of dialogue results.
  • the cloud-recognized text will also be transmitted to the cloud-based dialogue system for semantic understanding to obtain the third type of dialogue results.
  • the response speed of the third type of dialogue results is slower than that of the second type of dialogue results.
  • since the basis of its semantic understanding is the cloud-recognized text of the cloud ASR, and the semantic understanding is also completed by the cloud-based dialogue system, the quality of the third type of dialogue results is higher than that of the second type of dialogue results.
  • in a weak-network scenario, the voice interaction method can still obtain the second type of dialogue result (CLResult); in the related art, in this scenario, only the first type of dialogue result (LLResult) can be obtained, or the third type of dialogue result (CCResult) can be obtained only after waiting a long time for the network to recover.
  • three-way parallel processing can cope with various network conditions and provide faster response while ensuring accuracy.
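The three-way parallel processing described above can be sketched with stub engines standing in for the real local and cloud ASR/NLU services (all function names and latencies are illustrative assumptions):

```python
import asyncio

# Stub engines; a real system would call the on-board SDK and the cloud service.
async def local_asr(audio):
    await asyncio.sleep(0.01)   # fastest, network-independent
    return "local text"

async def cloud_asr(audio):
    await asyncio.sleep(0.02)   # requires the network
    return "cloud text"

async def local_nlu(text):
    await asyncio.sleep(0.01)
    return {"nlu": "local", "text": text}

async def cloud_nlu(text):
    await asyncio.sleep(0.02)
    return {"nlu": "cloud", "text": text}

async def three_route_pipeline(audio):
    """Run the three processing routes concurrently.

    Route 1 (LLResult): local ASR -> local NLU
    Route 2 (CLResult): cloud ASR -> local NLU
    Route 3 (CCResult): cloud ASR -> cloud NLU
    Cloud ASR runs once; its text feeds both routes 2 and 3.
    """
    cloud_text_task = asyncio.create_task(cloud_asr(audio))

    async def route_ll():
        return ("LLResult", await local_nlu(await local_asr(audio)))

    async def route_cl():
        return ("CLResult", await local_nlu(await cloud_text_task))

    async def route_cc():
        return ("CCResult", await cloud_nlu(await cloud_text_task))

    # In the real method, results arrive in stages and feed the arbitration
    # as they come; here we simply gather all three for illustration.
    return await asyncio.gather(route_ll(), route_cl(), route_cc())
```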
  • step 120 updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the vehicle computer, includes:
  • when the dialogue result is a first-class dialogue result and the vehicle computer is not connected to the network, the arbitration result is obtained according to the local priority level of the dialogue result;
  • when the dialogue result is a first-class dialogue result, the vehicle computer is connected to the network, it is determined that the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is the directly executable level (level 1 in Table 1), the current dialogue result is used as the local dialogue result;
  • when the dialogue result is a first-class dialogue result, the vehicle computer is connected to the network, it is determined that the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is not the directly executable level (level 2, 3, 4 or 5 in Table 1), the current dialogue result is used as the local dialogue result;
  • when the dialogue result is a first-class dialogue result, the vehicle computer is connected to the network, and the vehicle computer is not in the ultra-fast dialogue mode, the current dialogue result is used as the local dialogue result.
  • the first type of dialogue result (LLResult) is returned fastest, for example in 100+ ms.
  • therefore, the first prerequisite checked for arbitration is whether the vehicle computer is connected to the network.
  • step 120 updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the vehicle computer, includes:
  • when the dialogue result is a second-class dialogue result and the vehicle computer is not in the ultra-fast dialogue mode, the current dialogue result is used as the local dialogue result.
  • in this case, the currently received dialogue result is the second type of dialogue result (CLResult), and generally the first type of dialogue result (LLResult) has been received earlier and saved locally as the local dialogue result. If the vehicle computer is not in the ultra-fast dialogue mode, it means that the user's requirement for reply quality is higher than the requirement for response speed.
  • the current conversation result (CLResult) is used as the local conversation result and as the basis for arbitration of subsequent waiting timeouts. In this way, higher quality voice interaction results can be provided according to user needs.
  • step 120 updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the vehicle computer, includes:
  • when the dialogue result is a second-class dialogue result, the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is the directly executable level (level 1 in Table 1), the current dialogue result is used as the arbitration result;
  • when the dialogue result is a second-class dialogue result, the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is not the directly executable level (level 2, 3, 4 or 5 in Table 1), the current dialogue result is used as the local dialogue result.
  • in this case, the currently received dialogue result is the second type of dialogue result (CLResult), which is usually returned in 200+ ms, and generally the first type of dialogue result (LLResult) has been received earlier and saved locally as the local dialogue result.
  • if the vehicle computer is in the ultra-fast dialogue mode, it means that the user has a high requirement for response speed, and it is necessary to determine, according to the local priority level of the dialogue result, whether to directly obtain the arbitration result or to update the local dialogue result.
  • when the local priority level of the dialogue result is the directly executable level (level 1 in Table 1), the current dialogue result is used as the arbitration result and direct preemption is performed; when the local priority level of the dialogue result is not the directly executable level (level 2, 3, 4 or 5 in Table 1), the current dialogue result (CLResult) is used as the local dialogue result and as the basis for arbitration at a subsequent waiting timeout.
  • the local priority level of LLResult and the local priority level of CLResult are not necessarily the same. For example, if the local priority level of LLResult is 2 and the local priority level of CLResult is 1, preemption will also occur.
  • step 120 updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the vehicle computer, includes:
  • when the dialogue result is a third-category dialogue result, the current dialogue result is used as the local dialogue result.
  • since the third-category dialogue result (CCResult) has the highest confidence, the dialogue result can be directly returned to end the arbitration. In this way, a high-quality voice interaction result can be obtained, and the judgment logic is simple.
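The update-or-arbitrate rules of step 120 across the three result types can be sketched as follows (function and label names are illustrative; this is a simplification of the text's rules, not the patent's exact implementation):

```python
DIRECTLY_EXECUTABLE = 1  # level 1 in Table 1

def update_or_arbitrate(result_type, priority, networked, ultra_fast):
    """Decide whether a newly arrived dialogue result preempts
    (ending arbitration) or only updates the saved local result.

    result_type is "LLResult", "CLResult" or "CCResult";
    priority is the local priority level (1-5).
    """
    if result_type == "LLResult":
        if not networked:
            # Offline: arbitrate immediately according to the local priority.
            return ("arbitrate_by_priority", priority)
        # Online: keep LLResult as the local result and wait for better ones.
        return ("update_local", priority)
    if result_type == "CLResult":
        if ultra_fast and priority == DIRECTLY_EXECUTABLE:
            # Ultra-fast mode + directly executable: direct preemption.
            return ("arbitrate", priority)
        # Otherwise CLResult replaces the local result for later timeouts.
        return ("update_local", priority)
    # CCResult: highest confidence, saved as the local result.
    return ("update_local", priority)
```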
  • step 130 obtaining an arbitration result according to the local conversation result and the waiting time after receiving the user voice request, includes:
  • when the waiting time exceeds the first duration and is less than the second duration, and it is determined that there is currently a local dialogue result whose local priority level is the directly executable level or the timeout executable level (level 1 or level 2 in Table 1), the current local dialogue result is used as the arbitration result.
  • the cloud has not yet fed back the third type of conversation result (CCResult), if there is a local conversation result (the local conversation result is the LLResult or CLResult saved during the previous arbitration), the current local conversation result can be returned as the result to end the arbitration, so that users can experience faster while ensuring accuracy.
  • CCResult: the third type of conversation result (determined by cloud text recognition and semantic understanding)
  • the first duration and the second duration may be preset; for example, the first duration may be 2.5s-3.5s and the second duration may be 4.5s-5.5s, e.g. the first duration may be 3s and the second duration 5s.
  • the first duration and the second duration may be factory set, or may be adjusted according to user input.
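As a rough sketch of how these two configurable durations might be represented, the snippet below encodes the example defaults quoted above and classifies a waiting time into the stages the arbitration logic distinguishes. The class and function names are illustrative assumptions of this sketch, not part of the patent:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArbitrationTimeouts:
    """Waiting-time thresholds used by the arbitration logic.

    first_duration: local fallback wait (the text suggests 2.5s-3.5s, e.g. 3s).
    second_duration: bottom-line cloud wait (4.5s-5.5s, e.g. 5s).
    """
    first_duration: float = 3.0
    second_duration: float = 5.0

    def __post_init__(self) -> None:
        # The bottom-line wait must be longer than the local fallback wait.
        if not self.first_duration < self.second_duration:
            raise ValueError("second_duration must exceed first_duration")


def stage(waited: float, t: ArbitrationTimeouts) -> str:
    """Classify a waiting time into the stages described in the text."""
    if waited < t.first_duration:
        return "waiting"            # still waiting for cloud results
    if waited < t.second_duration:
        return "local-fallback"     # level 1/2 local results may be returned
    return "bottom-line"            # arbitration is forcibly ended
```

An adjustable configuration (e.g. from user input) would simply construct a new `ArbitrationTimeouts` with different values; the validation rejects inconsistent settings.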
  • step 130 obtaining an arbitration result according to the local conversation result and the waiting time after receiving the user voice request, includes:
  • when the waiting time exceeds the second time period, and it is determined that a local dialogue result currently exists whose local priority level is the rejection level (level 5 in Table 1), a first arbitration result is obtained, and the first arbitration result has no voice broadcast information.
  • step 130 obtaining an arbitration result according to the local conversation result and the waiting time after receiving the user voice request, includes:
  • when the waiting time exceeds the second time period, and it is determined that a local dialogue result currently exists whose local priority level is an unsupported level or a reserved field level (level 3 or level 4 in Table 1), a second arbitration result is obtained, and the second arbitration result includes voice broadcast information for indicating a network abnormality.
  • the second time period is a bottom-line waiting time; that is, when the waiting time reaches the second time period, the arbitration is terminated according to the local priority level of the local dialogue result.
  • step 130 obtaining an arbitration result according to the local conversation result and the waiting time after receiving the user voice request, includes:
  • when the waiting time exceeds the second time period and it is determined that no local dialogue result currently exists, a third arbitration result is obtained, and the third arbitration result includes voice broadcast information for indicating a network abnormality.
  • the voice interaction method comprises the following steps:
  • VadEnd: Voice Activity Detection End
  • monitor ASRResult and determine whether the recognized text is empty, to check whether the audio is valid speech (to filter out accidental sounds such as knocking, "dong dong");
  • when the result is LLResult, the prerequisite for arbitration is whether the vehicle is connected to the Internet
  • the local result is considered to be of low confidence and it is necessary to wait for a better conversation result. In this case, the local conversation result is saved as the basis for arbitration of subsequent waiting timeout.
  • the local result is considered to have low confidence and it is necessary to wait for a better conversation result. In this case, the local conversation result is saved as the basis for arbitration of the subsequent waiting timeout.
  • when the local conversation is graded 1 or 2 (level 2 if ultra-fast dialogue is turned on, level 1 or 2 if it is turned off), the timer is canceled, the result is returned, and the arbitration is ended, so that the user gets a faster experience while accuracy is ensured;
  • when the local conversation level is 5, a default conversation result without a TTS response is generated, and the arbitration ends;
  • by fusing the local device and the cloud, the advantages of each are leveraged and the disadvantages avoided, so that high-quality service results can be output at any time and in any state, and voice responses can be provided as quickly and accurately as possible.
  • the cloud-ASR/local-NLU output results (CLResult) are added on top of the traditional dialogue results, giving finer granularity.
  • classification results can be output in stages according to algorithmic decisions based on confidence and domain classification, and arbitration results can be output in advance, so that users get a faster experience while accuracy is ensured.
  • the voice interaction method provided in the present application can be executed by a voice interaction device.
  • a voice interaction device executing the voice interaction method is taken as an example to illustrate the voice interaction device provided in the present application.
  • the present application also provides a voice interaction device.
  • the voice interaction device includes: a first acquisition module 310 , a first processing module 320 , a second processing module 330 and a third processing module 340 .
  • a first acquisition module 310 used to acquire a conversation result
  • the first processing module 320 is used to update the local dialogue result or obtain the arbitration result according to the type of the dialogue result, the local priority level of the dialogue result and at least part of the interaction mode of the vehicle computer; wherein the dialogue result includes a first type of dialogue result, a second type of dialogue result and a third type of dialogue result, the first type of dialogue result is determined by local text recognition and semantic understanding, the second type of dialogue result is determined by cloud text recognition and local semantic understanding, and the third type of dialogue result is determined by cloud text recognition and semantic understanding;
  • the second processing module 330 is used to obtain an arbitration result according to the local dialogue result and the waiting time after receiving the user voice request;
  • the third processing module 340 is used to perform voice interaction according to the arbitration result.
  • the voice interaction device by fusing LLResult, CLResult and CCResult, the classification results are output in stages and the arbitration results are output in advance. Combined with the waiting time after receiving the user's voice request, a finer-grained arbitration result can be obtained, which helps to significantly enhance the response sensitivity of voice interaction while ensuring accuracy, and achieve a faster experience while ensuring accuracy, thereby achieving ultra-fast conversation.
  • the first processing module 320 is further configured to use the current dialogue result as a local dialogue result when the dialogue result is a second-type dialogue result and the vehicle computer is not in the ultra-fast dialogue mode.
  • the first processing module 320 is also used to use the current dialogue result as the arbitration result when the dialogue result is a second-type dialogue result, the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level; and to use the current dialogue result as the local dialogue result when the dialogue result is a second-type dialogue result, the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level.
  • the first processing module 320 is also used to obtain an arbitration result based on the local priority level of the dialogue result when the dialogue result is a first-type dialogue result and the vehicle computer is not connected to the network; when the dialogue result is a first-type dialogue result, the vehicle computer is connected to the network, it is determined that the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level, the current dialogue result is used as the local dialogue result; when the dialogue result is a first-type dialogue result, the vehicle computer is connected to the network, it is determined that the vehicle computer is in the ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level, the current dialogue result is used as the local dialogue result; when the dialogue result is a first-type dialogue result, the vehicle computer is connected to the network, and the vehicle computer is not in the ultra-fast dialogue mode, the current dialogue result is used as the local dialogue result.
  • the first processing module 320 is further configured to use the current dialogue result as a local dialogue result when the dialogue result is a third type of dialogue result.
  • the voice interaction device may further include:
  • a receiving module used for receiving a user voice request in the vehicle cockpit before obtaining a dialogue result
  • a sending module used for sending a user voice request to a server, so that the server can perform text recognition on the user voice request to obtain cloud-recognized text, and the server can perform semantic understanding on the cloud-recognized text to obtain a third type of dialogue result;
  • a text recognition module is used to perform text recognition on user voice requests to obtain local recognition text, perform semantic understanding on the local recognition text, and obtain a first-category dialogue result;
  • the receiving module is further used to perform semantic understanding on the cloud-recognized text when receiving the cloud-recognized text sent by the server to obtain a second type of dialogue result;
  • the receiving module is further used to obtain the third type of dialogue result when the third type of dialogue result sent by the server is received.
  • the second processing module 330 is also used to use the current local conversation result as the arbitration result when the waiting time exceeds the first time length and is less than the second time length, and it is determined that there is currently a local conversation result and the local priority level of the local conversation result is a directly executable level or a timeout executable level.
  • the second processing module 330 is also used to obtain a first arbitration result without voice broadcast information when the waiting time exceeds the second time and it is determined that there is currently a local dialogue result and the local priority level of the local dialogue result is a rejection level.
  • the second processing module 330 is also used to obtain a second arbitration result when the waiting time exceeds the second time period and it is determined that there is currently a local conversation result and the local priority level of the local conversation result is an unsupported level or a reserved field level.
  • the second arbitration result includes voice broadcast information for indicating a network abnormality.
  • the second processing module 330 is further used to obtain a third arbitration result when the waiting time exceeds the second time and it is determined that there is no local dialogue result at present, and the third arbitration result includes voice broadcast information for indicating a network abnormality.
  • the voice interaction device in the present application may be an electronic device or a component in an electronic device, such as an integrated circuit or a chip.
  • the electronic device may be a terminal or other device other than a terminal.
  • the electronic device may be a vehicle or a head unit on a vehicle, etc., which is not specifically limited in the present application.
  • the voice interaction device in the present application may be a device having an operating system.
  • the operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the present application.
  • the voice interaction device provided in the present application can implement each process of the method examples implemented in Figures 1 to 2. To avoid repetition, they will not be described here.
  • the present application also provides a vehicle 400, including a processor 401, a memory 402, and a computer program stored in the memory 402 and executable on the processor 401.
  • when the program is executed by the processor 401, each process of the above voice interaction method example is implemented, and the same technical effect can be achieved. To avoid repetition, it is not described here again.
  • the present application also provides a non-transitory computer-readable storage medium, on which a computer program is stored.
  • when the computer program is executed by a processor, each process of the above voice interaction method example is implemented, and the same technical effect can be achieved. To avoid repetition, it is not described here again.
  • the processor is the processor in the electronic device described in the above example.
  • the readable storage medium includes a computer readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk.
  • the present application also provides a computer program product, including a computer program, which implements the above-mentioned voice interaction method when executed by a processor.
  • the processor is the processor in the electronic device described in the above example.
  • the readable storage medium includes a computer readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk.
  • the present application also provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned voice interaction method example, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the chip mentioned in this application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.
  • the technical solution of the present application can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each example of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mechanical Engineering (AREA)
  • User Interface Of Digital Computer (AREA)
  • Navigation (AREA)

Abstract

This application discloses a voice interaction method, a voice interaction device, a vehicle and a readable storage medium. The voice interaction method includes: obtaining a dialogue result; updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit; wherein the dialogue result includes a first-type, a second-type and a third-type dialogue result, the first-type dialogue result being determined by local text recognition and semantic understanding, the second-type dialogue result by cloud text recognition and local semantic understanding, and the third-type dialogue result by cloud text recognition and semantic understanding; obtaining an arbitration result according to the local dialogue result and the waiting time after a user voice request is received; and performing voice interaction according to the arbitration result.

Description

Voice interaction method, voice interaction device, vehicle and readable storage medium
This application claims priority to Chinese patent application No. 202211332359.X, filed on October 28, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the technical field of in-vehicle voice interaction, and in particular relates to a voice interaction method, a voice interaction device, a vehicle and a readable storage medium.
Background
In-vehicle voice interaction is usually handled in one of two ways: processing on the local head unit, or processing on a cloud server. Cloud processing depends heavily on the network; in no-network or weak-network environments such as underground garages, it is difficult to respond to user voice requests promptly and effectively, and even with a high-quality network its response speed is slower than local processing. Because the computing power of the local head unit is limited, relying entirely on local processing yields lower-quality results and a smaller range of supported services. Each of the two approaches is therefore deficient in quality or response speed, which degrades the user experience and leaves room for improvement.
Technical Problem
This application aims to solve at least one of the technical problems in the prior art. To this end, this application proposes a voice interaction method, a voice interaction device, a vehicle, a readable storage medium and a computer program product, which can significantly improve the responsiveness of voice interaction while ensuring accuracy.
Technical Solution
In a first aspect, this application provides a voice interaction method, including: obtaining a dialogue result; updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit; wherein the dialogue result includes a first-type dialogue result, a second-type dialogue result and a third-type dialogue result, the first-type dialogue result is determined by local text recognition and semantic understanding, the second-type dialogue result is determined by cloud text recognition and local semantic understanding, and the third-type dialogue result is determined by cloud text recognition and semantic understanding; obtaining an arbitration result according to the local dialogue result and the waiting time after a user voice request is received; and performing voice interaction according to the arbitration result.
According to the voice interaction method of this application, by fusing LLResult, CLResult and CCResult, outputting classified results in stages and outputting arbitration results in advance, and combining this with the waiting time after the user voice request is received, a finer-grained arbitration result can be obtained. This helps to significantly improve the responsiveness of voice interaction while ensuring accuracy, achieving a faster experience and thus ultra-fast dialogue.
According to the voice interaction method of this application, updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit includes: when the dialogue result is a second-type dialogue result and the head unit is not in ultra-fast dialogue mode, using the current dialogue result as the local dialogue result. When the user values reply quality over response speed, using the current dialogue result (CLResult) as the local dialogue result, to serve as the basis for arbitration on a subsequent waiting timeout, provides a higher-quality voice interaction result.
According to the voice interaction method of this application, updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit includes: when the dialogue result is a second-type dialogue result, the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level, using the current dialogue result as the arbitration result; when the dialogue result is a second-type dialogue result, the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level, using the current dialogue result as the local dialogue result. In this way, when the user demands a faster response, whether to obtain the arbitration result directly or to update the local dialogue result is determined from the local priority level, enabling preemption in some cases and improving responsiveness.
According to the voice interaction method of this application, updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit includes: when the dialogue result is a first-type dialogue result and the head unit is not connected to the network, obtaining the arbitration result according to the local priority level of the dialogue result; when the dialogue result is a first-type dialogue result, the head unit is connected to the network, it is determined that the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level, using the current dialogue result as the local dialogue result; when the dialogue result is a first-type dialogue result, the head unit is connected to the network, it is determined that the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level, using the current dialogue result as the local dialogue result; when the dialogue result is a first-type dialogue result, the head unit is connected to the network, and the head unit is not in ultra-fast dialogue mode, using the current dialogue result as the local dialogue result. When the first-type dialogue result (LLResult) is obtained, an extremely fast response can be achieved in some cases.
According to the voice interaction method of this application, updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit includes: when the dialogue result is a third-type dialogue result, using the current dialogue result as the local dialogue result. In this way, a high-quality voice interaction result can be obtained, and the judgment logic is simple.
According to the voice interaction method of this application, before the dialogue result is obtained, the method further includes: receiving a user voice request from the vehicle cockpit; sending the user voice request to a server, so that the server performs text recognition on the user voice request to obtain cloud-recognized text and performs semantic understanding on the cloud-recognized text to obtain a third-type dialogue result; performing text recognition on the user voice request to obtain locally recognized text, and performing semantic understanding on the locally recognized text to obtain a first-type dialogue result; when the cloud-recognized text sent by the server is received, performing semantic understanding on the cloud-recognized text to obtain a second-type dialogue result; and when the third-type dialogue result sent by the server is received, obtaining the third-type dialogue result. Through this three-route parallel processing, various network conditions can be handled, and a faster response is provided while accuracy is ensured.
According to the voice interaction method of this application, obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received includes: when the waiting time exceeds a first duration and is less than a second duration, and it is determined that a local dialogue result currently exists whose local priority level is a directly executable level or a timeout-executable level, using the current local dialogue result as the arbitration result. In this way, for dialogue results of these levels, only the relatively shorter first duration is waited, which maintains faster responsiveness.
According to the voice interaction method of this application, obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received includes: when the waiting time exceeds the second duration, and it is determined that a local dialogue result currently exists whose local priority level is the rejection level, obtaining a first arbitration result, the first arbitration result having no voice broadcast information.
According to the voice interaction method of this application, obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received includes: when the waiting time exceeds the second duration, and it is determined that a local dialogue result currently exists whose local priority level is an unsupported level or a reserved-field level, obtaining a second arbitration result, the second arbitration result including voice broadcast information for indicating a network abnormality.
According to the voice interaction method of this application, obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received includes: when the waiting time exceeds the second duration and it is determined that no local dialogue result currently exists, obtaining a third arbitration result, the third arbitration result including voice broadcast information for indicating a network abnormality.
In a second aspect, this application provides a voice interaction device, including: a first acquisition module, used to obtain a dialogue result; a first processing module, used to update a local dialogue result or obtain an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit; wherein the dialogue result includes a first-type dialogue result, a second-type dialogue result and a third-type dialogue result, the first-type dialogue result is determined by local text recognition and semantic understanding, the second-type dialogue result is determined by cloud text recognition and local semantic understanding, and the third-type dialogue result is determined by cloud text recognition and semantic understanding; a second processing module, used to obtain an arbitration result according to the local dialogue result and the waiting time after a user voice request is received; and a third processing module, used to perform voice interaction according to the arbitration result.
According to the voice interaction device of this application, by fusing LLResult, CLResult and CCResult, outputting classified results in stages and outputting arbitration results in advance, and combining this with the waiting time after the user voice request is received, a finer-grained arbitration result can be obtained, which helps to significantly improve the responsiveness of voice interaction while ensuring accuracy, achieving a faster experience and thus ultra-fast dialogue.
In a third aspect, this application provides a vehicle, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice interaction method of the first aspect when executing the computer program.
In a fourth aspect, this application provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the voice interaction method of the first aspect.
In a fifth aspect, this application provides a chip, including a processor and a communication interface coupled to the processor, wherein the processor is used to run a program or instructions to implement the voice interaction method of the first aspect.
In a sixth aspect, this application provides a computer program product, including a computer program, wherein the computer program, when executed by a processor, implements the voice interaction method of the first aspect.
Additional aspects and advantages of this application will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of this application.
Brief Description of the Drawings
The above and/or additional aspects and advantages of this application will become apparent and easy to understand from the following description of embodiments in conjunction with the drawings, in which:
Figure 1 is a first schematic flowchart of the voice interaction method provided by this application;
Figure 2 is a second schematic flowchart of the voice interaction method provided by this application;
Figure 3 is a schematic structural diagram of the voice interaction device provided by this application;
Figure 4 is a schematic structural diagram of the vehicle provided by this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly below in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application fall within the scope of protection of this application.
The terms "first", "second" and the like in the description and claims of this application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of this application can be implemented in orders other than those illustrated or described here. Objects distinguished by "first", "second" and the like are usually of one class, and the number of objects is not limited; for example, there may be one or more first objects. In addition, "and/or" in the description and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The voice interaction method, voice interaction device, vehicle, electronic device, readable storage medium and computer program product provided by this application are described in detail below through specific implementations and application scenarios, in conjunction with the drawings.
The voice interaction method may be applied to a terminal, and may specifically be executed by hardware or software in the terminal.
The terminal may be a head unit, and may be a device that includes a sound pickup, a touch panel or another physical user interface.
The executing subject of the voice interaction method provided by this application may be a head unit, or a functional module or functional entity in the head unit capable of implementing the method. In the in-vehicle environment, network conditions are complex; for example, the network state changes dynamically as the vehicle moves between locations, so in-vehicle voice interaction is far more complex than voice interaction in a home environment.
As shown in Figure 1, the voice interaction method includes step 110, step 120, step 130 and step 140.
Step 110: obtain a dialogue result.
The dialogue result is the result output after the local head unit or the cloud server performs text recognition (ASR, Automatic Speech Recognition) and semantic understanding (NLU) on the user voice request.
The dialogue result is arbitrated against other dialogue results in subsequent steps to determine the final arbitration result to output; the arbitration result may be one of the previously obtained dialogue results.
In the related art, in-vehicle voice interaction systems usually choose one of the following two processing modes, or combine both:
(1) performing text recognition and semantic understanding locally to obtain a dialogue result, abbreviated LLResult (Local ASR & Local NLU);
(2) performing text recognition and semantic understanding in the cloud to obtain a dialogue result, abbreviated CCResult (Cloud ASR & Cloud NLU).
The voice interaction method of this application designs three processing routes; correspondingly, the obtained dialogue results include a first-type dialogue result, a second-type dialogue result and a third-type dialogue result:
(1) the first-type dialogue result is determined by local text recognition and semantic understanding, abbreviated LLResult (Local ASR & Local NLU);
(2) the second-type dialogue result is determined by cloud text recognition and local semantic understanding, abbreviated CLResult (Cloud ASR & Local NLU);
(3) the third-type dialogue result is determined by cloud text recognition and semantic understanding, abbreviated CCResult (Cloud ASR & Cloud NLU).
It can be understood that, in this voice interaction method, after text recognition is performed in the cloud, the recognized text continues to semantic understanding in the cloud, and the cloud-recognized text is also delivered to the head unit, which can perform semantic understanding on it to obtain the second-type dialogue result (CLResult). The accuracy and quality of CLResult are higher than those of LLResult, and the response speed of CLResult is faster than that of CCResult.
By fusing LLResult, CLResult and CCResult, and in particular by adding CLResult, the subsequent arbitration result becomes finer-grained, which helps to significantly improve the responsiveness of voice interaction while ensuring accuracy.
It can be understood that, when the head unit is working normally and the user has woken the voice interaction system, the user voice request can be processed through the above three routes, and one or more dialogue results are obtained depending on the network conditions.
Step 120: update the local dialogue result or obtain an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit.
When the user voice request is processed by the local head unit or the cloud server, the resulting dialogue result carries an identifier indicating the executing subject.
That is, for the dialogue result obtained in step 110, it can be determined whether its ASR was executed locally or in the cloud and whether its NLU was executed locally or in the cloud, and hence whether the dialogue result is a first-type dialogue result (LLResult), a second-type dialogue result (CLResult) or a third-type dialogue result (CCResult).
The confidence ordering of the three types is: CCResult > CLResult > LLResult.
By running model prediction on the dialogue result, its local priority level (Local Priority) can also be determined, for example from model predictions based on the domain, the confidence, and the like.
It should be noted that, for the same user voice request, the local priority levels predicted for the first-type dialogue result (LLResult) and the second-type dialogue result (CLResult) are not necessarily the same.
In some implementations, the local priority level of a dialogue result is divided into the following five levels, as shown in Table 1.
Table 1
Level  Description
1      Directly executable locally (may preempt)
2      The local result must wait for the cloud; adopted when the fusion times out (fallback)
3      The domain is not supported by the local dialogue system (e.g. weather queries); the cloud result must be waited for
4      Reserved field
5      Rejection: chit-chat that the local dialogue system does not support, e.g. "la la la"
Of course, the local priority level may also be divided into more or fewer levels as actually needed.
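The five levels of Table 1 can be captured in a small enum for illustration; the identifier names below are assumptions of this sketch, while the numeric values and their meanings come from Table 1:

```python
from enum import IntEnum


class LocalPriority(IntEnum):
    """Local priority levels from Table 1 (names are illustrative)."""
    DIRECTLY_EXECUTABLE = 1   # local result can be executed immediately (preemption)
    TIMEOUT_EXECUTABLE = 2    # wait for the cloud; adopt this result on fusion timeout
    UNSUPPORTED_DOMAIN = 3    # domain not supported locally (e.g. weather); must wait
    RESERVED = 4              # reserved field
    REJECTION = 5             # chit-chat the local dialogue rejects (e.g. "la la la")


def usable_on_first_timeout(level: LocalPriority) -> bool:
    # Per the text, only levels 1 and 2 may be returned when the
    # shorter (first) waiting duration expires.
    return level in (LocalPriority.DIRECTLY_EXECUTABLE,
                     LocalPriority.TIMEOUT_EXECUTABLE)
```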
The head unit has multiple interaction modes, for example an ultra-fast dialogue mode and a non-ultra-fast dialogue mode. When the head unit is currently in ultra-fast dialogue mode, the user wants the head unit to provide a faster response; turning on ultra-fast dialogue mode indicates that the user trusts the capability of the local algorithms.
In actual execution, the user can switch the interaction mode of the head unit by voice control or by touch operations on the touch display.
Taking mode selection via the touch display as an example: on the interface displaying the interaction modes, a user input is received, which may be a tap on the "ultra-fast dialogue" control; when the control lights up, the head unit switches to ultra-fast dialogue mode. When the control is lit and a user input tapping the control is received, the control dims or goes out, and the head unit switches to non-ultra-fast dialogue mode.
The dialogue result obtained in step 110 is used to update the local dialogue result or to obtain the arbitration result.
It can be understood that each time a dialogue result is obtained, it must be determined, according to at least part of the type of the dialogue result, its local priority level and the interaction mode of the head unit, whether to update the local dialogue result with the newly obtained dialogue result or to obtain the arbitration result directly from the currently obtained dialogue result.
The local dialogue result is used for subsequent arbitration to obtain the arbitration result.
Because step 120 takes the above three factors into account, a faster response can be given according to user needs while interaction quality is ensured.
Step 130: obtain an arbitration result according to the local dialogue result and the waiting time after the user voice request is received.
It can be understood that, to ensure that the response waiting time of voice interaction does not degrade the user experience, target durations are preset so that the head unit replies once the maximum waiting time is exceeded.
When the head unit is working normally, the first-type dialogue result (LLResult) can always be obtained within the target duration, so the local dialogue result is at least LLResult; when the network is in good condition, the local dialogue result may be updated to the second-type dialogue result (CLResult) or the third-type dialogue result (CCResult).
In this step, the arbitration result is obtained from the current local dialogue result combined with the waiting time after the user voice request is received, so that the dialogue result with the highest confidence (quality) obtainable within the allowed waiting time is used as the arbitration result.
Step 140: perform voice interaction according to the arbitration result.
The arbitration result obtained in step 130 is the highest-quality dialogue result obtained within the allowed waiting time, and voice interaction is performed according to it.
Performing voice interaction can take several forms:
First, executing the control instruction corresponding to the voice interaction.
For example, if the user voice request is "open the sunroof", performing voice interaction may include opening the sunroof.
Second, broadcasting a voice reply.
For example, if the user voice request is "how long until we reach the destination", performing voice interaction may include broadcasting "There are 30 minutes left to the destination".
Third, executing the control instruction corresponding to the voice interaction and broadcasting a voice reply.
For example, if the user voice request is "open the sunroof", performing voice interaction may include opening the sunroof and broadcasting "The sunroof is open".
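As a minimal sketch, the three forms of performing voice interaction can be modeled as a dispatcher over a hypothetical arbitration-result dictionary with optional "command" and "tts" fields; the dictionary shape and function name are assumptions of this sketch, not the patent's data format:

```python
def execute_interaction(result: dict) -> list:
    """Perform voice interaction according to an arbitration result.

    Mirrors the three forms described above: command only, voice
    broadcast only, or both.
    """
    actions = []
    if result.get("command"):
        actions.append(f"execute:{result['command']}")   # e.g. open the sunroof
    if result.get("tts"):
        actions.append(f"speak:{result['tts']}")         # e.g. "The sunroof is open"
    return actions
```

A first arbitration result with no voice broadcast information would simply carry neither field's TTS text, yielding a silent outcome.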
According to the voice interaction method provided by this application, by fusing LLResult, CLResult and CCResult, outputting classified results in stages and arbitration results in advance, and combining this with the waiting time after the user voice request is received, a finer-grained arbitration result can be obtained, which helps to significantly improve the responsiveness of voice interaction while ensuring accuracy, achieving a faster experience and thus ultra-fast dialogue.
In some examples, before step 110 (obtaining a dialogue result), the voice interaction method may further include:
receiving a user voice request from the vehicle cockpit;
sending the user voice request to a server, so that the server performs text recognition on the user voice request to obtain cloud-recognized text, and performs semantic understanding on the cloud-recognized text to obtain a third-type dialogue result;
performing text recognition on the user voice request to obtain locally recognized text, and performing semantic understanding on the locally recognized text to obtain a first-type dialogue result;
when the cloud-recognized text sent by the server is received, performing semantic understanding on the cloud-recognized text to obtain a second-type dialogue result;
when the third-type dialogue result sent by the server is received, obtaining the third-type dialogue result.
It can be understood that a microphone or other sound pickup is arranged in the vehicle cockpit to capture the user voice request, which may come from any sound zone in the cockpit, including but not limited to the driver's seat zone, the front passenger zone, the second-row left zone behind the driver's seat, the second-row middle zone, and the second-row right zone behind the front passenger seat; some vehicles have more rows.
As shown in Figure 2, in this implementation, after the user voice request is received, it is kept locally and passed to the client central-control SDK for text recognition, and is also sent to the cloud server so that the server can perform text recognition on it.
In this way, in the text recognition stage, local text recognition is fast and depends little on the network state, while the server provides higher-quality cloud-recognized text.
The locally recognized text is passed to the local dialogue system for semantic understanding to obtain the first-type dialogue result, which has the fastest response and does not depend on the network.
The cloud-recognized text is also passed to the local dialogue system for semantic understanding to obtain the second-type dialogue result, which responds more slowly than the first-type dialogue result but, because its semantic understanding is based on the cloud ASR text, is of higher quality than the first-type dialogue result.
The cloud-recognized text is also passed to the cloud dialogue system for semantic understanding to obtain the third-type dialogue result, which responds more slowly than the second-type dialogue result but, because both its text recognition and its semantic understanding are performed in the cloud, is of higher quality than the second-type dialogue result.
For example, when the vehicle drives from an area with good network conditions into one with poor conditions and the network drops after the local dialogue system has received the cloud-recognized text, this voice interaction method can still obtain the second-type dialogue result (CLResult); in the related art, in this scenario, only the first-type dialogue result (LLResult) can be obtained, or a long wait for the network to recover is required before the third-type dialogue result (CCResult) can be obtained.
In this example, the three-route parallel processing can cope with various network conditions and provides a faster response while ensuring accuracy.
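A compact, synchronous sketch of the three-route processing described above follows. The engine functions are placeholder stand-ins for the real local/cloud ASR and NLU services (their names and outputs are assumptions of this sketch); in a real head unit the three routes run in parallel and their results arrive at different times:

```python
# Hypothetical stand-ins for the four engines involved.
def local_asr(audio: str) -> str:
    return f"local-text({audio})"

def cloud_asr(audio: str) -> str:
    return f"cloud-text({audio})"

def local_nlu(text: str) -> dict:
    return {"nlu": "local", "text": text}

def cloud_nlu(text: str) -> dict:
    return {"nlu": "cloud", "text": text}


def three_route_results(audio: str, network_ok: bool) -> dict:
    """Produce the (up to) three dialogue results described above.

    LLResult is always available; CLResult and CCResult additionally
    require the cloud-recognized text, i.e. a working network.
    """
    results = {"LLResult": local_nlu(local_asr(audio))}
    if network_ok:
        cloud_text = cloud_asr(audio)                  # delivered back to the head unit
        results["CLResult"] = local_nlu(cloud_text)    # local NLU on cloud text
        results["CCResult"] = cloud_nlu(cloud_text)    # cloud NLU on cloud text
    return results
```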
In some examples, step 120 (updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit) includes:
when the dialogue result is a first-type dialogue result and the head unit is not connected to the network, obtaining the arbitration result according to the local priority level of the dialogue result;
when the dialogue result is a first-type dialogue result, the head unit is connected to the network, it is determined that the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level (level 1 in Table 1), using the current dialogue result as the local dialogue result;
when the dialogue result is a first-type dialogue result, the head unit is connected to the network, it is determined that the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level (level 2, 3, 4 or 5 in Table 1), using the current dialogue result as the local dialogue result;
when the dialogue result is a first-type dialogue result, the head unit is connected to the network, and the head unit is not in ultra-fast dialogue mode, using the current dialogue result as the local dialogue result.
It can be understood that the first-type dialogue result (LLResult) is returned quickly, for example within 100+ ms; when LLResult is received, the precondition for arbitration is whether the head unit is connected to the network.
When the head unit is offline, there is no need to wait for other subsequent dialogue results; the result is returned directly according to its local priority level, so that user voice requests can be answered quickly without a network. When the local priority level is 1 or 2, the dialogue result is returned and arbitration ends; when the local priority level is 5, a default dialogue result without TTS broadcast is generated, returned, and arbitration ends; when the local priority level is 3 or 4, a TTS broadcast dialogue result such as "Network abnormal, this function is unavailable" is generated, returned, and arbitration ends.
When the head unit is online and ultra-fast dialogue is on, whether to end arbitration or update the local dialogue result is chosen according to the local priority level of the dialogue result: when the level is 1, the result is returned and arbitration ends, giving the user a faster experience; when the level is 2, 3, 4 or 5, the local result is considered to have low confidence and a better dialogue result must be waited for, so the local dialogue result is saved as the basis for arbitration on a subsequent waiting timeout.
When the head unit is online and ultra-fast dialogue is off, the high-confidence cloud result is trusted, and the current dialogue result is used as the local dialogue result, as the basis for arbitration on a subsequent waiting timeout.
In this example, when the first-type dialogue result (LLResult) is obtained, an extremely fast response can be achieved in some cases.
In some examples, step 120 (updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit) includes:
when the dialogue result is a second-type dialogue result and the head unit is not in ultra-fast dialogue mode, using the current dialogue result as the local dialogue result.
In this example, if the currently received dialogue result is a second-type dialogue result (CLResult), a first-type dialogue result has already been received and saved locally as the local dialogue result. When the head unit is not in ultra-fast dialogue mode, the user values reply quality over response speed, so the current dialogue result (CLResult) is used as the local dialogue result, as the basis for arbitration on a subsequent waiting timeout. In this way, a higher-quality voice interaction result can be provided according to the user's needs.
In some examples, step 120 (updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit) includes:
when the dialogue result is a second-type dialogue result, the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level (level 1 in Table 1), using the current dialogue result as the arbitration result;
when the dialogue result is a second-type dialogue result, the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level (level 2, 3, 4 or 5 in Table 1), using the current dialogue result as the local dialogue result.
In this example, if the currently received dialogue result is a second-type dialogue result (CLResult), usually returned within 200+ ms, a first-type dialogue result has already been received and saved locally as the local dialogue result. When the head unit is in ultra-fast dialogue mode, the user demands a faster response, so whether to obtain the arbitration result directly or to update the local dialogue result must be determined from the local priority level of the dialogue result.
When the local priority level of the dialogue result is a directly executable level (level 1 in Table 1), the current dialogue result is used as the arbitration result and a direct preemption is performed; when the local priority level is not a directly executable level (level 2, 3, 4 or 5 in Table 1), the current dialogue result (CLResult) is used as the local dialogue result, as the basis for arbitration on a subsequent waiting timeout.
It should be noted that, for the same user voice request, the local priority level of LLResult and that of CLResult are not necessarily the same; for example, if the local priority level of LLResult is 2 and that of CLResult is 1, preemption will also occur.
In some examples, step 120 (updating the local dialogue result or obtaining the arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit) includes:
when the dialogue result is a third-type dialogue result, using the current dialogue result as the local dialogue result.
In this example, when the third-type dialogue result (CCResult) with the highest confidence is obtained, the dialogue result can be returned directly to end the arbitration. In this way, a high-quality voice interaction result is obtained, and the judgment logic is simple.
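Putting the step 120 branches for the three result types together, a simplified handler might look as follows. Where the refinement paragraphs and the step-by-step flow later in this description differ (for example, whether a level-1 LLResult in ultra-fast mode ends arbitration or is merely saved), this sketch follows the step-by-step flow; all names and the state dictionary are illustrative assumptions:

```python
def on_dialogue_result(kind, level, fast_mode, online, state):
    """Handle one incoming dialogue result per the branches above.

    kind: "LL", "CL" or "CC"; level: local priority level from Table 1;
    fast_mode: ultra-fast dialogue mode on/off; online: network connected.
    Returns an outcome string when arbitration ends, otherwise stores the
    result in ``state`` as the local dialogue result and returns None.
    """
    if kind == "CC":                       # highest confidence: end arbitration
        return "arbitrated:CC"
    if kind == "LL":
        if not online:                     # offline: decide from the level alone
            return f"arbitrated:LL-level{level}"
        if fast_mode and level == 1:       # ultra-fast mode, level 1: preemption
            return "arbitrated:LL"
    if kind == "CL" and fast_mode and level == 1:
        return "arbitrated:CL"             # direct preemption on cloud-ASR text
    # All remaining branches save the result for later timeout arbitration.
    state["local_result"] = (kind, level)
    return None
```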
In some examples, step 130 (obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received) includes:
when the waiting time exceeds the first duration and is less than the second duration, and it is determined that a local dialogue result currently exists whose local priority level is a directly executable level or a timeout-executable level (level 1 or level 2 in Table 1), using the current local dialogue result as the arbitration result.
In other words, during fusion, when the waiting time reaches the first duration and the cloud has not yet fed back the third-type dialogue result (CCResult), if a local dialogue result exists (the LLResult or CLResult saved during previous arbitration), the current local dialogue result can be returned as the result to end arbitration, giving the user a faster experience while accuracy is ensured.
It should be noted that the first duration and the second duration may be preset; for example, the first duration may be 2.5s-3.5s and the second duration 4.5s-5.5s, e.g. 3s and 5s respectively. The first duration and the second duration may be factory-set, or may be adjusted according to user input.
In this way, for dialogue results of these levels, only the relatively shorter first duration is waited, maintaining faster responsiveness.
In some examples, step 130 (obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received) includes:
when the waiting time exceeds the second duration, and it is determined that a local dialogue result currently exists whose local priority level is the rejection level (level 5 in Table 1), obtaining a first arbitration result, the first arbitration result having no voice broadcast information.
In this example, if the cloud result has still not been received when the waiting time exceeds the second duration, a default dialogue result without a TTS reply is obtained, and arbitration ends.
In some examples, step 130 (obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received) includes:
when the waiting time exceeds the second duration, and it is determined that a local dialogue result currently exists whose local priority level is an unsupported level or a reserved-field level (level 3 or level 4 in Table 1), obtaining a second arbitration result, the second arbitration result including voice broadcast information for indicating a network abnormality.
In this example, a TTS broadcast such as "Network abnormal, this function is unavailable" is generated, the result is returned, and arbitration ends.
Setting the second duration prevents long waits. In the related art, no-response situations often occur and degrade the user experience; in the technical solution of this application, the second duration is the bottom-line waiting time, i.e. when the waiting time reaches the second duration, arbitration ends according to the local priority level of the result.
In some examples, step 130 (obtaining an arbitration result according to the local dialogue result and the waiting time after the user voice request is received) includes:
when the waiting time exceeds the second duration and it is determined that no local dialogue result currently exists, obtaining a third arbitration result, the third arbitration result including voice broadcast information for indicating a network abnormality.
In other words, if no local dialogue result exists, a TTS broadcast such as "Network abnormal, this function is unavailable" is generated, the result is returned, and arbitration ends. It should be noted that at least LLResult normally exists as the local dialogue result; if the above situation occurs, it indicates a program error.
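The timeout branches of step 130 can likewise be sketched as a single handler over the saved local dialogue result. The string outcomes and the "first"/"second" timer labels (the roughly 3s and 5s timers) are illustrative stand-ins, not the patent's interfaces:

```python
def on_timeout(elapsed, state):
    """Arbitrate when a waiting timer fires ("first" = ~3s, "second" = ~5s).

    Mirrors the timeout branches above; level numbers refer to Table 1.
    Returns the outcome, or None if arbitration should keep waiting.
    """
    local = state.get("local_result")          # (kind, level) or None
    if elapsed == "first":
        if local and local[1] in (1, 2):       # executable / timeout-executable
            return f"arbitrated:{local[0]}"
        return None                            # keep waiting for the cloud
    # Second (bottom-line) timeout: arbitration must end now.
    if local is None:
        return "tts:network-abnormal"          # third arbitration result
    if local[1] == 5:                          # rejection level: silent result
        return "no-tts"                        # first arbitration result
    if local[1] in (3, 4):                     # unsupported / reserved field
        return "tts:network-abnormal"          # second arbitration result
    return f"arbitrated:{local[0]}"            # defensive fallback (levels 1/2
                                               # are normally taken at ~3s)
```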
A voice interaction method provided by this application is described below.
The voice interaction method includes the following steps:
1. On detecting VadEnd (Voice Activity Detection End), start arbitration and start the 3s and 5s waiting-time timers;
2. On receiving ASRResult, determine whether the recognized text is empty, to check whether the audio is valid speech (to filter accidental sounds such as "dong dong"):
2.a. If empty, abort arbitration directly (stop the timers): with empty text, neither the device nor the cloud produces a dialogue result;
2.b. If not empty, continue the arbitration flow.
3. When a device or cloud dialogue result arrives, determine whether the arbitration flow has completed:
3.a. If completed, ignore the result, as an arbitration result has already been generated;
3.b. Otherwise, continue the arbitration flow.
4. If the result is LLResult, the precondition for arbitration is whether the head unit is connected to the network:
4.a. When the head unit is offline, there is no need to wait for other dialogue results; the result is returned directly according to its level, so user voice requests can be answered quickly without a network:
4.a.i. If the local priority level is 1 or 2, cancel the timers, return the result, and end arbitration;
4.a.ii. If the local priority level is 5, generate a default dialogue result without TTS broadcast, cancel the timers, return the result, and end arbitration;
4.a.iii. Otherwise, generate a TTS broadcast dialogue result such as "Network abnormal, this function is unavailable", cancel the timers, return the result, and end arbitration.
4.b. With a network:
4.b.i. Ultra-fast dialogue mode is on (the user trusts the local algorithms):
4.b.i.1. If the local priority level is 1, cancel the timers, return the result, and end arbitration, giving the user a faster experience;
4.b.i.2. Otherwise, the local result is considered to have low confidence and a better dialogue result must be waited for; the local dialogue result is saved as the basis for arbitration on a subsequent waiting timeout.
4.b.ii. Ultra-fast dialogue mode is off: trust the high-confidence cloud result, and save the local dialogue result as the basis for arbitration on a subsequent waiting timeout.
5. If the result is CLResult, the precondition for arbitration is whether ultra-fast dialogue is on:
5.a. Ultra-fast dialogue mode is off: update the saved local dialogue result as the basis for arbitration on a subsequent waiting timeout;
5.b. Ultra-fast dialogue mode is on:
5.b.i. If the local level is 1, cancel the timers, return the result, and end arbitration, using the current dialogue result as the arbitration result; with high confidence this gives the user a faster experience;
5.b.ii. Otherwise, the local result is considered to have low confidence and a better dialogue result must be waited for; the local dialogue result is saved as the basis for arbitration on a subsequent waiting timeout.
6. If the input result is CCResult, the pure-cloud dialogue result (highest confidence, most accurate), cancel the timers, return the result, and end arbitration.
7. During fusion, when the 3s local-fallback waiting time expires (the cloud dialogue result has not returned):
7.a. A local dialogue result exists:
7.a.i. If the local dialogue level is 1 or 2 (level 2 when ultra-fast dialogue is on, level 1 or 2 when it is off), cancel the timers, return the result, and end arbitration, giving the user a faster experience while accuracy is ensured;
7.a.ii. Otherwise, continue the arbitration flow and wait for the final high-confidence cloud result.
7.b. No local dialogue result exists: continue the arbitration flow (this occurs only on a program error).
8. During fusion, when the 5s cloud waiting time expires (the cloud dialogue result has not returned, otherwise it would already have been adopted; this happens under extreme network conditions):
8.a. A local dialogue result exists:
8.a.i. If the local dialogue level is 5, generate a default dialogue result without a TTS reply, and arbitration ends;
8.a.ii. Otherwise (the local dialogue level is now 3 or 4; results classified 1 or 2 are adopted at the 3s timer), generate a TTS broadcast such as "Network abnormal, this function is unavailable", return the result, and arbitration ends.
8.b. No local dialogue result exists: generate a TTS broadcast such as "Network abnormal, this function is unavailable", return the result, and arbitration ends (this occurs only on a program error).
According to the above voice interaction method, by proposing a fusion scheme of the local device and the cloud that plays to the strengths of each, a relatively high-quality service result can be output at any time and in any state, providing a voice response that is as fast and accurate as possible. In particular, the cloud-ASR/local-NLU output result is added on top of the traditional dialogue results, giving finer granularity; at the same time, classified results can be output in stages according to algorithmic decisions such as confidence and domain classification, and arbitration results can be output in advance, so that the user experience is faster while accuracy is ensured.
The voice interaction method provided by this application may be executed by a voice interaction device. In this application, a voice interaction device executing the voice interaction method is taken as an example to describe the voice interaction device provided by this application.
This application also provides a voice interaction device.
As shown in Figure 3, the voice interaction device includes a first acquisition module 310, a first processing module 320, a second processing module 330 and a third processing module 340.
The first acquisition module 310 is used to obtain a dialogue result;
the first processing module 320 is used to update the local dialogue result or obtain an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the head unit; wherein the dialogue result includes a first-type dialogue result, a second-type dialogue result and a third-type dialogue result, the first-type dialogue result is determined by local text recognition and semantic understanding, the second-type dialogue result is determined by cloud text recognition and local semantic understanding, and the third-type dialogue result is determined by cloud text recognition and semantic understanding;
the second processing module 330 is used to obtain an arbitration result according to the local dialogue result and the waiting time after the user voice request is received;
the third processing module 340 is used to perform voice interaction according to the arbitration result.
According to the voice interaction device provided by this application, by fusing LLResult, CLResult and CCResult, outputting classified results in stages and arbitration results in advance, and combining this with the waiting time after the user voice request is received, a finer-grained arbitration result can be obtained, which helps to significantly improve the responsiveness of voice interaction while ensuring accuracy, achieving a faster experience and thus ultra-fast dialogue.
In some examples, the first processing module 320 is further used to use the current dialogue result as the local dialogue result when the dialogue result is a second-type dialogue result and the head unit is not in ultra-fast dialogue mode.
In some examples, the first processing module 320 is further used to use the current dialogue result as the arbitration result when the dialogue result is a second-type dialogue result, the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level; and to use the current dialogue result as the local dialogue result when the dialogue result is a second-type dialogue result, the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level.
In some examples, the first processing module 320 is further used to obtain the arbitration result according to the local priority level of the dialogue result when the dialogue result is a first-type dialogue result and the head unit is not connected to the network; to use the current dialogue result as the local dialogue result when the dialogue result is a first-type dialogue result, the head unit is connected to the network, it is determined that the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is a directly executable level; to use the current dialogue result as the local dialogue result when the dialogue result is a first-type dialogue result, the head unit is connected to the network, it is determined that the head unit is in ultra-fast dialogue mode, and the local priority level of the dialogue result is not a directly executable level; and to use the current dialogue result as the local dialogue result when the dialogue result is a first-type dialogue result, the head unit is connected to the network, and the head unit is not in ultra-fast dialogue mode.
In some examples, the first processing module 320 is further used to use the current dialogue result as the local dialogue result when the dialogue result is a third-type dialogue result.
In some examples, the voice interaction device may further include:
a receiving module, used to receive a user voice request from the vehicle cockpit before the dialogue result is obtained;
a sending module, used to send the user voice request to a server, so that the server performs text recognition on the user voice request to obtain cloud-recognized text and performs semantic understanding on the cloud-recognized text to obtain a third-type dialogue result;
a text recognition module, used to perform text recognition on the user voice request to obtain locally recognized text, and to perform semantic understanding on the locally recognized text to obtain a first-type dialogue result;
the receiving module is further used to perform semantic understanding on the cloud-recognized text when the cloud-recognized text sent by the server is received, to obtain a second-type dialogue result;
the receiving module is further used to obtain the third-type dialogue result when the third-type dialogue result sent by the server is received.
In some examples, the second processing module 330 is further used to use the current local dialogue result as the arbitration result when the waiting time exceeds the first duration and is less than the second duration, and it is determined that a local dialogue result currently exists whose local priority level is a directly executable level or a timeout-executable level.
In some examples, the second processing module 330 is further used to obtain a first arbitration result without voice broadcast information when the waiting time exceeds the second duration and it is determined that a local dialogue result currently exists whose local priority level is the rejection level.
In some examples, the second processing module 330 is further used to obtain a second arbitration result when the waiting time exceeds the second duration and it is determined that a local dialogue result currently exists whose local priority level is an unsupported level or a reserved-field level, the second arbitration result including voice broadcast information for indicating a network abnormality.
In some examples, the second processing module 330 is further used to obtain a third arbitration result when the waiting time exceeds the second duration and it is determined that no local dialogue result currently exists, the third arbitration result including voice broadcast information for indicating a network abnormality.
The voice interaction device in this application may be an electronic device, or a component in an electronic device such as an integrated circuit or a chip. The electronic device may be a terminal, or another device other than a terminal. For example, the electronic device may be a vehicle or a head unit on a vehicle, which is not specifically limited in this application.
The voice interaction device in this application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in this application.
The voice interaction device provided by this application can implement each process of the method examples illustrated in Figures 1 and 2; to avoid repetition, they are not described here again.
In some examples, as shown in Figure 4, this application also provides a vehicle 400, including a processor 401, a memory 402, and a computer program stored in the memory 402 and executable on the processor 401. When the program is executed by the processor 401, each process of the above voice interaction method examples is implemented and the same technical effect can be achieved; to avoid repetition, it is not described here again.
This application also provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, each process of the above voice interaction method examples is implemented and the same technical effect can be achieved; to avoid repetition, it is not described here again.
The processor is the processor in the electronic device described in the above examples. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
This application also provides a computer program product, including a computer program, which implements the above voice interaction method when executed by a processor.
The processor is the processor in the electronic device described in the above examples. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
This application further provides a chip, including a processor and a communication interface coupled to the processor, wherein the processor is used to run programs or instructions to implement each process of the above voice interaction method examples, and the same technical effect can be achieved; to avoid repetition, it is not described here again.
It should be understood that the chip mentioned in this application may also be called a system-level chip, a system chip, a chip system or a system-on-chip.
It should be noted that, in this document, the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that includes the element. In addition, it should be pointed out that the scope of the methods and apparatuses in the implementations of this application is not limited to performing the functions in the order shown or discussed; it may also include performing the functions in a substantially simultaneous manner or in the reverse order according to the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted or combined. In addition, features described with reference to certain examples may be combined in other examples.
From the description of the above implementations, a person skilled in the art can clearly understand that the above example methods can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on such an understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disk) and includes a number of instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the examples of this application.
The examples of this application have been described above in conjunction with the drawings, but this application is not limited to the specific implementations described above, which are merely illustrative rather than restrictive. Under the inspiration of this application, a person of ordinary skill in the art can make many other forms without departing from the purpose of this application and the scope protected by the claims, all of which fall within the protection of this application.
In the description of this specification, reference to the terms "one example", "some examples", "an illustrative example", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that example is included in at least one example of this application. In this specification, illustrative expressions of the above terms do not necessarily refer to the same example. Moreover, the described specific features, structures, materials or characteristics may be combined in a suitable manner in any one or more examples.
Although examples of this application have been shown and described, a person of ordinary skill in the art can understand that various changes, modifications, substitutions and variations can be made to these examples without departing from the principles and purpose of this application; the scope of this application is defined by the claims and their equivalents.

Claims (13)

  1. A voice interaction method, wherein the method comprises:
    obtaining a dialogue result;
    updating a local dialogue result or obtaining an arbitration result according to at least part of a type of the dialogue result, a local priority level of the dialogue result, and an interaction mode of an in-vehicle head unit; wherein the dialogue result comprises a first-type dialogue result, a second-type dialogue result, and a third-type dialogue result, the first-type dialogue result being determined by performing text recognition and semantic understanding locally, the second-type dialogue result being determined by performing text recognition in the cloud and semantic understanding locally, and the third-type dialogue result being determined by performing text recognition and semantic understanding in the cloud;
    obtaining an arbitration result according to the local dialogue result and a waiting duration after a user voice request is received;
    performing voice interaction according to the arbitration result.
  2. The voice interaction method according to claim 1, wherein the updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the in-vehicle head unit comprises:
    in a case where the dialogue result is a second-type dialogue result and the in-vehicle head unit is not in a fast dialogue mode, taking the current dialogue result as the local dialogue result.
  3. The voice interaction method according to claim 1, wherein the updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the in-vehicle head unit comprises:
    in a case where the dialogue result is a second-type dialogue result, the in-vehicle head unit is in a fast dialogue mode, and the local priority level of the dialogue result is a directly executable level, taking the current dialogue result as the arbitration result;
    in a case where the dialogue result is a second-type dialogue result, the in-vehicle head unit is in the fast dialogue mode, and the local priority level of the dialogue result is not the directly executable level, taking the current dialogue result as the local dialogue result.
  4. The voice interaction method according to claim 1, wherein the updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the in-vehicle head unit comprises:
    in a case where the dialogue result is a first-type dialogue result and the in-vehicle head unit is not connected to a network, obtaining the arbitration result according to the local priority level of the dialogue result;
    in a case where the dialogue result is a first-type dialogue result, the in-vehicle head unit is connected to a network, it is determined that the in-vehicle head unit is in a fast dialogue mode, and the local priority level of the dialogue result is a directly executable level, taking the current dialogue result as the local dialogue result;
    in a case where the dialogue result is a first-type dialogue result, the in-vehicle head unit is connected to a network, it is determined that the in-vehicle head unit is in the fast dialogue mode, and the local priority level of the dialogue result is not the directly executable level, taking the current dialogue result as the local dialogue result;
    in a case where the dialogue result is a first-type dialogue result, the in-vehicle head unit is connected to a network, and the in-vehicle head unit is not in the fast dialogue mode, taking the current dialogue result as the local dialogue result.
  5. The voice interaction method according to claim 1, wherein the updating a local dialogue result or obtaining an arbitration result according to at least part of the type of the dialogue result, the local priority level of the dialogue result, and the interaction mode of the in-vehicle head unit comprises:
    in a case where the dialogue result is a third-type dialogue result, taking the current dialogue result as the local dialogue result.
  6. The voice interaction method according to any one of claims 1-5, wherein before the obtaining a dialogue result, the method further comprises:
    receiving a user voice request from a vehicle cabin;
    sending the user voice request to a server, so that the server performs text recognition on the user voice request to obtain cloud-recognized text, and the server performs semantic understanding on the cloud-recognized text to obtain a third-type dialogue result;
    performing text recognition on the user voice request to obtain locally recognized text, and performing semantic understanding on the locally recognized text to obtain a first-type dialogue result;
    in a case where the cloud-recognized text sent by the server is received, performing semantic understanding on the cloud-recognized text to obtain a second-type dialogue result;
    in a case where the third-type dialogue result sent by the server is received, obtaining the third-type dialogue result.
  7. The voice interaction method according to any one of claims 1-5, wherein the obtaining an arbitration result according to the local dialogue result and the waiting duration after a user voice request is received comprises:
    in a case where the waiting duration exceeds a first duration and is less than a second duration, it is determined that a local dialogue result currently exists, and the local priority level of the local dialogue result is a directly executable level or an executable-on-timeout level, taking the current local dialogue result as the arbitration result.
  8. The voice interaction method according to any one of claims 1-5, wherein the obtaining an arbitration result according to the local dialogue result and the waiting duration after a user voice request is received comprises:
    in a case where the waiting duration exceeds a second duration, it is determined that a local dialogue result currently exists, and the local priority level of the local dialogue result is a rejection level, obtaining a first arbitration result, the first arbitration result having no voice broadcast information.
  9. The voice interaction method according to any one of claims 1-5, wherein the obtaining an arbitration result according to the local dialogue result and the waiting duration after a user voice request is received comprises:
    in a case where the waiting duration exceeds a second duration, it is determined that a local dialogue result currently exists, and the local priority level of the local dialogue result is an unsupported level or a reserved-field level, obtaining a second arbitration result, the second arbitration result comprising voice broadcast information indicating a network anomaly.
  10. The voice interaction method according to any one of claims 1-5, wherein the obtaining an arbitration result according to the local dialogue result and the waiting duration after a user voice request is received comprises:
    in a case where the waiting duration exceeds a second duration and it is determined that no local dialogue result currently exists, obtaining a third arbitration result, the third arbitration result comprising voice broadcast information indicating a network anomaly.
  11. A voice interaction apparatus, wherein the apparatus comprises:
    a first obtaining module, configured to obtain a dialogue result;
    a first processing module, configured to update a local dialogue result or obtain an arbitration result according to at least part of a type of the dialogue result, a local priority level of the dialogue result, and an interaction mode of an in-vehicle head unit; wherein the dialogue result comprises a first-type dialogue result, a second-type dialogue result, and a third-type dialogue result, the first-type dialogue result being determined by performing text recognition and semantic understanding locally, the second-type dialogue result being determined by performing text recognition in the cloud and semantic understanding locally, and the third-type dialogue result being determined by performing text recognition and semantic understanding in the cloud;
    a second processing module, configured to obtain an arbitration result according to the local dialogue result and a waiting duration after a user voice request is received;
    a third processing module, configured to perform voice interaction according to the arbitration result.
  12. A vehicle, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the voice interaction method according to any one of claims 1-10.
  13. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the voice interaction method according to any one of claims 1-10.
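The arbitration flow recited in claims 1-10 can be summarized as a small decision procedure: claims 2-5 decide whether a newly arrived dialogue result is promoted directly to an arbitration result or cached as the local dialogue result, and claims 7-10 decide what to do once the waiting duration crosses the first and second thresholds. The sketch below is illustrative only and is not part of the claims; all type names, level names, and function names are assumptions introduced for readability.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ResultType(Enum):
    LOCAL_ASR_LOCAL_NLU = auto()   # first-type: recognized and understood locally
    CLOUD_ASR_LOCAL_NLU = auto()   # second-type: recognized in the cloud, understood locally
    CLOUD_ASR_CLOUD_NLU = auto()   # third-type: recognized and understood in the cloud

class LocalPriority(Enum):
    DIRECTLY_EXECUTABLE = auto()   # may be executed immediately
    EXECUTABLE_ON_TIMEOUT = auto() # may be executed once waiting times out
    REJECTION = auto()             # request rejected locally
    UNSUPPORTED = auto()           # intent not supported locally
    RESERVED = auto()              # reserved-field level

@dataclass
class DialogueResult:
    type: ResultType
    priority: LocalPriority

def update_or_arbitrate(result, fast_mode, online):
    """Claims 2-5: either promote the result to the arbitration result
    ('arbitrated') or cache it as the local dialogue result ('local')."""
    if result.type is ResultType.CLOUD_ASR_CLOUD_NLU:
        return ("local", result)                  # claim 5
    if result.type is ResultType.CLOUD_ASR_LOCAL_NLU:
        if fast_mode and result.priority is LocalPriority.DIRECTLY_EXECUTABLE:
            return ("arbitrated", result)         # claim 3, first branch
        return ("local", result)                  # claims 2 and 3, second branch
    # first-type result (local ASR + local NLU)
    if not online:
        return ("arbitrated", result)             # claim 4: offline, by local priority
    return ("local", result)                      # claim 4: all online branches

def arbitrate_by_wait(local_result, waited, t1, t2):
    """Claims 7-10: decide from the cached local result once the waiting
    duration passes the first (t1) and second (t2) thresholds."""
    if local_result is not None and t1 < waited < t2:
        if local_result.priority in (LocalPriority.DIRECTLY_EXECUTABLE,
                                     LocalPriority.EXECUTABLE_ON_TIMEOUT):
            return local_result                   # claim 7: execute the local result
    if waited >= t2:
        if local_result is None:
            return "network-error-prompt"         # claim 10
        if local_result.priority is LocalPriority.REJECTION:
            return "silent"                       # claim 8: no voice broadcast
        if local_result.priority in (LocalPriority.UNSUPPORTED,
                                     LocalPriority.RESERVED):
            return "network-error-prompt"         # claim 9
    return None                                   # keep waiting for the cloud
```

As a sketch, this collapses the two claimed stages into pure functions; a real head-unit implementation would run them asynchronously as local and cloud results arrive.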
PCT/CN2023/124567 2022-10-28 2023-10-13 Voice interaction method, voice interaction apparatus, vehicle, and readable storage medium WO2024088085A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211332359.X 2022-10-28
CN202211332359.XA CN115410579B (zh) 2022-10-28 2022-10-28 Voice interaction method, voice interaction apparatus, vehicle, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2024088085A1 (zh)

Family

ID=84167973

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/124567 WO2024088085A1 (zh) 2022-10-28 2023-10-13 Voice interaction method, voice interaction apparatus, vehicle, and readable storage medium

Country Status (2)

Country Link
CN (1) CN115410579B (zh)
WO (1) WO2024088085A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115410579B (zh) * 2022-10-28 2023-03-31 Guangzhou Xiaopeng Motors Technology Co., Ltd. Voice interaction method, voice interaction apparatus, vehicle, and readable storage medium
CN115862600B (zh) * 2023-01-10 2023-09-12 Guangzhou Xiaopeng Motors Technology Co., Ltd. Speech recognition method and apparatus, and vehicle

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440867A (zh) * 2013-08-02 2013-12-11 Anhui USTC iFlytek Co., Ltd. Speech recognition method and system
CN108305620A (zh) * 2018-05-09 2018-07-20 Shanghai Yingshi Automotive Technology Co., Ltd. Active interactive speech recognition system with hybrid local-cloud recognition relying on big data
CN109949817A (zh) * 2019-02-19 2019-06-28 FAW-Volkswagen Automobile Co., Ltd. Voice arbitration method and apparatus based on dual operating systems and dual speech recognition engines
CN109961792A (zh) * 2019-03-04 2019-07-02 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing speech
CN115410579A (zh) * 2022-10-28 2022-11-29 Guangzhou Xiaopeng Motors Technology Co., Ltd. Voice interaction method, voice interaction apparatus, vehicle, and readable storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105551494A (zh) * 2015-12-11 2016-05-04 Chery Automobile Co., Ltd. In-vehicle speech recognition system and recognition method based on mobile phone interconnection
CN106328148B (zh) * 2016-08-19 2019-12-31 SAIC General Motors Co., Ltd. Natural speech recognition method, apparatus, and system based on hybrid local and cloud recognition
CN106371801A (zh) * 2016-09-23 2017-02-01 Anhui Shengxun Information Technology Co., Ltd. Voice mouse system based on speech recognition technology
CN106384594A (zh) * 2016-11-04 2017-02-08 Hunan Haiyi E-Commerce Co., Ltd. Vehicle-mounted terminal for speech recognition and method thereof
EP3567585A4 (en) * 2017-11-15 2020-04-15 Sony Corporation INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD
WO2019198405A1 (ja) * 2018-04-12 2019-10-17 Sony Corporation Information processing apparatus, information processing system, information processing method, and program
CN112699257A (zh) * 2020-06-04 2021-04-23 Human Horizons (Shanghai) New Energy Drive Technology Co., Ltd. Work generation and editing method, apparatus, terminal, server, and system
CN112562681B (zh) * 2020-12-02 2021-11-19 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method and apparatus, and storage medium
CN112509585A (zh) * 2020-12-22 2021-03-16 Beijing Baidu Netcom Science and Technology Co., Ltd. Speech processing method, apparatus, and device for vehicle-mounted equipment, and storage medium
CN112992145B (zh) * 2021-05-10 2021-08-06 Hubei ECARX Technology Co., Ltd. Offline-online semantic recognition arbitration method, electronic device, and storage medium


Also Published As

Publication number Publication date
CN115410579B (zh) 2023-03-31
CN115410579A (zh) 2022-11-29
