WO2023124957A1 - Voice interaction method and apparatus, server, and readable storage medium - Google Patents

Voice interaction method and apparatus, server, and readable storage medium Download PDF

Info

Publication number
WO2023124957A1
WO2023124957A1 (PCT/CN2022/138587, CN2022138587W)
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
type
text
voice interaction
entity
Prior art date
Application number
PCT/CN2022/138587
Other languages
English (en)
French (fr)
Inventor
王亭玉
赵群
宁洪珂
樊骏锋
潘晓彤
赵恒艺
Original Assignee
广州小鹏汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司
Publication of WO2023124957A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00: Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08: Interaction between the driver and the control system
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00: Input parameters relating to occupants
    • B60W2540/21: Voice

Definitions

  • The present application relates to the field of voice technology, and in particular to a voice interaction method and apparatus, a server, and a readable storage medium.
  • When controlling the vehicle, the in-vehicle voice interaction system often cannot accurately identify the user's intention, that is, it cannot correctly identify the user's real need, and must give a TTS (text-to-speech) reply to guide the user through a further step.
  • The TTS broadcast is lengthy, the user needs multiple rounds of interaction, and the user experience deteriorates. For example, when the car screen is already on a related page and the user says "turn the volume up", it is not necessarily the default "media volume" that should be turned up; another volume may be intended.
  • the present application provides a voice interaction method and its device, server and readable storage medium, which can accurately identify the real intention of the user and improve user experience.
  • the present application provides a voice interaction method.
  • The voice interaction method includes: performing voice recognition on a voice request for adjusting a preset function of the vehicle to obtain a preliminary recognition text, where the preset function refers to a function that simulates scale adjustment of the operation of a vehicle part; determining a corresponding first-type entity according to the preliminary recognition text; performing a screen element query according to the first-type entity to obtain a second-type entity, where one first-type entity corresponds to multiple second-type entities; combining the second-type entity with the preliminary recognition text to generate a text to be recognized; and using an intent recognition model to perform intent recognition on the text to be recognized, and performing voice interaction according to the result of the intent recognition.
  • The voice interaction method of the present application creates a mapping relationship between multiple second-type entities and a first-type entity, and combines the real second-type entity retrieved by the screen element query with the preliminary recognition text to generate the text to be recognized, so the user's true intention can be accurately identified.
  • Determining the corresponding first-type entity according to the preliminary recognition text includes: performing reduplicated word extraction on the preliminary recognition text to obtain a preset text word; and determining the first-type entity of the preliminary recognition text according to the preset text word.
  • In this way, the preset text word can be obtained by reduplicated word extraction from the user's preliminary recognition text, and the first-type entity can then be determined according to the preset text word. The extraction of the preset text word quickly narrows the range of the first-type entity, laying a foundation for subsequently determining the second-type entity based on the first-type entity.
  • Performing reduplicated word extraction on the preliminary recognition text to obtain the preset text word includes: performing reduplicated word extraction on the preliminary recognition text through character string matching or regular-expression search.
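As a minimal sketch of this extraction step, reduplicated words can be pulled out of an English-glossed preliminary text with a regular expression; for Chinese reduplication such as 大大, a character-level pattern like `(.)\1` would be used instead. The function name and pattern are illustrative, not from the patent:

```python
import re

def extract_reduplicated_words(text):
    """Extract reduplicated words (e.g. "big big") from a preliminary
    recognition text via regular-expression search."""
    # \b(\w+) \1\b matches a word immediately followed by itself
    return re.findall(r"\b(\w+) \1\b", text)

print(extract_reduplicated_words("the volume big big"))  # → ['big']
```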
  • The voice interaction method includes: establishing a first mapping relationship table between changeable verbs and first-type entities, where one changeable verb corresponds to multiple first-type entities.
  • The voice interaction method of the present application establishes the first mapping relationship table between changeable verbs and first-type entities, so the first-type entity can be determined according to the first mapping relationship table, laying a foundation for accurately identifying the user's intention.
  • Determining the first-type entity of the preliminary recognition text according to the preset text word includes: performing normalization processing on the preset text word to determine the changeable verb corresponding to the preset text word; and determining the first-type entity according to the changeable verb and the first mapping relationship table.
  • In this way, the changeable verb corresponding to the preset text word can be accurately determined, so that the first-type entity can be accurately determined according to the changeable verb and the first mapping relationship table.
  • the voice interaction method includes: establishing a second mapping relationship table between the first type of entity and the second type of entity.
  • Performing the screen element query according to the first-type entity to obtain the second-type entity includes: when the current page of the screen is a non-expandable page, determining the second-type entity according to the non-expandable page, the first-type entity, and the second mapping relationship table.
  • The voice interaction method of the present application first judges whether the current page of the screen is a non-expandable page. Since a non-expandable page does not contain a pop-up control interface, the second-type entity can be determined directly from the non-expandable page, the first-type entity, and the second mapping relationship table. Judging the non-expandable page first improves the efficiency of determining the second-type entity.
  • A non-expandable page refers to an interface that does not contain a pop-up control interface but may contain draggable elements, such as a system volume adjustment interface.
  • Performing the screen element query according to the first-type entity to obtain the second-type entity includes: when the current page is an expandable page, obtaining the main page name and control name of the expandable page; and determining the second-type entity according to the main page name, the control name, and the second mapping relationship table.
  • In this way, when the current page is an expandable page, the voice interaction method of the present application obtains the main page name and control name of the expandable page and determines the second-type entity according to them and the second mapping relationship table. That is, the method only reads the possible control name and the current page name, and does not create an executable-node script command for the control, so the operation is simple.
  • An expandable page refers to an interface containing pop-up controls, such as a system settings interface, which contains a volume adjustment control that can be opened.
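A hedged sketch of the screen element query over the two page types; the dictionary page representation, key names, and entity strings are assumptions for illustration, not the patent's actual data structures:

```python
def query_second_type_entity(page, first_type_entity, second_mapping):
    """Determine the second-type entity from the current screen page.
    A non-expandable page (no pop-up controls) yields its bound entity
    directly; an expandable page assembles the entity from the main-page
    name and the control name."""
    candidates = second_mapping.get(first_type_entity, [])
    if not page.get("expandable", False):
        # non-expandable page: inherit the entity bound to this page,
        # falling back to the mapping table's first candidate
        return page.get("entity") or (candidates[0] if candidates else None)
    # expandable page: read main-page name + control name, e.g. "system volume"
    return f'{page["main_page"]} {page["control"]}'

mapping = {"volume": ["navigation volume", "system volume", "media volume"]}
nav_page = {"expandable": False, "entity": "navigation volume"}
settings = {"expandable": True, "main_page": "system", "control": "volume"}
print(query_second_type_entity(nav_page, "volume", mapping))  # → navigation volume
print(query_second_type_entity(settings, "volume", mapping))  # → system volume
```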
  • The voice interaction method includes: obtaining the intent recognition model by training on intent training data, where the intent training data is related to vehicle parts and the adjustable ranges of the vehicle parts.
  • In this way, the voice interaction method of the present application obtains an intent recognition model by training on the intent training data and performs intent recognition with the model, so the intent of the user's instruction can be accurately recognized.
  • Using the intent recognition model to perform intent recognition on the text to be recognized, and performing voice interaction according to the result of the intent recognition, includes: acquiring the intent discrimination probability of each preset intent from the intent recognition result; and determining the preset intent whose intent discrimination probability is greater than a first probability threshold as the target intent corresponding to the voice request.
  • In this way, the voice interaction method of the present application obtains the intent discrimination probability of each preset intent from the intent recognition result and determines the preset intent whose probability exceeds the first probability threshold as the target intent corresponding to the voice request, thereby recognizing the precise scale adjustment of vehicle parts that the user needs.
  • The preset intents include at least one of: volume up, volume down, air volume up, air volume down, temperature up, temperature down, map zoom in, map zoom out, screen brighter, screen darker, screen slide up, screen slide down, gauges brighter, gauges dimmer, ambient lights brighter, ambient lights dimmer, seat forward, seat back, seat up, seat down, seat back forward, seat back backward, window up, and window down.
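The threshold selection described above can be sketched as follows; the 0.5 threshold and the intent-probability dictionary are illustrative assumptions, since the patent does not fix concrete values:

```python
def select_target_intent(intent_probs, first_threshold=0.5):
    """Pick the preset intent whose discrimination probability exceeds the
    first probability threshold, as the target intent of the voice request."""
    candidates = {i: p for i, p in intent_probs.items() if p > first_threshold}
    if not candidates:
        return None  # no intent is confident enough; fall back to dialogue
    return max(candidates, key=candidates.get)

probs = {"volume up": 0.91, "volume down": 0.04, "air volume up": 0.03}
print(select_target_intent(probs))  # → volume up
```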
  • Performing intent recognition on the text to be recognized using the intent recognition model, and performing voice interaction according to the result of the intent recognition, includes: performing precision recognition on the text to be recognized using a precision recognition model, and performing voice interaction according to the result of the intent recognition and the result of the precision recognition.
  • In this way, precision recognition is performed on the text to be recognized according to the precision recognition model, and voice interaction is performed according to both the intent recognition result and the precision recognition result, realizing precise voice interaction according to the precise intent and precision of the user's voice request.
  • The voice interaction method includes: obtaining the precision recognition model by training on precision training data, where the precision training data is related to the vehicle parts, the adjustable ranges of the vehicle parts, and the scale adjustment precision ranges of the vehicle parts.
  • In this way, the scale adjustment precision corresponding to the voice request can be determined.
  • Performing precision recognition on the text to be recognized using the precision recognition model, and performing voice interaction accordingly, includes: obtaining the precision discrimination probability of each of a plurality of preset scale adjustment precision values from the precision recognition result; and determining the preset scale adjustment precision value whose precision discrimination probability is greater than a second probability threshold as the target scale adjustment precision value corresponding to the voice request.
  • In this way, the voice interaction method of the present application obtains the precision discrimination probability of each preset scale adjustment precision value from the precision recognition result and determines the value whose probability exceeds the second probability threshold as the target scale adjustment precision value, which allows for precise scale adjustment.
  • Performing voice interaction according to the result of the intent recognition and the result of the precision recognition includes: fusing the target intent and the target scale adjustment precision value to generate a control instruction, so as to control the corresponding vehicle part.
  • In this way, the present application fuses the target intent and the target scale adjustment precision value into a control instruction to control the corresponding vehicle part, realizing the precise scale adjustment requested by the user's voice.
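A hedged sketch of fusing the two recognition results into one control instruction; the dictionary schema and the "component direction" intent-string convention are assumptions, not the patent's actual instruction format:

```python
def fuse_control_instruction(target_intent, target_precision):
    """Fuse the target intent and the target scale-adjustment precision
    value into a single control instruction for a vehicle component."""
    # Assumes a two-token intent such as "volume up"; multi-word
    # components would need a richer parse.
    component, _, direction = target_intent.partition(" ")
    return {
        "component": component,   # e.g. "volume"
        "direction": direction,   # e.g. "up" / "down"
        "step": target_precision, # e.g. 20 (percent of the scale)
    }

print(fuse_control_instruction("volume up", 20))
```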
  • the present application also provides a voice interaction device.
  • the voice interaction device includes: a voice recognition module, a determination module, a query module, a combination module and a voice interaction module.
  • The speech recognition module is used to perform speech recognition on the voice request for adjusting a preset function of the vehicle to obtain a preliminary recognition text, where the preset function refers to a function that simulates scale adjustment of the operation of a vehicle part.
  • The determination module is used to determine the corresponding first-type entity according to the preliminary recognition text.
  • The query module is used to perform a screen element query according to the first-type entity to obtain a second-type entity, where one first-type entity corresponds to multiple second-type entities.
  • The combination module is used to combine the second-type entity and the preliminary recognition text to generate the text to be recognized.
  • The voice interaction module is used to perform intent recognition on the text to be recognized using the intent recognition model, and perform voice interaction according to the result of the intent recognition.
  • The voice interaction device of the present application creates a mapping relationship between multiple second-type entities and a first-type entity, and combines the real second-type entity retrieved by the screen element query with the preliminary recognition text to generate the text to be recognized, so the user's true intention can be accurately identified.
  • The present application also provides a server.
  • The server includes a processor and a memory, and a computer program is stored in the memory.
  • When the computer program is executed by the processor, the voice interaction method described in any of the above implementations is implemented.
  • The server of the present application establishes the mapping relationship between multiple second-type entities and a first-type entity, and combines the real second-type entity retrieved by the screen element query with the preliminary recognition text to generate the text to be recognized, so the user's true intention can be accurately identified.
  • The present application also provides a non-volatile computer-readable storage medium containing a computer program.
  • When the computer program is executed by one or more processors, the voice interaction method described in any of the above implementations is realized.
  • The computer-readable storage medium of the present application establishes the mapping relationship between multiple second-type entities and a first-type entity, and combines the real second-type entity retrieved by the screen element query with the preliminary recognition text to generate the text to be recognized, so the user's true intention can be accurately identified.
  • FIG. 1 is the first schematic flow chart of the voice interaction method of the present application.
  • FIG. 2 is the first structural schematic diagram of the voice interaction device of the present application.
  • FIG. 3 is the second schematic flow chart of the voice interaction method of the present application.
  • FIG. 4 is the second structural schematic diagram of the voice interaction device of the present application.
  • FIG. 5 is the third schematic flow chart of the voice interaction method of the present application.
  • FIG. 6 is the third structural schematic diagram of the voice interaction device of the present application.
  • FIG. 7 is the fourth schematic flow chart of the voice interaction method of the present application.
  • FIG. 8 is a schematic structural diagram of a first determination unit in the voice interaction device of the present application.
  • FIG. 9 is the fifth schematic flow chart of the voice interaction method of the present application.
  • FIG. 10 is the fourth structural schematic diagram of the voice interaction device of the present application.
  • FIG. 11 is the sixth schematic flow chart of the voice interaction method of the present application.
  • FIG. 12 is the seventh schematic flow chart of the voice interaction method of the present application.
  • FIG. 13 is the eighth schematic flow chart of the voice interaction method of the present application.
  • FIG. 14 is the fifth structural schematic diagram of the voice interaction device of the present application.
  • FIG. 15 is the ninth schematic flow chart of the voice interaction method of the present application.
  • FIG. 16 is the first structural schematic diagram of the voice interaction module in the voice interaction device of the present application.
  • FIG. 17 is the tenth schematic flow chart of the voice interaction method of the present application.
  • FIG. 18 is the second structural schematic diagram of the voice interaction module in the voice interaction device of the present application.
  • FIG. 19 is the eleventh schematic flow chart of the voice interaction method of the present application.
  • FIG. 20 is the twelfth schematic flow chart of the voice interaction method of the present application.
  • FIG. 21 is a schematic structural diagram of the precision recognition unit in the voice interaction module of the present application.
  • FIG. 22 is the thirteenth schematic flow chart of the voice interaction method of the present application.
  • FIG. 23 is the third structural schematic diagram of the voice interaction module in the voice interaction device of the present application.
  • FIG. 24 is a schematic structural diagram of the server of the present application.
  • FIG. 25 is a schematic structural diagram of the computer-readable storage medium of the present application.
  • Terms such as "first", "second", and "third" may be used in this application to describe various information, but such information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another.
  • For example, without departing from the scope of this application, first information may also be called second information, and similarly, second information may also be called first information.
  • A feature defined as "first" or "second" may explicitly or implicitly include one or more of such features.
  • "Plurality" means two or more, unless otherwise specifically defined.
  • the voice interaction method includes:
  • the preset function refers to the function of simulating the scale adjustment of the operation of vehicle parts
  • the present application also provides a voice interaction device 10 .
  • the voice interaction device 10 includes: a voice recognition module 11 , a first-type entity determination module 12 , a second-type entity determination module 13 , a combination module 14 , and a voice interaction module 15 .
  • Step 01 can be realized by the speech recognition module 11, step 02 by the first-type entity determination module 12, step 03 by the second-type entity determination module 13, step 04 by the combination module 14, and step 05 by the voice interaction module 15.
  • The voice recognition module 11 is used to perform voice recognition on the voice request for adjusting a preset function of the vehicle to obtain a preliminary recognition text, where the preset function refers to a function that simulates scale adjustment of the operation of a vehicle part.
  • The first-type entity determination module 12 is used to determine the corresponding first-type entity according to the preliminary recognition text.
  • the second-type entity determination module 13 is used to perform screen element query according to the first-type entity to obtain the second-type entity, and one first-type entity corresponds to multiple second-type entities.
  • the combination module 14 is used to combine the second class entity and the preliminary recognition text to generate the text to be recognized;
  • the voice interaction module 15 is used to use the intent recognition model to perform intent recognition on the text to be recognized, and perform voice interaction according to the result of the intent recognition.
  • The voice request for adjusting a preset function of the vehicle may be, for example, "screen bright bright", "volume big big", "air volume big big", or "seat back back", that is, a voice request with reduplicated, shortened words.
  • The preset function refers to a function that simulates scale adjustment of the operation of vehicle parts, where a vehicle part may be a physical part such as a mechanical knob or button, that is, a vehicle part whose scale can be adjusted.
  • That is, for example, voice recognition is performed on the voice request "screen bright bright" input by the user for adjusting a preset function of the vehicle, and the preliminary recognition text obtained is "screen bright bright".
  • The text obtained after speech recognition may not be clear and accurate due to limitations of the vehicle hardware, network instability, or colloquial or dialect-based user expressions.
  • Some routine text error correction is therefore performed, such as near-synonym or synonym correction for certain expressions and removal of meaningless words. For example, a misrecognized "the volume deep deep" is corrected to "the volume big big", and meaningless words such as "ah" and "please" are removed.
  • The corresponding first-type entity is a general term for the various adjustment nouns included in the history of the user's voice requests, such as "volume" and "screen". Understandably, "volume" covers adjustments such as "navigation volume", "system volume", "media volume", and "little P sound", while "screen" covers adjustments of the "big screen", "instrument", "center console", and other screens. Entities with specific entity names are second-type entities.
  • The screen element query is performed according to the first-type entity to obtain the second-type entity, where one first-type entity corresponds to multiple second-type entities.
  • For example, when the first-type entity is "volume", the corresponding second-type entities include "navigation volume", "system volume", "media volume", "little P sound", and so on.
  • The screen element query refers to establishing query elements corresponding to the real-time vehicle state, for example, the identification of air volume, volume, and the like on the current screen page.
  • The voice interaction method of the present application first queries whether the screen is currently on the adjustment page of a specific second-type entity. If it is (in one example, the current interface is a navigation interface, so the second-type entity is the navigation volume), the volume entity is directly inherited from the page.
  • The user instruction can then combine "navigation volume" with the preliminary recognition text "the volume is very large" into the text to be recognized "navigation volume is very large"; the correct intent recognition process is performed on this text, and voice interaction is performed according to the corresponding intent recognition result.
  • The screen element query also includes retrieving the controls on the current page and their corresponding names. For example, on the system settings page, the name of the current main page, "system settings", is located, the control name is detected as "volume", and "system volume" is assembled from the main page name and the specific control name. The user instruction can then combine "system volume" with the preliminary recognition text "the volume is very large" into the text to be recognized "system volume is very large"; the correct intent recognition process is performed on this text, and voice interaction is performed according to the corresponding intent recognition result.
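The assembly just described, main page name plus control name forming the second-type entity, which is then prepended to the preliminary recognition text, can be sketched as follows; the English glosses are illustrative stand-ins for the actual page and control names:

```python
def assemble_text_to_recognize(main_page, control_name, preliminary_text):
    """Assemble the second-type entity from the main-page name and control
    name retrieved by the screen-element query, then combine it with the
    preliminary recognition text into the text to be recognized."""
    second_type_entity = f"{main_page} {control_name}"  # e.g. "system volume"
    return f"{second_type_entity} {preliminary_text}"

print(assemble_text_to_recognize("system", "volume", "is very large"))
# → system volume is very large
```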
  • step 02 includes:
  • The first-type entity determination module 12 includes a reduplicated word extraction unit 121 and a first determination unit 122.
  • Step 021 can be implemented by the reduplicated word extraction unit 121, and step 022 can be implemented by the first determination unit 122. That is, the reduplicated word extraction unit 121 is used to perform reduplicated word extraction on the preliminary recognition text by character string matching or regular-expression search to obtain the preset text word, and the first determination unit 122 is used to determine the first-type entity of the preliminary recognition text according to the preset text word.
  • Reduplicated word extraction is performed on the user's preliminary recognition text, and can be implemented by character string matching or regular-expression search.
  • The preset text words refer to reduplicated words in the preliminary recognition text.
  • voice interaction methods include:
  • the voice interaction device 10 includes a first relationship table establishing module 101 .
  • Step 001 can be implemented by the first relationship table establishing module 101. That is, the first relationship table establishing module 101 is used to establish a first mapping relationship table between changeable verbs and first-type entities, where one changeable verb corresponds to multiple first-type entities.
  • Changeable verbs are verbs describing changes in size, height, front-and-back position, and so on.
  • One changeable verb corresponds to multiple first-type entities.
  • For example, if the changeable verb is "big", the corresponding first-type entity is "volume" or "air volume".
  • If the changeable verb is "front", the corresponding first-type entity is "chair back".
  • If the changeable verb is "high", the corresponding first-type entity is "temperature" or "seat".
  • There may be one or more changeable verbs, which is not limited here.
  • The first mapping relationship table between changeable verbs and first-type entities, that is, the mapping relationship between changeable verbs and entities, can be as shown below.
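Since the table itself is not reproduced in this extract, here is a hedged sketch of what such a first mapping relationship table could look like, built only from the example entries above:

```python
# A sketch of the first mapping relationship table: each changeable verb
# maps to several first-type entities (entries mirror the examples above).
FIRST_MAPPING = {
    "big": ["volume", "air volume"],
    "front": ["chair back"],
    "high": ["temperature", "seat"],
}

def first_type_entities(changeable_verb):
    """Look up the first-type entities mapped to a changeable verb."""
    return FIRST_MAPPING.get(changeable_verb, [])

print(first_type_entities("big"))  # → ['volume', 'air volume']
```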
  • step 022 comprises:
  • 0222 Determine the first type of entity according to the variable verb and the first mapping relationship table.
  • The first determination unit 122 includes a first determining subunit 1221 and a second determining subunit 1222.
  • Step 0221 can be implemented by the first determining subunit 1221, and step 0222 can be implemented by the second determining subunit 1222. That is, the first determining subunit 1221 is used to normalize the preset text word to determine the changeable verb corresponding to it, and the second determining subunit 1222 is used to determine the first-type entity according to the changeable verb and the first mapping relationship table.
  • The extracted preset text word is normalized and then retrieved from the mapping relationship table between changeable verbs and entities.
  • For example, the changeable verb obtained after normalization of the preset text word "big big" is "big", and according to the mapping relationship table, the first-type entity corresponding to the normalized word "big" is "volume" or "air volume".
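A minimal sketch of normalizing a reduplicated preset text word back to its changeable verb; the whitespace-separated representation is an assumption for illustration, and Chinese reduplication such as 大大 would be handled at the character level instead:

```python
def normalize_reduplication(preset_word):
    """Normalize a reduplicated preset text word (e.g. "big big")
    to its underlying changeable verb ("big")."""
    parts = preset_word.split()
    if len(parts) == 2 and parts[0] == parts[1]:
        return parts[0]
    return preset_word  # not reduplicated; leave unchanged

print(normalize_reduplication("big big"))  # → big
```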
  • voice interaction methods include:
  • the voice interaction device 10 includes a second relationship table establishing module 102 .
  • Step 002 can be implemented by the second relationship table establishing module 102. That is, the second relationship table establishing module 102 is used to establish a second mapping relationship table between first-type entities and second-type entities.
  • In the second mapping relationship table between first-type entities and second-type entities, for example, the second-type entities corresponding to the first-type entity "volume" are "navigation volume", "system volume", "media volume", "little P sound", and so on.
  • The second-type entities corresponding to the first-type entity "screen" are "big screen", "instrument", "center console", and so on.
  • Second-type entities can be added or removed according to the hardware the vehicle actually has.
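A hedged sketch of the second mapping relationship table, using only the entries listed above; a real table would follow the vehicle's actual hardware configuration:

```python
# Entries mirror the examples above; add or remove according to the
# hardware the vehicle actually has.
SECOND_MAPPING = {
    "volume": ["navigation volume", "system volume", "media volume", "little P sound"],
    "screen": ["big screen", "instrument", "center console"],
}

def second_type_entities(first_type_entity):
    """Look up the second-type entities mapped to a first-type entity."""
    return SECOND_MAPPING.get(first_type_entity, [])

print(second_type_entities("screen"))  # → ['big screen', 'instrument', 'center console']
```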
  • step 03 includes:
  • Step 031 can be realized by the second-type entity determination module 13, which is used to determine the second-type entity according to the non-expandable page, the first-type entity, and the second mapping relationship table when the current page of the screen is a non-expandable page.
  • A non-expandable page refers to an interface that does not contain a pop-up control interface but may contain draggable elements, such as the system volume adjustment interface.
  • The second-type entity can then be determined directly from the non-expandable page, the first-type entity, and the second mapping relationship table; judging first whether the current page is a non-expandable page improves the efficiency of determining the second-type entity.
  • The voice interaction method of the present application first judges whether the current page of the screen is a non-expandable page, and determines the second-type entity when it is. It does not need to first recognize whether the control mentioned in the user's voice is on the screen of an expandable page, then open that control and perform the voice-requested operation on it; that is, the voice interaction method of the present application is simpler.
  • Step 03 includes:
  • 033: Determine the second-type entity according to the main page name, the control name, and the second mapping relationship table.
  • Steps 032 and 033 can be implemented by the second-type entity determination module 13, which, when the current page is an expandable page, obtains the main page name and control name of the expandable page and determines the second-type entity according to the main page name, the control name, and the second mapping relationship table.
  • An expandable page is an interface containing pop-up controls, such as the system settings interface, which contains a volume adjustment control that can be opened.
  • This application only reads the possible control names on the current page and the main page name; it does not need to create executable-node script commands for the controls, because the command issued for a voice request is implemented by assembling a script from "control name" + "reduplicated word".
  • Not every vehicle function can, or needs to, be adjusted on a precise scale. For example, the seat can be moved in various directions via vehicle components, whereas the car door has no vehicle components such as knobs or buttons for scale adjustment and is usually only opened and closed through the door handle. Seat adjustment therefore falls within the control range of vehicle components, while door adjustment falls outside it.
  • First, the scale-adjustable vehicle components are determined, such as the "volume knob", "screen brightness button", "air-conditioning airflow knob/button", and "seat adjustment knob/button".
  • The control range of vehicle components may then include the car audio, in-vehicle screens, vehicle air conditioner, vehicle seats, interior ambient lights, exterior lights, or windows, etc.
  • The non-control range of vehicle components may include the doors, rearview mirrors, trunk, etc.
  • A voice prompt can be given when a voice request targets the non-control range of vehicle components.
  • In this way, the control range of vehicle components, that is, the range that can be scale-adjusted through voice interaction, can be determined.
  • The adjustable range of a vehicle component corresponds to the scale range adjusted by operating that component.
  • The adjustable range can be a set of gear positions or a continuous range. For example, if pressing the screen brightness button five times in succession steps the brightness through gears 1 to 5 up to maximum brightness, the adjustable range of that button is gears 1 to 5. As another example, if the knob that moves the seat forward and backward has a total of 90 scale values, the adjustable range of the seat adjustment knob is scale values 1 to 90.
  • The control range of vehicle components and the adjustable range of each component are then mapped into an intent system that the intent recognition model can understand.
  • A corresponding preset intent is defined for each object in the control range of vehicle components together with its adjustable range.
  • For example, system volume up represents the preset intent "volume up",
  • and system volume down represents the preset intent "volume down". A specific intent mapping system is thus defined for the component control range and the adjustable ranges of the vehicle components.
  • As for the preset scale adjustment precision: for example, if voice interaction simulating the operation of vehicle components adjusts the volume by 3 scale values at a time out of a total of 60, the preset scale adjustment precision range can be 1 to 20.
  • Likewise, if voice interaction simulating the operation of vehicle components moves the seat 18 scale values at a time out of a total of 90, the preset scale adjustment precision range is 1 to 5.
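The arithmetic behind these two examples can be written out directly (a sketch; the totals and step sizes are the ones quoted above, the function name is ours):

```python
def preset_precision_range(total_scale: int, step_per_adjustment: int) -> range:
    """Preset scale adjustment precision range: with 60 total scale values and
    3 per adjustment the range is 1..20; with 90 and 18 it is 1..5."""
    return range(1, total_scale // step_per_adjustment + 1)
```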
  • The voice interaction method includes:
  • The intent recognition model is obtained by training on intent training data.
  • The intent training data relates to the vehicle components and their adjustable ranges.
  • The voice interaction device 10 includes an intent recognition model acquisition module 103.
  • Step 003 can be implemented by the intent recognition model acquisition module 103. That is, the intent recognition model acquisition module 103 obtains the intent recognition model by training on the intent training data, which relates to the vehicle components and their adjustable ranges.
  • Through machine learning, the intent recognition model is trained on data corresponding to the scale-adjustable vehicle components and their adjustable ranges, and then performs intent recognition on the rewritten voice request of the current turn to accurately identify the user's intent.
  • Model training can use models such as BERT, ALBERT, XLNet, and RoBERTa.
  • The intent training data relates to the scale-adjustable vehicle components and their adjustable ranges.
  • Vehicle components are the parts of the smart car that can be scale-adjusted, such as the "volume knob", "screen brightness button", "air-conditioning airflow knob/button", "seat adjustment knob/button", and so on.
  • The adjustable range of a vehicle component corresponds to the scale range adjusted by operating that component.
  • The adjustable range can be a set of gear positions or a continuous range.
  • Intent training data can be collected, with the relevant user permissions, from a certain number of historical user voice requests. The collected requests are lightly filtered to keep those with clear semantics and a specific purpose: requests with obvious semantic ambiguity, and short requests containing only modal particles such as "ah" and "oh", are removed, leaving voice requests with clear semantics and a specific purpose.
  • For example, if the voice request is "brighten the screen",
  • the corresponding intent can be labeled "brighten the screen".
  • The labeled data is then quality-inspected again to filter out labels that do not match any preset intent, leaving data usable for training the intent model.
  • For example, if the voice request is "open the car door",
  • the labeled intent is "open the car door";
  • since no scale-adjustable component is used to adjust the car door,
  • this voice request is removed by the filter.
  • The labeled data usable for training the intent model serves as the intent training data and is divided into an intent training set and an intent validation set.
  • The split ratio can be set as required and is not limited here.
  • For example, the intent training set is 80%
  • and the intent validation set is 20%.
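The 80/20 split described above can be sketched as follows (the shuffle seed and the shape of the labeled pairs are assumptions):

```python
import random

def split_intent_data(labeled_pairs, train_fraction=0.8, seed=0):
    """Shuffle labeled (voice_request, intent) pairs and split them into an
    intent training set and an intent validation set."""
    data = list(labeled_pairs)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]
```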
  • Model training can use models such as BERT, ALBERT, XLNet, and RoBERTa.
  • For the established intent recognition model, at least part of the data in the intent training set is used for training, and at least part of the data in the intent validation set is then used to verify the accuracy of the trained model.
  • If the validation accuracy does not reach the intent accuracy threshold,
  • the model is trained again on another part of the intent training set and its accuracy verified again on another part of the intent validation set. This training-and-validation process repeats until the validation accuracy reaches the intent accuracy threshold, at which point the intent recognition model is considered up to standard and its training is complete.
  • Each item in the intent training and validation sets is used only once. If the intent recognition model fails to reach the training standard after traversing all the data in both sets, more voice requests can be collected, again with user permission, to screen and label additional intent training data, ensuring the model can accurately recognize the intent of an input voice request.
  • The intent recognition model can be trained offline; after the offline-trained model is deployed to the server or vehicle, either can use it to perform intent recognition on received voice requests.
  • Step 05 includes:
  • The voice interaction module 15 includes a first acquisition unit 151 and an intent determination unit 152.
  • Step 051 can be implemented by the first acquisition unit 151
  • and step 052 by the intent determination unit 152. That is, the first acquisition unit 151 obtains, from the intent recognition result, the intent discrimination probability for each preset intent, and the intent determination unit 152 determines the preset intent whose discrimination probability exceeds the first probability threshold as the target intent of the voice request.
  • The intent recognition result includes the probability that the text to be recognized matches each preset intent, yielding multiple intent discrimination probabilities. If the first probability threshold is 0.9 and the discrimination probability of a certain category of preset intent exceeds 0.9, the server takes that category of preset intent as the target intent of the current user's voice request.
  • The first probability threshold may also take other values.
  • The first probability threshold may be a default value or set according to user needs, and no limitation is imposed here.
  • The preset intents of this application may include at least one of: volume up, volume down, airflow up, airflow down, temperature up, temperature down, map zoom in, map zoom out, screen brighter, screen darker, screen slide up, screen slide down, instrument cluster brighter, instrument cluster darker, ambient light brighter, ambient light darker, seat forward, seat backward, seat up, seat down, seatback forward, seatback backward, window up, and window down.
  • Step 05 also includes:
  • Step 053 can be implemented by the intent determination unit 152. That is, when no preset intent's discrimination probability exceeds the first probability threshold, the intent determination unit 152 determines that the intent of the voice request is a non-scale-adjustment intent.
  • When the discrimination probabilities of all categories of preset intents are at or below the first probability threshold (for example, 0.9), the probability that the user's voice request matches any category of preset intent is relatively low.
  • A non-scale-adjustment intent is one in which the user does not use a scale-adjustable vehicle component to adjust a preset vehicle function. For example, if the voice request is "open the door", the door cannot be adjusted by a scaled vehicle component, so "open the door" carries a non-scale-adjustment intent.
  • Step 05 includes:
  • The voice interaction module 15 includes a precision recognition unit 153.
  • Step 054 can be implemented by the precision recognition unit 153. That is, the precision recognition unit 153 uses the precision recognition model to perform precision recognition on the text to be recognized and conducts voice interaction according to both the intent recognition result and the precision recognition result.
  • Through machine learning, this application obtains a precision recognition model from training data corresponding to the scale-adjustable vehicle components, their adjustable ranges, and their scale adjustment precision ranges, and then performs precision recognition on voice requests to accurately identify the user's desired scale adjustment precision.
  • Step 054 comprises:
  • 0541: The precision recognition model is obtained by training on precision training data, which relates to the vehicle components, their adjustable ranges, and their scale adjustment precision ranges.
  • Step 0541 can be implemented by the precision recognition unit 153. That is, the precision recognition unit 153 obtains the precision recognition model by training on the precision training data, which relates to the vehicle components, their adjustable ranges, and their scale adjustment precision ranges.
  • The precision recognition model can be pre-trained on the precision training data to perform precision recognition on the text to be recognized, thereby identifying the adjustment precision for a given vehicle component, obtaining the precision recognition result, and finally determining the target scale adjustment precision value.
  • The precision training data relates to the scale-adjustable vehicle components and their adjustable ranges, meaning it covers all scale-adjustable components in the vehicle, such as the "volume knob", "screen brightness button", "air-conditioning airflow knob/button", "seat adjustment knob/button", and so on.
  • The adjustable range of a vehicle component corresponds to the scale range adjusted by operating that component.
  • The adjustable range can be a set of gear positions or a continuous range,
  • and the scale adjustment precision range can be the scale value of each adjustment.
  • Precision training data can likewise be collected, with the relevant user permissions, from a certain number of historical user voice requests. The collected requests are lightly filtered to keep those with clear semantics and a specific purpose: obviously ambiguous requests, and short requests containing only modal particles such as "ah" and "oh", are removed during filtering.
  • The history of user voice requests acquired for precision training can be the same as that acquired for intent training, and the step of filtering the collected requests for precision training can be the same as the filtering step for intent training.
  • For example, the labeled scale adjustment precision value for adjusting the brightness of the in-vehicle screen is 3.
  • A precision recognition model is established based on slot extraction; usable slot extraction algorithms include RNN slot filling, CRF, and others. The labeled data serves as precision training data and is divided into a precision training set and a precision validation set.
  • The split ratio can be set as required and is not limited here; for example, the precision training set is 80% and the precision validation set is 20%. The data in the precision training set is used to train the precision recognition model.
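The passage names RNN slot filling and CRF as candidate algorithms. As a toy stand-in only, a dictionary-based BIO tagger illustrates the kind of output slot extraction produces (the lexicon and tag names are assumptions, not the patent's scheme):

```python
def bio_tag(tokens: list[str], slot_lexicon: dict[str, str]) -> list[str]:
    """Toy BIO tagger: a token found in the lexicon is tagged B-<slot name>,
    everything else O. A real system would learn these tags with RNN slot
    filling or a CRF rather than a lexicon lookup."""
    return [f"B-{slot_lexicon[tok]}" if tok in slot_lexicon else "O"
            for tok in tokens]
```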
  • For the established precision recognition model, at least part of the data in the precision training set is used for training, and at least part of the data in the precision validation set is then used to verify the accuracy of the trained model.
  • If the validation accuracy does not reach the precision accuracy threshold, the model is trained again on another part of the precision training set and its accuracy verified again on another part of the precision validation set.
  • This training-and-validation process repeats until the validation accuracy reaches the precision accuracy threshold, at which point the precision recognition model is considered up to standard and its training is complete.
  • Each item in the precision training and validation sets is used only once.
  • If the precision recognition model fails to reach the training standard after traversing all the data in the precision training and validation sets, more voice data can be collected, again with user permission, to screen and label additional precision training data, ensuring the model can accurately recognize the scale adjustment precision of an input voice request.
  • Step 054 comprises:
  • The precision recognition unit 153 includes a second acquisition unit 1532 and a precision determination unit 1533. That is, the second acquisition unit 1532 obtains, from the precision recognition result, the precision discrimination probability for each of multiple preset scale adjustment precision values, and the precision determination unit 1533 determines the preset scale adjustment precision value whose discrimination probability exceeds the second probability threshold as the target scale adjustment precision value of the voice request.
  • The precision discrimination probability is the probability that the recognized precision of the voice request matches each preset scale adjustment precision value.
  • The second probability threshold may be, for example, 0.7, 0.8, 0.9, or another value, which is not limited here.
  • For example, if the precision discrimination probability is 1 and the second probability threshold is 0.9, the probability exceeds the threshold, and the target scale adjustment precision value for volume adjustment corresponding to a voice request with five repeated degree words (e.g. "volume louder louder louder louder louder") is determined to be 5.
  • Step 054 also includes:
  • Step 0544 can be implemented by the precision determination unit 1533. That is, when no preset scale adjustment precision value's discrimination probability exceeds the second probability threshold, the precision determination unit 1533 determines that precision recognition of the voice request has failed.
  • Step 05 includes:
  • The voice interaction module 15 includes an instruction generation unit 154.
  • Step 055 can be implemented by the instruction generation unit 154. That is, the instruction generation unit 154 fuses the target intent and the target scale adjustment precision value to generate a control instruction for controlling the corresponding vehicle component.
  • Fusing them into a control instruction yields control information that combines intent and precision, so that the corresponding vehicle component can be accurately controlled according to the user's abbreviated voice command, realizing the user's true intention.
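The fusion step can be sketched as follows (the "component direction" format of intent labels, the class name, and the function name are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class ControlInstruction:
    """Control instruction fusing the target intent with the target
    scale adjustment precision value."""
    component: str
    direction: str
    steps: int

def fuse_instruction(target_intent: str, target_precision: int) -> ControlInstruction:
    """Split an intent label such as 'volume up' into component and direction,
    then attach the precision value as the number of scale steps to move."""
    component, direction = target_intent.rsplit(" ", 1)
    return ControlInstruction(component, direction, target_precision)
```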
  • The present application also provides a server 20.
  • The server 20 includes a processor 21 and a memory 22.
  • A computer program 221 is stored in the memory 22.
  • When the computer program 221 is executed by the processor 21, the voice interaction method of any of the above embodiments is implemented.
  • The server 20 may be installed inside the vehicle or connected to the vehicle, which is not limited here.
  • The server 20 of the present application establishes the mapping relationships between multiple second-type entities and first-type entities and, combining them with the real second-type entity retrieved from screen elements, combines the second-type entity with the preliminary recognition text to generate the text to be recognized, allowing the user's true intention to be accurately identified.
  • The present application also provides a non-volatile computer-readable storage medium 30 containing a computer program 31.
  • When the computer program 31 is executed by one or more processors 40, the voice interaction method of any of the above embodiments is implemented.
  • The preset function refers to the function of simulating scale adjustment via the operation of vehicle components.
  • The computer program 31 includes computer program code.
  • The computer program code may be in source code form, object code form, an executable file, some intermediate form, etc.
  • The computer-readable storage medium may include any entity or device capable of carrying computer program code: a recording medium, USB flash drive, removable hard disk, magnetic disk, optical disc, computer memory, read-only memory (ROM), random access memory (RAM), software distribution media, and so on.
  • By combining the second-type entity with the preliminary recognition text to generate the text to be recognized, the storage medium allows the user's true intention to be accurately identified.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice interaction method and apparatus, server, and readable storage medium. The method includes: performing speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, where the preset function refers to the function of simulating scale adjustment via the operation of vehicle components (01); determining the corresponding first-type entity from the preliminary recognition text (02); performing a screen-element query based on the first-type entity to obtain a second-type entity, where one first-type entity corresponds to multiple second-type entities (03); combining the second-type entity with the preliminary recognition text to generate a text to be recognized (04); and performing intent recognition on the text to be recognized using an intent recognition model and conducting voice interaction according to the intent recognition result (05). By combining the second-type entity with the preliminary recognition text to generate the text to be recognized, the method can accurately identify the user's true intention.

Description

Voice interaction method and apparatus, server, and readable storage medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on December 28, 2021, with application number 202111617605.1 and the title "Voice interaction method and apparatus, server, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech technology, and in particular to a voice interaction method and apparatus, a server, and a readable storage medium.
Background
Current in-vehicle voice interaction systems cannot precisely recognize the user's intent during vehicle control; that is, the system cannot correctly identify the user's real need and must issue a TTS (text-to-speech) reply to guide the user into a second round of dialogue. The TTS broadcast is lengthy, the user needs multiple rounds of interaction, and the user experience deteriorates. For example, when the in-vehicle screen is already on some related page and the user says "volume louder louder louder" (音量大大大), they may not want the default "media volume" raised; it may be some other volume. On the navigation page, "volume louder louder louder" means the user wants the navigation volume raised, whereas on the system settings interface it means the system volume. In most cases, users do not deliberately distinguish system volume, media volume, navigation volume, and so on; they simply want the volume raised in the current scenario. Under current schemes, when multiple volume entities exist, the system by default guides the user, saying things like "Which volume exactly do you want to raise?" or "You could try saying it this way" to elicit a request it can match. Guiding the user with this traditional logic therefore fails to capture the user's intent accurately and worsens the user experience.
Summary of the Invention
To solve or partially solve the problems in the related art, this application provides a voice interaction method and apparatus, a server, and a readable storage medium that can accurately identify the user's true intention and improve the user experience.
This application provides a voice interaction method. The voice interaction method includes: performing speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, where the preset function refers to the function of simulating scale adjustment via the operation of vehicle components; determining the corresponding first-type entity from the preliminary recognition text; performing a screen-element query based on the first-type entity to obtain a second-type entity, where one first-type entity corresponds to multiple second-type entities; combining the second-type entity with the preliminary recognition text to generate a text to be recognized; and performing intent recognition on the text to be recognized using an intent recognition model and conducting voice interaction according to the intent recognition result.
In this way, by establishing the mapping relationships between multiple second-type entities and first-type entities, combining them with the real second-type entity retrieved from screen elements, and combining the second-type entity with the preliminary recognition text to generate the text to be recognized, the voice interaction method of this application can accurately identify the user's true intention.
Determining the corresponding first-type entity from the preliminary recognition text includes: performing reduplicated-word extraction on the preliminary recognition text to obtain a preset text word; and determining the first-type entity of the preliminary recognition text according to the preset text word.
In this way, the preset text word can be obtained by extracting reduplicated words from the user's preliminary recognition text, and the first-type entity then determined from the preset text word. Extracting the preset text word quickly narrows the search to the range of first-type entities, laying the foundation for subsequently determining the second-type entity from the first-type entity.
Performing reduplicated-word extraction on the preliminary recognition text to obtain the preset text word includes: extracting reduplicated words from the preliminary recognition text by string matching or regular-expression search to obtain the preset text word.
In this way, different approaches can be used flexibly to extract reduplicated words from the preliminary recognition text to obtain the preset text word.
The voice interaction method includes: establishing a first mapping relationship table between variable verbs and the first-type entities, where one variable verb corresponds to multiple first-type entities.
In this way, by establishing the first mapping relationship table between variable verbs and first-type entities, the voice interaction method of this application can determine the first-type entity from that table, laying the foundation for accurate recognition of the user's intent.
Determining the first-type entity of the preliminary recognition text according to the preset text word includes: normalizing the preset text word to determine the variable verb corresponding to the preset text word; and determining the first-type entity according to the variable verb and the first mapping relationship table.
In this way, normalizing the preset text word accurately determines its corresponding variable verb, so that the first-type entity can be accurately determined from the variable verb and the first mapping relationship table.
The voice interaction method includes: establishing a second mapping relationship table between the first-type entities and the second-type entities.
In this way, establishing the second mapping relationship table facilitates subsequently determining second-type entities from first-type entities and improves voice interaction efficiency.
Performing a screen-element query based on the first-type entity to obtain the second-type entity includes: when the current page of the screen is a non-expandable page, determining the second-type entity according to the non-expandable page, the first-type entity, and the second mapping relationship table.
In this way, the voice interaction method of this application first checks that the current page of the screen is a non-expandable page; since a non-expandable page contains no pop-up control interface, the second-type entity can be determined directly from the non-expandable page, the first-type entity, and the second mapping relationship table, improving determination efficiency. A non-expandable page is an interface without pop-up controls that may contain draggable elements, such as the system volume adjustment interface.
Performing a screen-element query based on the first-type entity to obtain the second-type entity includes: when the current page is an expandable page, obtaining the main page name and control name of the expandable page; and determining the second-type entity according to the main page name, the control name, and the second mapping relationship table.
In this way, when the current page is an expandable page, the voice interaction method of this application can obtain the main page name and control name of the expandable page and determine the second-type entity from them and the second mapping relationship table. That is, the method only reads the possible control names and the current page name and does not create executable-node script commands for the controls, so the operation is simple. An expandable page is an interface containing pop-up controls, such as the system settings interface, which contains a volume adjustment control that can be opened.
The voice interaction method includes: obtaining the intent recognition model by training on intent training data, where the intent training data relates to the vehicle components and the adjustable ranges of the vehicle components.
In this way, the voice interaction method of this application can obtain the intent recognition model by training on intent training data; performing intent recognition with this model enables accurate recognition of the intent of the user's instruction.
Performing intent recognition on the text to be recognized using the intent recognition model and conducting voice interaction according to the intent recognition result includes: obtaining the intent discrimination probability of the intent recognition result for each preset intent; and determining the one preset intent whose intent discrimination probability exceeds the first probability threshold as the target intent corresponding to the voice request.
In this way, the voice interaction method of this application can obtain the intent discrimination probability for each preset intent and determine the preset intent whose probability exceeds the first probability threshold as the target intent of the voice request, thereby recognizing the user's need for precise scale adjustment of vehicle components.
The preset intents include at least one of: volume up, volume down, airflow up, airflow down, temperature up, temperature down, map zoom in, map zoom out, screen brighter, screen darker, screen slide up, screen slide down, instrument cluster brighter, instrument cluster darker, ambient light brighter, ambient light darker, seat forward, seat backward, seat up, seat down, seatback forward, seatback backward, window up, and window down.
In this way, setting multiple preset intents further lays the foundation for recognizing the user's voice interaction intent and covers the voice interaction scenarios that may be encountered.
Performing intent recognition on the text to be recognized using the intent recognition model and conducting voice interaction according to the intent recognition result includes: performing precision recognition on the text to be recognized using a precision recognition model, and conducting voice interaction according to the intent recognition result and the precision recognition result.
In this way, precision recognition is performed on the text to be recognized with the precision recognition model, and voice interaction is conducted according to both the intent recognition result and the precision recognition result, so that precise voice interaction is achieved from the accurately recognized intent and precision of the user's voice request.
The voice interaction method includes: obtaining the precision recognition model by training on precision training data, where the precision training data relates to the vehicle components, the adjustable ranges of the vehicle components, and the scale adjustment precision ranges of the vehicle components.
In this way, performing precision recognition on the text to be recognized with the precision recognition model determines the scale adjustment precision corresponding to the voice request.
Performing precision recognition on the text to be recognized using the precision recognition model and conducting voice interaction according to the recognition result includes: obtaining the precision discrimination probabilities of the precision recognition result for multiple preset scale adjustment precision values; and determining the one preset scale adjustment precision value whose precision discrimination probability exceeds the second probability threshold as the target scale adjustment precision value corresponding to the voice request.
In this way, the voice interaction method of this application can obtain the precision discrimination probability for each preset scale adjustment precision value and determine the one whose probability exceeds the second probability threshold as the target scale adjustment precision value, so as to perform precise scale adjustment.
Conducting voice interaction according to the intent recognition result and the precision recognition result includes: fusing the target intent and the target scale adjustment precision value to generate a control instruction, so as to control the corresponding vehicle component.
In this way, this application fuses the target intent and the target scale adjustment precision value to generate a control instruction that controls the corresponding vehicle component, achieving precise scale adjustment for the user's voice request.
This application also provides a voice interaction apparatus. The voice interaction apparatus includes a speech recognition module, a determination module, a query module, a combination module, and a voice interaction module. The speech recognition module performs speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, the preset function referring to the function of simulating scale adjustment via the operation of vehicle components; the determination module determines the corresponding first-type entity from the preliminary recognition text; the query module performs a screen-element query based on the first-type entity to obtain a second-type entity, one first-type entity corresponding to multiple second-type entities; the combination module combines the second-type entity with the preliminary recognition text to generate a text to be recognized; and the voice interaction module performs intent recognition on the text to be recognized using an intent recognition model and conducts voice interaction according to the intent recognition result.
In this way, by establishing the mapping relationships between multiple second-type entities and first-type entities, combining them with the real second-type entity retrieved from screen elements, and combining the second-type entity with the preliminary recognition text to generate the text to be recognized, the voice interaction apparatus of this application can accurately identify the user's true intention.
This application also provides a server. The server includes a processor and a memory; a computer program is stored in the memory, and when the computer program is executed by the processor, the voice interaction method of any of the above embodiments is implemented.
In this way, by establishing the mapping relationships between multiple second-type entities and first-type entities, combining them with the real second-type entity retrieved from screen elements, and combining the second-type entity with the preliminary recognition text to generate the text to be recognized, the server of this application can accurately identify the user's true intention.
This application also provides a non-volatile computer-readable storage medium containing a computer program. When the computer program is executed by one or more processors, the voice interaction method of any of the above embodiments is implemented.
In this way, by establishing the mapping relationships between multiple second-type entities and first-type entities, combining them with the real second-type entity retrieved from screen elements, and combining the second-type entity with the preliminary recognition text to generate the text to be recognized, the computer-readable storage medium of this application can accurately identify the user's true intention.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and cannot limit this application.
Brief Description of the Drawings
The above and other objects, features, and advantages of this application will become more apparent from the following more detailed description of exemplary embodiments of this application in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is the first schematic flowchart of the voice interaction method of this application;
Fig. 2 is the first schematic structural diagram of the voice interaction apparatus of this application;
Fig. 3 is the second schematic flowchart of the voice interaction method of this application;
Fig. 4 is the second schematic structural diagram of the voice interaction apparatus of this application;
Fig. 5 is the third schematic flowchart of the voice interaction method of this application;
Fig. 6 is the third schematic structural diagram of the voice interaction apparatus of this application;
Fig. 7 is the fourth schematic flowchart of the voice interaction method of this application;
Fig. 8 is a schematic structural diagram of the first determination unit in the voice interaction apparatus of this application;
Fig. 9 is the fifth schematic flowchart of the voice interaction method of this application;
Fig. 10 is the fourth schematic structural diagram of the voice interaction apparatus of this application;
Fig. 11 is the sixth schematic flowchart of the voice interaction method of this application;
Fig. 12 is the seventh schematic flowchart of the voice interaction method of this application;
Fig. 13 is the eighth schematic flowchart of the voice interaction method of this application;
Fig. 14 is the fifth schematic structural diagram of the voice interaction apparatus of this application;
Fig. 15 is the ninth schematic flowchart of the voice interaction method of this application;
Fig. 16 is the first schematic structural diagram of the voice interaction module in the voice interaction apparatus of this application;
Fig. 17 is the tenth schematic flowchart of the voice interaction method of this application;
Fig. 18 is the second schematic structural diagram of the voice interaction module in the voice interaction apparatus of this application;
Fig. 19 is the eleventh schematic flowchart of the voice interaction method of this application;
Fig. 20 is the twelfth schematic flowchart of the voice interaction method of this application;
Fig. 21 is a schematic structural diagram of the precision recognition unit in the voice interaction module of this application;
Fig. 22 is the thirteenth schematic flowchart of the voice interaction method of this application;
Fig. 23 is the third schematic structural diagram of the voice interaction module in the voice interaction apparatus of this application;
Fig. 24 is a schematic structural diagram of the server of this application;
Fig. 25 is a schematic structural diagram of the computer-readable storage medium of this application.
Detailed Description of the Embodiments
Embodiments of this application will be described in more detail below with reference to the accompanying drawings. Although embodiments of this application are shown in the drawings, it should be understood that this application can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that this application will be more thorough and complete and can fully convey the scope of this application to those skilled in the art.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit this application. The singular forms "a", "said", and "the" used in this application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", etc. may be used in this application to describe various information, this information should not be limited to these terms; the terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be called second information and, similarly, second information may also be called first information. Accordingly, features qualified with "first" or "second" may explicitly or implicitly include one or more of those features. In the description of this application, "multiple" means two or more, unless otherwise specifically and clearly defined.
This application is described in detail below, and examples of this application are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are only used to explain this application, and cannot be construed as limiting this application.
Referring to Fig. 1, this application provides a voice interaction method. The voice interaction method includes:
01: performing speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, where the preset function refers to the function of simulating scale adjustment via the operation of vehicle components;
02: determining the corresponding first-type entity from the preliminary recognition text;
03: performing a screen-element query based on the first-type entity to obtain a second-type entity, where one first-type entity corresponds to multiple second-type entities;
04: combining the second-type entity with the preliminary recognition text to generate a text to be recognized;
05: performing intent recognition on the text to be recognized using an intent recognition model, and conducting voice interaction according to the intent recognition result.
Referring to Fig. 2, this application also provides a voice interaction apparatus 10. The voice interaction apparatus 10 includes a speech recognition module 11, a first-type entity determination module 12, a second-type entity determination module 13, a combination module 14, and a voice interaction module 15.
Step 01 can be implemented by the speech recognition module 11, step 02 by the first-type entity determination module 12, step 03 by the second-type entity determination module 13, step 04 by the combination module 14, and step 05 by the voice interaction module 15. That is, the speech recognition module 11 performs speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, the preset function referring to the function of simulating scale adjustment via the operation of vehicle components; the first-type entity determination module 12 determines the corresponding first-type entity from the preliminary recognition text; the second-type entity determination module 13 performs a screen-element query based on the first-type entity to obtain a second-type entity, one first-type entity corresponding to multiple second-type entities; the combination module 14 combines the second-type entity with the preliminary recognition text to generate a text to be recognized; and the voice interaction module 15 performs intent recognition on the text to be recognized using an intent recognition model and conducts voice interaction according to the intent recognition result.
A voice request for adjusting a preset vehicle function may be, for example, "screen brighter brighter brighter" (屏幕亮亮亮), "volume louder louder louder" (音量大大大), "AC airflow bigger bigger bigger" (空调风量大大大), or "seat back back back" (座椅后后后), that is, a voice request containing abbreviated words. The preset function refers to the function of simulating scale adjustment via the operation of vehicle components; the vehicle components here can be physical parts such as mechanical knobs or buttons, which are the scale-adjustable vehicle components.
First, speech recognition is performed on the voice request to obtain a preliminary recognition text. For example, performing speech recognition on the user's voice request "screen brighter brighter brighter" for adjusting a preset vehicle function yields the preliminary recognition text "screen brighter brighter brighter".
Understandably, in a real interaction environment, the text obtained after speech recognition may be insufficiently clear and accurate, due to vehicle hardware limitations, network instability, or the user's colloquial or dialectal phrasing. Some routine text correction is then needed in preprocessing, such as correcting certain expressions to near-synonyms or synonyms and removing meaningless words. For example, "音量深深深深深" is corrected to "音量增增增增增" (roughly, "volume deep deep..." corrected to "volume up up..."), and meaningless words such as "ah" and "please" are removed.
Then, the corresponding first-type entity is determined from the preliminary recognition text. A first-type entity is the generic name covering the various adjustment nouns, such as "volume" and "screen", contained in the history of user voice requests. Understandably, "volume" covers adjustments of the "navigation volume", "system volume", "media volume", "Little P sound", and so on, and "screen" covers adjustments of the "big screen", "instrument cluster", "center console", and so on. An entity with a concrete entity name is a second-type entity.
Next, a screen-element query is performed based on the first-type entity to obtain the second-type entity; one first-type entity corresponds to multiple second-type entities. For example, the first-type entity "volume" corresponds to multiple second-type entities including "navigation volume", "system volume", "media volume", "Little P sound", etc.
A screen-element query means establishing query elements for the real-time vehicle state, such as the identifiers of airflow, volume, etc. on the current screen page. The voice interaction method of this application first checks whether the screen is currently on the adjustment page of a second-type entity. If it is on the adjustment page of a concrete entity, in one example the current interface is the navigation interface, so the second-type entity is the navigation volume, and the phrase "volume louder louder louder" directly inherits that page entity. The user instruction can then combine "navigation volume" with the preliminary recognition text "volume louder louder louder" (音量大大大) into the text to be recognized "navigation volume louder louder louder" (导航音量大大大), the correct intent recognition process is performed on that text, and voice interaction proceeds according to the corresponding intent recognition result.
The screen-element query also includes retrieving the controls on the current page and their corresponding names. For example, on the system settings page, the current main page name "system settings" is located, the control name "volume" is detected, and the main page name and the concrete control name are assembled into "system volume". The user instruction can then combine "system volume" with the preliminary recognition text "volume louder louder louder" into the text to be recognized "system volume louder louder louder", the correct intent recognition process is performed on that text, and voice interaction proceeds according to the corresponding intent recognition result.
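Step 04's combination of the second-type entity with the preliminary recognition text can be sketched as a simple substitution (the function name is an assumption; the example strings are the ones used in the text):

```python
def generate_text_to_recognize(second_entity: str, preliminary_text: str,
                               first_entity: str) -> str:
    """Replace the generic first-type entity in the preliminary recognition
    text with the concrete second-type entity retrieved from the screen,
    e.g. '音量大大大' -> '导航音量大大大' on the navigation page."""
    return preliminary_text.replace(first_entity, second_entity, 1)
```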
Referring to Fig. 3, step 02 includes:
021: extracting reduplicated words from the preliminary recognition text by string matching or regular-expression search to obtain a preset text word;
022: determining the first-type entity of the preliminary recognition text according to the preset text word.
Referring to Fig. 4, the first-type entity determination module 12 includes a reduplicated-word extraction unit 121 and a first determination unit 122.
Step 021 can be implemented by the reduplicated-word extraction unit 121 and step 022 by the first determination unit 122. That is, the reduplicated-word extraction unit 121 extracts reduplicated words from the preliminary recognition text by string matching or regular-expression search to obtain the preset text word, and the first determination unit 122 determines the first-type entity of the preliminary recognition text according to the preset text word.
Reduplicated words are first extracted from the user's preliminary recognition text, by string matching or regular-expression search. For example, extracting reduplicated words from the preliminary recognition text "音量大大大" ("volume louder louder louder") yields the preset text word "大大大". In other words, the preset text word is the reduplicated word in the preliminary recognition text.
Referring to Fig. 5, the voice interaction method includes:
001: establishing a first mapping relationship table between variable verbs and first-type entities, where one variable verb corresponds to multiple first-type entities.
Referring to Fig. 6, the voice interaction apparatus 10 includes a first relationship table establishment module 101.
Step 001 can be implemented by the relationship table establishment module 101. That is, the relationship table establishment module 101 establishes the first mapping relationship table between variable verbs and first-type entities, one variable verb corresponding to multiple first-type entities.
Variable verbs are degree words that can vary, such as big/small, high/low, and forward/backward. One variable verb corresponds to multiple first-type entities. For example, in the first mapping relationship table, the variable verb "大" (big/louder) corresponds to the first-type entities "volume" and "airflow"; the variable verb "前" (forward) corresponds to the first-type entity "seatback"; and the variable verb "高" (high) corresponds to the first-type entities "temperature" and "seat". There may be one or more variable verbs, which is not limited here.
In the first mapping relationship table between variable verbs and first-type entities, the mapping between the variable verbs "大" (big), "小" (small), "高" (high), "低" (low), "亮" (bright), and "暗" (dark) and their corresponding first-type entities can be:
{"大": [音量 volume, 风量 airflow]
"小": [音量 volume, 风量 airflow]
"高": [温度 temperature, 音量 volume]
"低": [温度 temperature, 音量 volume]
"亮": [屏幕 screen, 氛围灯 ambient light]
"暗": [屏幕 screen, 氛围灯 ambient light]}.
Referring to Fig. 7, step 022 includes:
0221: normalizing the preset text word to determine the variable verb corresponding to the preset text word;
0222: determining the first-type entity according to the variable verb and the first mapping relationship table.
Referring to Fig. 8, the first determination unit 122 includes a first determination subunit 1221 and a second determination subunit 1222.
Step 0221 can be implemented by the first determination subunit 1221 and step 0222 by the second determination subunit 1222. That is, the first determination subunit 1221 normalizes the preset text word to determine its corresponding variable verb, and the second determination subunit 1222 determines the first-type entity according to the variable verb and the first mapping relationship table.
The extracted preset text word is normalized, and the mapping relationship table between variable verbs and entities is searched. For example, normalizing the preset text word "大大大" yields the word "大", and from the mapping relationship table between variable verbs and entities the corresponding first-type entities of the normalized word "大" are "volume" (音量) and "airflow" (风量).
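Steps 0221 and 0222 can be sketched with the table above (collapsing the reduplication to its first character is an assumed normalization that holds for single-character degree words; the lexicon follows the first mapping relationship table in the text):

```python
FIRST_MAPPING_TABLE = {
    "大": ["音量", "风量"],   # big/louder  -> volume, airflow
    "小": ["音量", "风量"],   # small/lower -> volume, airflow
    "高": ["温度", "音量"],   # high        -> temperature, volume
    "低": ["温度", "音量"],   # low         -> temperature, volume
    "亮": ["屏幕", "氛围灯"], # bright      -> screen, ambient light
    "暗": ["屏幕", "氛围灯"], # dark        -> screen, ambient light
}

def first_entities_for(preset_text_word: str) -> list[str]:
    """Normalize a reduplicated preset text word ('大大大' -> '大') and look up
    the candidate first-type entities in the first mapping relationship table."""
    variable_verb = preset_text_word[:1]
    return FIRST_MAPPING_TABLE.get(variable_verb, [])
```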
Referring to Fig. 9, the voice interaction method includes:
002: establishing a second mapping relationship table between first-type entities and second-type entities.
Referring to Fig. 10, the voice interaction apparatus 10 includes a second relationship table establishment module 102.
Step 002 can be implemented by the second relationship table establishment module 102. That is, the second relationship table establishment module 102 establishes the second mapping relationship table between first-type entities and second-type entities.
In the second mapping relationship table between first-type entities and second-type entities, for example, the second-type entities corresponding to the first-type entity "volume" are "navigation volume", "system volume", "media volume", "Little P sound", etc., and the second-type entities corresponding to the first-type entity "screen" are "big screen", "instrument cluster", "center console", etc. Second-type entities can be added or removed according to the hardware the vehicle has.
Referring to Fig. 11, step 03 includes:
031: when the current page of the screen is a non-expandable page, determining the second-type entity according to the non-expandable page, the first-type entity, and the second mapping relationship table.
Referring to Fig. 2, step 031 can be implemented by the second-type entity determination module 13, which, when the current page of the screen is a non-expandable page, determines the second-type entity according to the non-expandable page, the first-type entity, and the second mapping relationship table.
A non-expandable page is an interface that contains no pop-up controls but may contain draggable elements, such as the system volume adjustment interface. When the current page of the screen is a non-expandable page, since it contains no pop-up control interface, the second-type entity can be determined directly from the non-expandable page, the first-type entity, and the second mapping relationship table; checking first that the current page is non-expandable improves the efficiency of determining the second-type entity.
In addition, the voice interaction method of this application first checks whether the current page of the screen is a non-expandable page and determines the second-type entity when it is. It does not need to first detect, on an expandable page, whether the control mentioned in the user's speech is on the screen, then open that control and perform the voice-requested operation on it; that is, the voice interaction method of this application is simpler.
Referring to Fig. 12, step 03 includes:
032: when the current page is an expandable page, obtaining the main page name and control name of the expandable page;
033: determining the second-type entity according to the main page name, the control name, and the second mapping relationship table.
Referring to Fig. 2, steps 032 and 033 can be implemented by the second-type entity determination module 13, which, when the current page is an expandable page, obtains the main page name and control name of the expandable page and determines the second-type entity according to the main page name, the control name, and the second mapping relationship table.
An expandable page is an interface containing pop-up controls, such as the system settings interface, which contains a volume adjustment control that can be opened.
This application only reads the possible control names of the current page and the main page name; it does not need to create executable-node script commands for the controls, because the command issued for a voice request of this application is implemented by assembling a script from "control name" + "reduplicated word".
In this way, compared with traditional speech recognition logic, which must create an executable script for every control among the screen elements and then match control names in the user's utterance, an approach that is complex and slow, this application only reads the names of screen elements and assembles commands itself, which is faster and more convenient and saves the large amount of space needed to store a script for every control.
Understandably, not every vehicle function can, or needs to, be adjusted with precise scale values. For example, the movement of a seat in each direction can be adjusted through a vehicle part, whereas a door has no knob, button or similar part for scale adjustment and is usually only opened and closed via the door handle. Seat adjustment therefore falls within the control range of vehicle parts, while door adjustment falls within the non-control range.
The information of the vehicle parts is obtained; from this information, the hardware that can be scale-adjusted through vehicle parts is determined as the control range of the vehicle parts, and the hardware that cannot is determined as the non-control range.
First, the vehicle parts that allow scale adjustment are identified, for example the volume knob, the screen brightness button, the air-conditioning fan knob/button and the seat adjustment knob/button. Further, the control range of the vehicle parts may include the in-vehicle audio system, the screens in the vehicle, the air conditioner, the seats, the interior ambient lights, the exterior lights, the windows, etc. The non-control range may include the doors, the rearview mirrors, the trunk, etc.
In subsequent voice interaction, a voice prompt can be given when a voice request targets the non-control range of the vehicle parts.
In this way, by collecting vehicle part information and confirming which functions can be scale-adjusted through vehicle parts, the control range of the vehicle parts is determined, i.e. the control range in which scale adjustment can be performed through voice interaction.
After the control range and non-control range are determined, an adjustable range must be determined for each vehicle part in the control range. The adjustable range of a vehicle part corresponds to the scale range adjusted by operating that part and, depending on the part, may be a set of levels or a measuring range. For example, if pressing the screen brightness button five times in succession raises the brightness level one step at a time up to the maximum, the adjustable range of that button is levels 1 to 5. Likewise, if the knob for moving a seat forward and backward has a total scale of 90, the adjustable range of that knob is scale values 1 to 90.
The control range of the vehicle parts and the adjustable range of each part are then mapped into an intent system that the intent recognition model can understand: a preset intent is defined for each object in the control range together with the corresponding adjustable range. For example, "system volume up" represents the preset intent of turning the volume up, and "system volume down" represents the preset intent of turning the volume down. A concrete intent mapping system is thus defined over the control range and the adjustable ranges of the vehicle parts.
As for the preset scale-adjustment precision: if, for example, voice interaction simulates operating a part so that the volume changes by 3 scale units per step on a total scale of 60, the preset precision range may be 1 to 20; if the seat moves 18 scale units per step on a total scale of 90, the preset precision range is 1 to 5.
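The preset precision range follows directly from the total scale and the per-step increment (volume: 60 total at 3 per step gives 1 to 20; seat: 90 total at 18 per step gives 1 to 5). As a one-line arithmetic sketch:

```python
def precision_range(total_scale: int, step: int) -> range:
    """Number of voice-adjustable precision steps for a part:
    total scale divided by the per-step increment."""
    return range(1, total_scale // step + 1)


precision_range(60, 3)    # volume knob: steps 1..20
precision_range(90, 18)   # seat fore/aft: steps 1..5
```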
Referring to FIG. 13, the voice interaction method includes:
003: training an intent recognition model on intent training data, the intent training data being related to the vehicle parts and the adjustable ranges of the vehicle parts.
Referring also to FIG. 14, the voice interaction apparatus 10 includes an intent recognition model acquisition module 103.
Step 003 may be implemented by the intent recognition model acquisition module 103. That is, the intent recognition model acquisition module 103 is configured to train the intent recognition model on intent training data related to the vehicle parts and their adjustable ranges.
The present application uses machine learning to train an intent recognition model on data corresponding to the scale-adjustable vehicle parts and their adjustable ranges, and then performs intent recognition on the rewritten current-turn voice request, so that the user's intent is recognized accurately. Models such as BERT, ALBERT, XLNet or RoBERTa can be used for training.
The intent training data are related to the scale-adjustable vehicle parts and their adjustable ranges. A vehicle part here is a part of a smart car that allows scale adjustment, for example the volume knob, the screen brightness button, the air-conditioning fan knob/button or the seat adjustment knob/button. The adjustable range of a vehicle part corresponds to the scale range adjusted by operating that part and, depending on the part, may be a set of levels or a measuring range.
The intent recognition model of the present application is trained before use. With the relevant user permissions obtained, the intent training data can be collected from a certain number of historical user voice requests, which are then filtered to keep only requests that are semantically clear and contain a specific goal: requests with obviously unclear semantics, and short requests containing only interjections such as "啊" ("ah") or "哦" ("oh"), are removed.
The filtered voice requests are then labeled against the defined preset intents. For example, the request "屏幕亮亮亮" ("screen brighter-brighter-brighter") can be labeled with the intent of making the screen brighter. The labeled data are then quality-checked, and labels that do not match any preset intent are filtered out again, leaving labeled data usable for intent model training. For example, if the request "车门开" ("door open") is labeled with the intent of opening the door, but no scale-adjustable part adjusts the doors, that request is removed by the filtering.
During training, the labeled data usable for intent model training serve as the intent training data and are split into an intent training set and an intent validation set; the split ratio can be set as required and is not limited here, e.g. 80% training set and 20% validation set. The data in the intent training set are used to train the intent recognition model, for which models such as BERT, ALBERT, XLNet or RoBERTa can be used.
For example, for the established intent recognition model, at least part of the data in the intent training set is first used for training, and then at least part of the data in the intent validation set is used to validate the accuracy of the trained model. If the validation accuracy does not reach the intent accuracy threshold, the model is trained again on at least another part of the training set and validated again on another part of the validation set. Training and validation are repeated in this way until the validation accuracy reaches the intent accuracy threshold, at which point the intent recognition model is considered to meet the standard and its training is complete.
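The alternating train/validate loop described above can be sketched as follows. This is a schematic, framework-agnostic sketch: `train_on`, `accuracy_on` and the batching of the two data sets are placeholders, not the patent's actual implementation:

```python
from typing import Callable, Sequence


def train_until_threshold(
    model,
    train_batches: Sequence,   # disjoint portions of the intent training set
    val_batches: Sequence,     # disjoint portions of the intent validation set
    train_on: Callable,        # train_on(model, batch) -> None
    accuracy_on: Callable,     # accuracy_on(model, batch) -> float
    threshold: float = 0.95,
) -> bool:
    """Alternate training and validation; each batch is used only once.

    Returns True once validation accuracy reaches the threshold, and False
    if the data is exhausted first (more data must then be collected).
    """
    for train_batch, val_batch in zip(train_batches, val_batches):
        train_on(model, train_batch)
        if accuracy_on(model, val_batch) >= threshold:
            return True
    return False
```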
It should be noted that each item in the intent training set and the intent validation set is used only once. If the intent recognition model has gone through all the data in both sets without meeting the standard, more voice requests can be collected, again with the user's permission, and filtered and labeled into additional intent training data, so that the intent recognition model is guaranteed to recognize accurately the intent of an input voice request.
It can be understood that the intent recognition model may be trained offline. Once the offline-trained model is deployed to the server or the vehicle, the server or the vehicle can use it to perform intent recognition on received voice requests.
Referring to FIG. 15, step 05 includes:
051: obtaining the intent discrimination probabilities of the intent recognition result with respect to the preset intents;
052: determining the one preset intent whose intent discrimination probability is greater than a first probability threshold as the target intent of the voice request.
Referring also to FIG. 16, the voice interaction module 15 includes a first acquisition unit 151 and an intent determination unit 152.
Step 051 may be implemented by the first acquisition unit 151, and step 052 by the intent determination unit 152. That is, the first acquisition unit 151 is configured to obtain the intent discrimination probabilities of the intent recognition result with respect to the preset intents, and the intent determination unit 152 is configured to determine the one preset intent whose probability is greater than the first probability threshold as the target intent of the voice request.
The trained model performs intent recognition on the text to be recognized, producing a result that contains the probability that the text matches each preset intent, i.e. multiple intent discrimination probabilities. If the first probability threshold is 0.9 and the probability of some preset intent class exceeds 0.9, the server regards that preset intent class as the target intent of the current user's voice request. The first probability threshold can also take other values; it may be a default value or be set by the user as needed, which is not limited here.
The preset intents of the present application may include at least one of: volume up, volume down, fan speed up, fan speed down, temperature up, temperature down, map zoom in, map zoom out, screen brighter, screen dimmer, screen scroll up, screen scroll down, instrument cluster brighter, instrument cluster dimmer, ambient light brighter, ambient light dimmer, seat forward, seat backward, seat up, seat down, seat back forward, seat back backward, window up and window down.
It should be understood that the preset intents in the present application are only illustrative; a corresponding preset intent can be defined for any scale-adjustable object in the vehicle according to its actual operation.
In this way, multiple preset intents can be defined according to the specific vehicle, covering the voice interaction scenarios that may be encountered.
Step 05 further includes:
053: when none of the intent discrimination probabilities of the preset intents is greater than the first probability threshold, determining that the intent of the voice request is a non-scale-adjustment intent.
Step 053 may be implemented by the intent determination unit 152. That is, the intent determination unit 152 is configured to determine that the intent of the voice request is a non-scale-adjustment intent when none of the probabilities of the preset intents is greater than the first probability threshold.
For example, when the discrimination probabilities of all preset intent classes are not greater than the first probability threshold, i.e. the intent recognition result of the voice request matches every preset intent class only with a probability below the threshold (e.g. 0.9), the intent of the voice request is determined to be a non-scale-adjustment intent, that is, a user intent that does not adjust a preset vehicle function through a scale-adjustable vehicle part. For example, for the voice request "车门开开开" ("door open-open-open"), since the doors cannot be adjusted through a scaled vehicle part, the intent of that request is a non-scale-adjustment intent.
Referring to FIG. 17, step 05 includes:
054: performing precision recognition on the text to be recognized using a precision recognition model, and performing voice interaction according to the intent recognition result and the precision recognition result.
Referring to FIG. 18, the voice interaction module 15 includes a precision recognition unit 153.
Step 054 may be implemented by the precision recognition unit 153. That is, the precision recognition unit 153 is configured to perform precision recognition on the text to be recognized using the precision recognition model and to perform voice interaction according to the intent recognition result and the precision recognition result.
In this way, the present application uses machine learning to train a precision recognition model on data corresponding to the scale-adjustable vehicle parts, their adjustable ranges and their scale-adjustment precision ranges, and then performs precision recognition on the voice request with this model, so that the scale-adjustment precision desired by the user is recognized accurately.
Referring to FIG. 19, step 054 includes:
0541: training the precision recognition model on precision training data, the precision training data being related to the vehicle parts, their adjustable ranges and their scale-adjustment precision ranges.
Referring also to FIG. 18, step 0541 may be implemented by the precision recognition unit 153. That is, the precision recognition unit 153 is configured to train the precision recognition model on precision training data related to the vehicle parts, their adjustable ranges and their scale-adjustment precision ranges.
In this way, a precision recognition model pre-trained on the precision training data can perform precision recognition on the text to be recognized, recognizing the adjustment precision for a given vehicle part, producing a precision recognition result and finally determining the target scale-adjustment precision value.
The precision training data are related to the scale-adjustable vehicle parts and their adjustable ranges; that is, they cover all vehicle parts that allow scale adjustment, for example the volume knob, the screen brightness button, the air-conditioning fan knob/button and the seat adjustment knob/button. The adjustable range of a vehicle part corresponds to the scale range adjusted by operating that part; depending on the part, the adjustable range may be a set of levels or a measuring range, and the scale-adjustment precision range may be the number of scale units adjusted per step.
As with the intent training data, the precision training data can be collected, with the relevant user permissions obtained, from a certain number of historical user voice requests, which are filtered to keep only requests that are semantically clear and contain a specific goal: requests with obviously unclear semantics, and short requests containing only interjections such as "啊" or "哦", are removed. The historical voice requests obtained for precision training may be the same as those obtained for intent training, and the filtering step may be the same as in intent training.
The filtered voice requests are then labeled manually with the scale-adjustment precision value the user wants. For example, the request "屏幕亮亮亮" is labeled with a precision value of 3 for adjusting the in-vehicle screen brightness. The precision recognition model is then built on slot extraction; usable slot-extraction algorithms include RNN slot filling, CRF, etc. The labeled data serve as the precision training data and are split into a precision training set and a precision validation set; the split ratio can be set as required and is not limited here, e.g. 80% training set and 20% validation set. The data in the precision training set are used to train the precision recognition model: at least part of the training set is first used for training, then at least part of the validation set is used to validate the accuracy of the trained model. If the validation accuracy does not reach the precision accuracy threshold, the model is trained again on at least another part of the training set and validated again on another part of the validation set. Training and validation are repeated in this way until the validation accuracy reaches the precision accuracy threshold, at which point the precision recognition model is considered to meet the standard and its training is complete.
It should be noted that each item in the precision training set and the precision validation set is used only once. If the precision recognition model has gone through all the data in both sets without meeting the standard, more voice data can be collected, again with the user's permission, and filtered and labeled into additional precision training data, so that the precision recognition model is guaranteed to recognize accurately the scale-adjustment precision of an input voice request.
Referring to FIG. 20, step 054 includes:
0542: obtaining the precision discrimination probabilities of the precision recognition result with respect to multiple preset scale-adjustment precision values;
0543: determining the one preset scale-adjustment precision value whose precision discrimination probability is greater than a second probability threshold as the target scale-adjustment precision value of the voice request.
Referring also to FIG. 21, the precision recognition unit 153 includes a second acquisition unit 1532 and a precision determination unit 1533. That is, the second acquisition unit 1532 is configured to obtain the precision discrimination probabilities of the precision recognition result with respect to the preset scale-adjustment precision values, and the precision determination unit 1533 is configured to determine the one preset precision value whose probability is greater than the second probability threshold as the target scale-adjustment precision value of the voice request.
The precision discrimination probability is the probability that the recognized precision of the voice request matches each preset scale-adjustment precision value. The second probability threshold may be, for example, 0.7, 0.8, 0.9 or another value, which is not limited here.
When the precision discrimination probability is 1 and the second probability threshold is 0.9, i.e. the probability 1 exceeds the threshold 0.9, the target scale-adjustment precision value of the volume adjustment corresponding to the voice request "音量大大大大大" ("volume" followed by "大" repeated five times) is determined to be 5.
Step 054 further includes:
0544: when none of the precision discrimination probabilities of the preset scale-adjustment precision values is greater than the second probability threshold, determining that the precision recognition of the voice request has failed.
Step 0544 may be implemented by the precision determination unit 1533. That is, the precision determination unit 1533 is configured to determine that the precision recognition of the voice request has failed when none of the probabilities of the preset precision values is greater than the second probability threshold.
If none of the precision discrimination probabilities of the preset scale-adjustment precision values is greater than the second probability threshold, the precision recognition of the input voice request is in error, and voice requests unrelated to scale-adjustment precision can thus be excluded.
Referring to FIG. 22, step 05 includes:
055: fusing the target intent and the target scale-adjustment precision value into a control instruction to control the corresponding vehicle part.
Referring to FIG. 23, the voice interaction module 15 includes an instruction generation unit 154.
Step 055 may be implemented by the instruction generation unit 154. That is, the instruction generation unit 154 is configured to fuse the target intent and the target scale-adjustment precision value into a control instruction to control the corresponding vehicle part.
The target intent and the target scale-adjustment precision value obtained in the preceding steps are fused into a control instruction, yielding control information that combines intent and precision, so that the corresponding vehicle part can be controlled precisely according to the user's abbreviated voice command and the user's true intent is fulfilled.
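The fusion of target intent and target precision value into one control instruction can be sketched as follows. The intent-to-part mapping and the instruction format are illustrative assumptions, not the patent's disclosed message format:

```python
# Assumed mapping from preset intent to the controlled part and direction.
INTENT_TO_PART = {
    "system volume up": ("volume_knob", +1),
    "system volume down": ("volume_knob", -1),
}


def build_control_instruction(intent: str, precision: int) -> dict:
    """Fuse the recognized target intent and target scale-adjustment
    precision value into one control instruction for the vehicle part."""
    part, direction = INTENT_TO_PART[intent]
    return {"part": part, "delta": direction * precision}


build_control_instruction("system volume up", 5)
# -> {"part": "volume_knob", "delta": 5}
```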
Referring to FIG. 24, the present application further provides a server 20. The server 20 includes a processor 21 and a memory 22; the memory 22 stores a computer program 221 which, when executed by the processor 21, implements the voice interaction method described in any of the above embodiments. The server 20 may be installed inside the vehicle or arranged externally to it, which is not limited here.
By establishing mappings between multiple second-type entities and first-type entities, retrieving the actual second-type entity from the screen elements, and combining the second-type entity with the preliminary recognition text into the text to be recognized, the server 20 of the present application can accurately recognize the user's true intent.
Referring to FIG. 25, the present application further provides a non-volatile computer-readable storage medium 30 containing a computer program 31. When the computer program 31 is executed by one or more processors 40, the voice interaction method of any of the above embodiments is implemented.
For example, when executed by the processor 40, the computer program 31 implements the following steps of the voice interaction method:
01: performing speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, the preset function being a function of scale adjustment that simulates operating a vehicle part;
02: determining the corresponding first-type entity according to the preliminary recognition text;
03: performing a screen element query according to the first-type entity to obtain a second-type entity, one first-type entity corresponding to multiple second-type entities;
04: combining the second-type entity and the preliminary recognition text into a text to be recognized;
05: performing intent recognition on the text to be recognized using an intent recognition model, and performing voice interaction according to the intent recognition result.
Understandably, the computer program 31 includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), a software distribution medium, etc.
By establishing mappings between multiple second-type entities and first-type entities, retrieving the actual second-type entity from the screen elements, and combining the second-type entity with the preliminary recognition text into the text to be recognized, the computer-readable storage medium of the present application can accurately recognize the user's true intent.
The embodiments of the present application have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or the improvement over technologies on the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (18)

  1. A voice interaction method, characterized by comprising:
    performing speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, the preset function being a function of scale adjustment that simulates operating a vehicle part;
    determining the corresponding first-type entity according to the preliminary recognition text;
    performing a screen element query according to the first-type entity to obtain a second-type entity, one first-type entity corresponding to multiple second-type entities;
    combining the second-type entity and the preliminary recognition text into a text to be recognized;
    performing intent recognition on the text to be recognized using an intent recognition model, and performing voice interaction according to the result of the intent recognition.
  2. The voice interaction method according to claim 1, characterized in that determining the corresponding first-type entity according to the preliminary recognition text comprises:
    performing reduplicated-word extraction on the preliminary recognition text to obtain a preset text word;
    determining the first-type entity of the preliminary recognition text according to the preset text word.
  3. The voice interaction method according to claim 2, characterized in that performing reduplicated-word extraction on the preliminary recognition text to obtain a preset text word comprises: performing reduplicated-word extraction on the preliminary recognition text by string matching or regular-expression search to obtain the preset text word.
  4. The voice interaction method according to claim 1, characterized in that the voice interaction method comprises:
    establishing a first mapping table between variable verbs and the first-type entities, one variable verb corresponding to multiple first-type entities.
  5. The voice interaction method according to claim 4, characterized in that determining the first-type entity of the preliminary recognition text according to the preset text word comprises:
    normalizing the preset text word to determine the variable verb corresponding to the preset text word;
    determining the first-type entity according to the variable verb and the first mapping table.
  6. The voice interaction method according to claim 1, characterized in that the voice interaction method comprises:
    establishing a second mapping table between the first-type entities and the second-type entities.
  7. The voice interaction method according to claim 6, characterized in that performing a screen element query according to the first-type entity to obtain a second-type entity comprises:
    when the current page of the screen is a non-expandable page, determining the second-type entity according to the non-expandable page, the first-type entity and the second mapping table.
  8. The voice interaction method according to claim 7, characterized in that performing a screen element query according to the first-type entity to obtain a second-type entity comprises:
    when the current page is an expandable page, obtaining the main page name and the control names of the expandable page;
    determining the second-type entity according to the main page name, the control names and the second mapping table.
  9. The voice interaction method according to claim 1, characterized in that the voice interaction method comprises:
    training the intent recognition model on intent training data, the intent training data being related to vehicle parts and the adjustable ranges of the vehicle parts.
  10. The voice interaction method according to claim 1, characterized in that performing intent recognition on the text to be recognized using an intent recognition model and performing voice interaction according to the result of the intent recognition comprises:
    obtaining the intent discrimination probabilities of the result of the intent recognition with respect to the preset intents;
    determining the one preset intent whose intent discrimination probability is greater than a first probability threshold as the target intent of the voice request.
  11. The voice interaction method according to claim 10, characterized in that the preset intents include at least one of: volume up, volume down, fan speed up, fan speed down, temperature up, temperature down, map zoom in, map zoom out, screen brighter, screen dimmer, screen scroll up, screen scroll down, instrument cluster brighter, instrument cluster dimmer, ambient light brighter, ambient light dimmer, seat forward, seat backward, seat up, seat down, seat back forward, seat back backward, window up and window down.
  12. The voice interaction method according to claim 10, characterized in that performing intent recognition on the text to be recognized using an intent recognition model and performing voice interaction according to the result of the intent recognition comprises:
    performing precision recognition on the text to be recognized using a precision recognition model, and performing voice interaction according to the result of the intent recognition and the result of the precision recognition.
  13. The voice interaction method according to claim 12, characterized in that the voice interaction method comprises:
    training the precision recognition model on precision training data, the precision training data being related to vehicle parts, the adjustable ranges of the vehicle parts and the scale-adjustment precision ranges of the vehicle parts.
  14. The voice interaction method according to claim 12, characterized in that performing precision recognition on the text to be recognized using a precision recognition model and performing voice interaction according to the result of the intent recognition and the result of the precision recognition comprises:
    obtaining the precision discrimination probabilities of the result of the precision recognition with respect to multiple preset scale-adjustment precision values;
    determining the one preset scale-adjustment precision value whose precision discrimination probability is greater than a second probability threshold as the target scale-adjustment precision value of the voice request.
  15. The voice interaction method according to claim 14, characterized in that performing voice interaction according to the result of the intent recognition and the result of the precision recognition comprises:
    fusing the target intent and the target scale-adjustment precision value into a control instruction to control the corresponding vehicle part.
  16. A voice interaction apparatus, characterized by comprising:
    a speech recognition module configured to perform speech recognition on a voice request for adjusting a preset vehicle function to obtain a preliminary recognition text, the preset function being a function of scale adjustment that simulates operating a vehicle part;
    a determination module configured to determine the corresponding first-type entity according to the preliminary recognition text;
    a query module configured to perform a screen element query according to the first-type entity to obtain a second-type entity, one first-type entity corresponding to multiple second-type entities;
    a combination module configured to combine the second-type entity and the preliminary recognition text into a text to be recognized;
    a voice interaction module configured to perform intent recognition on the text to be recognized using an intent recognition model and to perform voice interaction according to the result of the intent recognition.
  17. A server, characterized in that the server comprises a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the voice interaction method of any one of claims 1-15.
  18. A non-volatile computer-readable storage medium containing a computer program, characterized in that, when the computer program is executed by one or more processors, the voice interaction method of any one of claims 1-15 is implemented.
PCT/CN2022/138587 2021-12-28 2022-12-13 Voice interaction method and apparatus, server, and readable storage medium WO2023124957A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111617605.1 2021-12-28
CN202111617605.1A CN113990301B (zh) 2021-12-28 2021-12-28 Voice interaction method and apparatus, server, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023124957A1 true WO2023124957A1 (zh) 2023-07-06

Family

ID=79734616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/138587 WO2023124957A1 (zh) 2021-12-28 2022-12-13 语音交互方法及其装置、服务器和可读存储介质

Country Status (2)

Country Link
CN (1) CN113990301B (zh)
WO (1) WO2023124957A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113990301B (zh) * 2021-12-28 2022-05-13 广州小鹏汽车科技有限公司 Voice interaction method and apparatus, server, and readable storage medium
CN115565532B (zh) * 2022-12-02 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server, and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840064A (zh) * 2017-11-28 2019-06-04 通用汽车环球科技运作有限责任公司 Controlling volume levels based on user profile
CN110288985A (zh) * 2019-06-28 2019-09-27 北京猎户星空科技有限公司 Voice data processing method and apparatus, electronic device, and storage medium
CN111768780A (zh) * 2020-06-28 2020-10-13 广州小鹏车联网科技有限公司 Voice control method, information processing method, vehicle, and server
CN112102832A (zh) * 2020-09-18 2020-12-18 广州小鹏汽车科技有限公司 Speech recognition method and apparatus, server, and computer-readable storage medium
CN112463106A (zh) * 2020-11-12 2021-03-09 深圳Tcl新技术有限公司 Smart-screen-based voice interaction method, apparatus, device, and storage medium
CN113990301A (zh) * 2021-12-28 2022-01-28 广州小鹏汽车科技有限公司 Voice interaction method and apparatus, server, and readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190061706A (ko) * 2017-11-28 2019-06-05 현대자동차주식회사 Speech recognition system and method for analyzing commands containing multiple intents
CN109473100A (zh) * 2018-11-12 2019-03-15 四川驹马科技有限公司 Speech-recognition-based human-machine voice interaction method and system for business scenarios
KR20210147678A (ko) * 2020-05-29 2021-12-07 엘지전자 주식회사 Artificial intelligence device
CN112882679B (zh) * 2020-12-21 2022-07-01 广州橙行智动汽车科技有限公司 Voice interaction method and apparatus
CN112685535A (zh) * 2020-12-25 2021-04-20 广州橙行智动汽车科技有限公司 Voice interaction method, server, voice interaction system, and storage medium
CN113239178A (zh) * 2021-07-09 2021-08-10 肇庆小鹏新能源投资有限公司 Intent generation method, server, voice control system, and readable storage medium


Also Published As

Publication number Publication date
CN113990301A (zh) 2022-01-28
CN113990301B (zh) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22914212

Country of ref document: EP

Kind code of ref document: A1