CN116092495B - Voice interaction method, server and computer readable storage medium - Google Patents

Voice interaction method, server and computer readable storage medium

Info

Publication number
CN116092495B
CN116092495B (application CN202310374373.4A)
Authority
CN
China
Prior art keywords
slot
application program
program interface
voice
voice request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310374373.4A
Other languages
Chinese (zh)
Other versions
CN116092495A (en)
Inventor
丁鹏傑
赵群
宁洪珂
樊骏锋
朱麒宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202310374373.4A
Publication of CN116092495A
Application granted
Publication of CN116092495B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Navigation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice interaction method comprising the following steps: receiving a voice request forwarded by a vehicle; encoding the voice request according to a preset model to obtain a coding sequence matrix for slot category identification; performing slot recognition on the voice request according to the coding sequence matrix; performing application program interface prediction on the voice request; and, according to the slot recognition result and the predicted application program interface, selecting the predicted application program interface to execute application program interface parameter filling, then outputting the execution result and delivering it to the vehicle to complete the voice interaction. In this application, the voice request is encoded to obtain a coding sequence matrix, slots are recognized according to that matrix, parameters are filled according to the slot recognition result and the predicted application program interface, and the execution result is finally output and delivered to the vehicle to complete the voice interaction. Introducing the coding sequence matrix can effectively improve the accuracy of nested slot recognition and improve the user's voice interaction experience.

Description

Voice interaction method, server and computer readable storage medium
Technical Field
The present application relates to the field of vehicle-mounted voice technologies, and in particular, to a voice interaction method, a server, and a computer readable storage medium.
Background
The current dialogue system uses a natural language understanding module to parse the user's utterance into semantic labels that a machine can understand, maintains an internal dialogue state as a compact representation of the whole dialogue history through a dialogue state tracking module, uses a dialogue policy module to select a suitable dialogue action according to that state, and finally converts the dialogue action into a natural-language reply through a natural language generation module. In a real dialogue scene, sentences in a user's voice request may partially overlap or contain one another, and the recognition result in the related art may then be wrong, so the desired slot result cannot be extracted. Voice interaction in the vehicle-mounted environment therefore lacks fluency and can hardly meet the vehicle-control requirements of the vehicle-mounted scene.
Disclosure of Invention
The application provides a voice interaction method, a server and a computer readable storage medium.
The voice interaction method of the application comprises the following steps:
receiving a voice request forwarded by a vehicle;
performing coding processing on the voice request according to a preset model to obtain a coding sequence matrix for identifying the slot class;
performing slot recognition on the voice request according to the coding sequence matrix;
performing application program interface prediction on the voice request;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position recognition result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
Thus, the voice interaction method can encode the voice request to obtain a coding sequence matrix and perform slot recognition on the voice request according to that matrix. Parameter filling of the predicted application program interface can then be performed according to the slot recognition result, and the execution result is finally output and delivered to the vehicle to complete the voice interaction. Because the voice request is encoded into a coding sequence matrix and slot recognition is performed on that matrix, the accuracy of nested slot recognition can be effectively improved, improving the user's voice interaction experience.
The step of performing coding processing on the voice request according to a preset model to obtain a coding sequence matrix for identifying the slot class comprises the following steps:
performing text sequence coding processing on the voice request to obtain a first coding vector;
inputting the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence;
and obtaining the coding sequence matrix according to the output matrix and the preset model.
Thus, the voice request can be encoded according to the preset model to obtain the encoding sequence matrix for identifying the slot class.
The obtaining the coding sequence matrix according to the output matrix and the preset model comprises the following steps:
extracting a head matrix corresponding to a first code and a tail matrix corresponding to a last code in the output matrix;
and carrying out coding processing on the head matrix and the tail matrix according to the preset model so as to obtain the coding sequence matrix.
Thus, the matrix corresponding to the first code and the last code in the output matrix can be extracted, and the extracted matrix is subjected to coding processing to obtain a coding sequence matrix so as to identify the slot position of the voice request.
The step of performing slot recognition on the voice request according to the coding sequence matrix comprises the following steps:
and identifying the slot position value of the voice request and the slot position type corresponding to the slot position value according to the coding sequence matrix.
Thus, the slot position recognition process can be completed by recognizing the slot position value of the voice request and the slot position type corresponding to the slot position value according to the coding sequence matrix.
The identifying the slot value of the voice request and the slot type corresponding to the slot value according to the coding sequence matrix comprises the following steps:
and identifying all semantic vectors in the voice request according to the coding sequence matrix to identify the slot position value of the voice request.
Thus, the slot position value of the voice request can be identified according to all semantic vectors of the voice request in the coding sequence matrix so as to determine the slot position type corresponding to the slot position value, and the slot position identification process is completed.
The identifying the slot value of the voice request and the slot type corresponding to the slot value according to the coding sequence matrix comprises the following steps:
performing slot type mapping processing on each semantic vector to determine a slot type vector corresponding to each semantic vector;
and determining the slot type corresponding to the slot value according to the slot type vector.
Therefore, the semantic vector can be mapped to obtain the slot type vector, the slot type corresponding to the slot value is determined according to the slot type vector, the slot identification process of the nested slot is completed, and the accuracy of the slot identification process is improved.
The step of performing slot type mapping processing on each semantic vector to determine a slot type vector corresponding to each semantic vector includes:
performing slot type mapping processing on each semantic vector to obtain the confidence coefficient of each slot type vector relative to the semantic vector;
and determining a slot type vector corresponding to each semantic vector according to the confidence.
Therefore, the semantic vector can be subjected to slot type mapping processing, and the corresponding slot type vector is determined according to the confidence level, so that the slot type is finally determined, the process of slot identification is completed, and the accuracy of the nested slot identification process is improved.
And selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction, wherein the method comprises the following steps of:
determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the target parameter, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
Therefore, the method and the device can select the predicted application program interface to execute the application program interface parameter filling according to the result of the slot position identification and the target parameter, directly output the execution result and issue the execution result to the vehicle to complete the voice interaction, reduce the delay of the vehicle-mounted system and improve the response speed to the user instruction.
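As a hypothetical illustration of this final step (the API names, parameter names, and registry below are invented for illustration, not taken from the patent), parameter filling can be thought of as mapping the slot-recognition result onto the predicted interface's parameter list before execution:

```python
# Hypothetical sketch: select the predicted API, fill its parameters from
# the slot-recognition result, execute, and return the result that would
# be delivered to the vehicle. All names here are illustrative.

API_REGISTRY = {
    "play_music": {"params": ["song"],
                   "handler": lambda song: f"playing {song}"},
    "navigate": {"params": ["destination"],
                 "handler": lambda destination: f"routing to {destination}"},
}

def fill_and_execute(predicted_api, slot_result):
    """Map slot values onto the predicted API's parameters and run it."""
    spec = API_REGISTRY[predicted_api]
    kwargs = {p: slot_result[p] for p in spec["params"]}   # parameter filling
    return spec["handler"](**kwargs)                       # execution result

result = fill_and_execute("play_music", {"song": "song A"})
```

Because the registry maps each interface to its own parameter list, the same filling routine serves every predicted interface without per-domain code.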
The server of the present application comprises a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, implements the method described above.
The computer readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the voice interaction method of any of the above embodiments.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a dialogue system in the related art;
FIG. 2 is a schematic diagram of the architecture of the dialog system of the end-to-end architecture of the present application;
FIG. 3 is a first flowchart of the voice interaction method of the present application;
FIG. 4 is a second flowchart of the voice interaction method of the present application;
FIG. 5 is a third flowchart of the voice interaction method of the present application;
FIG. 6 is a process flow diagram of the voice interaction method of the present application;
FIG. 7 is a fourth flowchart of the voice interaction method of the present application;
FIG. 8 is a schematic diagram of a coding sequence matrix of the voice interaction method of the present application;
FIG. 9 is a fifth flowchart of the voice interaction method of the present application;
FIG. 10 is a sixth flowchart of the voice interaction method of the present application;
FIG. 11 is a seventh flowchart of the voice interaction method of the present application;
FIG. 12 is an eighth flowchart of the voice interaction method of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the embodiments of the present application and are not to be construed as limiting the embodiments of the present application.
Referring to fig. 1, the conventional vehicle-mounted voice architecture is based on a traditional modularized strategy: the entire dialogue flow (natural language understanding, state tracking, dialogue policy, natural language generation, and so on) is divided among components. These components are either hand-crafted according to rules or generated by training models on a supervised dataset. Training each component requires a large amount of annotated data, which tends to be expensive, and this limits the scalability of the system. Meanwhile, the traditional vehicle-mounted voice system depends on a large number of rules and business logic to guarantee its accuracy and stability, which further limits the scale and functions of the system.
Over the whole dialogue processing link, the traditional vehicle-mounted voice architecture takes the user input, performs natural language understanding (namely domain classification, intention recognition and slot recognition), then, combining the dialogue state and dialogue policy in the dialogue management module, selects and executes an application programming interface (Application Programming Interface, API) that meets the user's request, and returns the system output that interacts with the user through the natural language generation module.
In view of this, referring to fig. 2, the dialogue system based on the end-to-end architecture of the present application comprises three core algorithm modules: the slot recognition module extracts the slot information in the voice request input by the user; the action prediction (Action Prediction, AP) module predicts the application program interface that corresponds to the user input and realizes the user's current goal; and the parameter filling (AF) module maps the slot information in the user input onto the parameters of the application program interface obtained in the previous step.
The slot recognition module acquires the slot information of the action-execution subject to be called in the application program interface; the action prediction module determines which application program interface should subsequently be called to realize the user's voice input; and the parameter filling module selects which vehicle components serve as the parameters with which the application program interface is executed.
However, for a user voice request that contains nested slots, the slot recognition process may fail to recognize several nested slots at the same time, causing slot recognition accuracy problems. Taking the slot information of vehicle control and of the graphical user interface as an example, slot nesting arises very easily. For instance, in the user voice request "close the secondary leg rest", the words "secondary driving" belong at the same time to the vehicle-control slot "secondary driving" and to the graphical-user-interface slot "secondary leg rest" (in the original request, the characters of "secondary driving" are the first characters of "secondary leg rest"), so slot recognition may go wrong and the user's interaction experience suffers.
Based on the above problems, referring to fig. 3, the present application provides a voice interaction method. The voice interaction method comprises the following steps:
01: receiving a voice request forwarded by a vehicle;
02: carrying out coding processing on the voice request according to a preset model to obtain a coding sequence matrix for carrying out slot category identification;
03: performing slot recognition on the voice request according to the coding sequence matrix;
04: carrying out application program interface prediction on the voice request;
05: and selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot position identification and the predicted application program interface, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
The application also provides a server. The server includes a processor and a memory having a computer program stored thereon. The processor is used for receiving the voice request forwarded by the vehicle, carrying out coding processing on the voice request according to a preset model to obtain a coding sequence matrix for carrying out slot category recognition, carrying out slot recognition on the voice request according to the coding sequence matrix, carrying out application program interface prediction on the voice request, selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot recognition and the predicted application program interface, and outputting an execution result to be issued to the vehicle to complete voice interaction.
Firstly, the user voice request forwarded by the vehicle is received, and the voice request is encoded according to a preset model. After encoding, the slot information in the user's voice request can be distinguished from the other information, and a coding sequence matrix for slot category identification is obtained. Besides the text content, the coding sequence matrix includes marker characters such as "[CLS]" and "[SEP]", where the "[CLS]" character marks the start of the text. For several consecutive voice requests, a "[SEP]" marker is also placed between every two requests to separate the sentences.
In particular, when a user voice request has the slot-nesting problem, slot recognition can still be performed according to the information in the coding sequence matrix. In that matrix, each element can be regarded as the vector representation of one sub-word sequence of the voice request, and slot recognition is performed according to these vector representations. For example, in the voice request "close the secondary leg rest", the slots "secondary driving" and "secondary leg rest" are nested within each other; since the two nested spans are represented by different vectors, they can be distinguished, and slot recognition yields the vehicle-control slot "secondary driving" and the graphical-user-interface slot "secondary leg rest".
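The nested-slot situation above can be illustrated with a minimal pure-Python sketch (token indices, span coordinates, and label names are illustrative, using an English gloss of the example request): span-level labels let two overlapping slots coexist, which a flat one-label-per-token scheme cannot express.

```python
# Span-level slot annotation: each slot is keyed by its inclusive
# (start, end) token span, so nested/overlapping slots can coexist.
tokens = ["[CLS]", "close", "secondary", "driving", "leg", "rest", "[SEP]"]

slots = {
    (2, 3): "vehicle_control_slot",   # "secondary driving" (the co-driver seat)
    (2, 5): "gui_slot",               # "secondary driving leg rest"
}

def slot_values(tokens, slots):
    """Recover the surface text of every labelled span."""
    return {label: " ".join(tokens[s:e + 1]) for (s, e), label in slots.items()}

values = slot_values(tokens, slots)
```

Both spans start at token 2; only the span keys differ, so the two nested slot values are recovered independently.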
To avoid the excessive labor and data cost of designing slot recognition separately for each vertical domain, the slot recognition scheme here adopts an end-to-end structure: it does not distinguish vertical domains and needs no per-domain model training.
After slot recognition is completed, application program interface prediction can be performed on the voice request according to its slot recognition result. First, the application program interface (API) required by the voice request can be predicted by the action prediction (AP) module on the basis of the slot recognition result. For example, the application program interface predicted for the user voice request "play song A" is application program interface 1, which plays music, while the one predicted for the user voice request "navigate to destination A" is application program interface 2, which performs navigation.
In addition, the parameter filling (AF) module can fill the parameters of the selected application program interface with the slot recognition result, and finally the execution result is output and delivered to the vehicle to complete the voice interaction.
The end-to-end architecture of the application can simplify intermediate modules of the traditional dialogue system architecture, such as a natural language understanding module, a dialogue management module, a car machine instruction generation module, a natural language generation module and the like, reduce the call of a plurality of models with different vertical domains, reduce the delay of a vehicle-mounted system and improve the response speed to user instructions.
In summary, the voice interaction method of the application encodes the voice request to obtain a coding sequence matrix and performs slot recognition on the voice request according to that matrix. Parameters of the predicted application program interface can then be filled according to the slot recognition result, and the execution result is finally output and delivered to the vehicle to complete the voice interaction. Because the voice request is encoded into a coding sequence matrix and slot recognition is performed on that matrix, the accuracy of nested slot recognition can be effectively improved, improving the user's voice interaction experience.
Referring to fig. 4, step 02 includes:
021: performing text sequence coding processing on the voice request to obtain a first coding vector;
022: inputting the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence;
023: and obtaining a coding sequence matrix according to the output matrix and a preset model.
The processor is configured to perform text-sequence encoding on the voice request to obtain a first coding vector, input the first coding vector into a pre-trained model to obtain an output matrix corresponding to each code in the text sequence, and obtain the coding sequence matrix according to the output matrix and the preset model.
Specifically, after receiving the user voice request forwarded by the vehicle, the voice assistant first needs to perform text-sequence encoding on the voice request to obtain the character sequence corresponding to the voice request, called the first coding vector. In one example, the voice request sent by the user is "close the secondary leg rest"; as shown in table 1, text-sequence encoding of this request yields the character sequence "Token" corresponding to the voice request, i.e. the first coding vector, whose content is "[CLS], close, secondary, driving, leg, rest, [SEP]".
After the text-sequence encoding yields the first coding vector, that vector must be input into a pre-trained model. As shown in fig. 5, the pre-trained model may be BERT; the choice of a particular model is not limited herein. After passing through the pre-trained model, an output matrix corresponding to each code in the text sequence is obtained. That is, the output matrix contains all the text-sequence encoding information of the voice request.
Finally, as shown in fig. 5, a coding sequence matrix can be obtained from the output matrix and the preset model. The preset model may be Biaffine; the specific model is not limited herein. The output matrix passes through the preset model to yield the coding sequence matrix, whose information is the basis for slot recognition of the user's voice request.
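The steps above can be sketched at the shape level, assuming toy dimensions and a deterministic stub in place of the pre-trained model (function names are hypothetical; a real implementation would call BERT or a similar encoder):

```python
# Shape-level sketch of steps 021-022: build the "first coding vector"
# with [CLS]/[SEP] markers, then run a stub encoder that stands in for
# the pre-trained model (one HIDDEN-dim vector per code).

HIDDEN = 4  # toy hidden size; real encoders use e.g. 768

def encode_text_sequence(text):
    """Step 021: one code per character, wrapped with [CLS]/[SEP]."""
    return ["[CLS]"] + list(text) + ["[SEP]"]

def stub_pretrained_model(tokens):
    """Step 022 stand-in: maps each code to a deterministic HIDDEN-dim vector."""
    return [[(ord(t[0]) % 7 + d) / 10.0 for d in range(HIDDEN)]
            for t in tokens]

tokens = encode_text_sequence("close leg rest")
output_matrix = stub_pretrained_model(tokens)   # shape: seq_len x HIDDEN
```

The output matrix here plays the role of the per-code encoder outputs; step 023 then feeds it to the preset (e.g. Biaffine) model.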
Thus, the voice request can be encoded according to the preset model to obtain the encoding sequence matrix for identifying the slot class.
Referring to fig. 6, step 023 includes:
0231: extracting a head matrix corresponding to a first code and a tail matrix corresponding to a last code in the output matrix;
0232: and carrying out coding processing on the head matrix and the tail matrix according to a preset model to obtain a coding sequence matrix.
The processor is used for extracting a head matrix corresponding to the first code and a tail matrix corresponding to the last code in the output matrix, and carrying out coding processing on the head matrix and the tail matrix according to a preset model so as to obtain a coding sequence matrix.
Specifically, the first coding vector obtained by text-sequence encoding of the voice request can be passed through the pre-trained model to obtain an output matrix, which contains the information of every code in the first coding vector. The head matrix corresponding to the first code and the tail matrix corresponding to the last code in the output matrix are then extracted; together they can express the nesting relation of slots more completely. For example, for the voice request "close the secondary leg rest", the corresponding first coding vector is "[CLS], close, secondary, driving, leg, rest, [SEP]". The head matrix corresponding to the first code "[CLS]" has size [batch_size, seq_len, hidden] and may contain the first full slot value in the coding sequence, "secondary driving". Similarly, the tail matrix corresponding to the last code "[SEP]" has size [batch_size, hidden, seq_len] and may contain the last full slot value in the coding sequence, "secondary leg rest".
The encoding process may introduce a "biaffine matrix" of size [hidden, hidden, num_label]. According to the preset model, which may be a Biaffine model, the head matrix and the tail matrix are encoded to obtain the coding sequence matrix. This matrix contains the slot information of the voice request and has shape [batch_size, seq_len, seq_len, num_label].
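The biaffine scoring described above can be sketched in pure Python: for every (start, end) pair of positions, the head vector at the start and the tail vector at the end are combined through the [hidden, hidden, num_label] biaffine tensor to produce a per-label score. Dimensions and weights below are toy values chosen for illustration, not from the patent:

```python
# Minimal biaffine scorer:
#   scores[i][j][label] = head[j]^T  W[label]  tail[i]
# where j is the start position, i the end position, and W the biaffine
# tensor. Toy sizes: HIDDEN=2, NUM_LABEL=3; weights are hand-picked.

HIDDEN, NUM_LABEL = 2, 3

def biaffine(head, tail, W):
    """head, tail: seq_len x HIDDEN; W: NUM_LABEL x HIDDEN x HIDDEN.
    Returns a seq_len x seq_len x NUM_LABEL score matrix."""
    n = len(head)
    scores = [[[0.0] * NUM_LABEL for _ in range(n)] for _ in range(n)]
    for i in range(n):              # end position (row)
        for j in range(n):          # start position (column)
            for lab in range(NUM_LABEL):
                s = 0.0
                for a in range(HIDDEN):
                    for b in range(HIDDEN):
                        s += head[j][a] * W[lab][a][b] * tail[i][b]
                scores[i][j][lab] = s
    return scores

head = [[1.0, 0.0], [0.0, 1.0]]
tail = [[1.0, 0.0], [0.0, 1.0]]
W = [[[1.0, 0.0], [0.0, 0.0]],   # label 0 pairs head dim 0 with tail dim 0
     [[0.0, 0.0], [0.0, 1.0]],   # label 1 pairs head dim 1 with tail dim 1
     [[0.0, 1.0], [0.0, 0.0]]]   # label 2 pairs head dim 0 with tail dim 1
scores = biaffine(head, tail, W)
```

A trained model would learn W; here the hand-picked W makes each label fire on one specific head/tail combination so the scoring rule is easy to check.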
Thus, the matrix corresponding to the first code and the last code in the output matrix can be extracted, and the extracted matrix is subjected to coding processing to obtain a coding sequence matrix so as to identify the slot position of the voice request.
Referring to fig. 7, step 03 includes:
031: and identifying the slot position value of the voice request according to the coding sequence matrix and the slot position type corresponding to the slot position value.
The processor is used for identifying the slot position value of the voice request and the slot position type corresponding to the slot position value according to the coding sequence matrix.
Specifically, the slot information of the voice request may be determined from the information in the coding sequence matrix of shape [batch_size, seq_len, seq_len, num_label]. The slot information includes slot values and the slot type corresponding to each slot value.
In one example, the user issues the voice request "close the secondary leg rest", whose corresponding first coding vector is "[CLS], close, secondary, driving, leg, rest, [SEP]", and the coding matrix sequence diagram corresponding to the voice request shown in fig. 8 can be obtained. The figure shows that in the voice request "close the secondary leg rest" there is a slot value "secondary driving" starting with "secondary" and ending with "driving", and a slot value "secondary leg rest" starting with "secondary" and ending with "rest", and the two slot values correspond to different slot types.
Thus, the slot position recognition process can be completed by recognizing the slot position value of the voice request and the slot position type corresponding to the slot position value according to the coding sequence matrix.
Referring to fig. 9, step 031 includes:
0311: all semantic vectors in the voice request are identified according to the coding sequence matrix to identify slot values of the voice request.
The processor is configured to identify all semantic vectors in the voice request based on the code sequence matrix to identify slot values for the voice request.
Specifically, in the coding sequence matrix, the semantic vector corresponding to the sub-coding sequence that starts with the character of column j and ends with the character of row i can be recorded as the coordinate (i, j). When the text corresponding to such a sub-coding sequence contains actual semantics, its semantic vector determines a specific slot value.
In one example, the user issues a voice request "turn off the secondary leg rest", and a coding sequence matrix corresponding to the voice request is obtained as shown in fig. 7, in which semantic vectors corresponding to 8×8 (64 in total) sub-coding sequences exist. Among them there are two slot values: (4, 3) corresponds to the sub-coding sequence starting at column 3 and ending at row 4, i.e. the slot value "secondary driving" beginning with the character corresponding to column 3 and ending with the character corresponding to row 4, and (6, 3) corresponds to the sub-coding sequence starting at column 3 and ending at row 6, i.e. the slot value "secondary leg rest". The text corresponding to the remaining semantic vectors does not contain actual semantics, does not correspond to a specific slot value, and has no corresponding slot type, so 0 is set at the corresponding intersection position; for example, the text corresponding to (2, 3), "secondary closure", carries no actual semantics.
Thus, the slot position value of the voice request can be identified according to all semantic vectors of the voice request in the coding sequence matrix so as to determine the slot position type corresponding to the slot position value, and the slot position identification process is completed.
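The coordinate scheme above can be sketched in a few lines. This is a minimal illustration under assumed conventions (row index = end character, column index = start character, non-zero entries mark spans with actual semantics); the tokens and marker values are hypothetical, not the patent's actual data.

```python
import numpy as np

def extract_slot_spans(span_matrix, tokens):
    """Collect (start, end, text) for every coordinate (i, j) whose entry is
    non-zero, i.e. whose sub-coding sequence carries actual semantics."""
    spans = []
    for i in range(span_matrix.shape[0]):   # row i: end position
        for j in range(i + 1):              # column j: start position, j <= i
            if span_matrix[i, j] != 0:
                spans.append((j, i, " ".join(tokens[j:i + 1])))
    return spans

# Hypothetical 6-token request with two nested slot values sharing a start.
tokens = ["turn", "off", "the", "secondary", "leg", "rest"]
m = np.zeros((6, 6), dtype=int)
m[3, 3] = 1   # slot value "secondary"
m[5, 3] = 2   # nested, longer slot value "secondary leg rest"

print(extract_slot_spans(m, tokens))
```

All other cells stay 0, so only the two marked spans are reported, mirroring how the remaining 62 semantic vectors in the fig. 7 example carry no slot value.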
Referring to fig. 10, step 031 includes:
0312: performing slot type mapping processing on each semantic vector to determine a slot type vector corresponding to each semantic vector;
0313: and determining the slot type corresponding to the slot value according to the slot type vector.
The processor is used for carrying out slot type mapping processing on each semantic vector to determine a slot type vector corresponding to each semantic vector, and determining a slot type corresponding to a slot value according to the slot type vector.
Specifically, in the coding sequence matrix, the semantic vector of each slot value is mapped into a label space to obtain the slot type vector corresponding to each semantic vector. When a semantic vector does not correspond to a slot value, its slot type vector is 0. The mapping may be performed by a fully connected layer; the specific mapping process is not limited herein. Further, decoding a slot type vector yields the slot type corresponding to it. The decoded slot type and the slot value can jointly form the slot recognition result.
In particular, in the coding sequence matrix, there may be cases of slot nesting. For example, when i ≠ k, the slot value corresponding to the sub-coding sequence at coordinates (i, j) and the slot value corresponding to the sub-coding sequence at coordinates (k, j) are nested within each other.
In one example, the user issues a voice request "close the secondary leg rest" and two slot values are obtained: the slot value "secondary driving" corresponding to coordinates (3, 2) and the slot value "secondary leg rest" corresponding to coordinates (5, 2); the two slot values are in a nested relationship. The semantic vectors of the nested slots are mapped into the label space to obtain slot type vector 1 corresponding to "secondary driving" and slot type vector 2 corresponding to "secondary leg rest", where "secondary driving" and "secondary leg rest" have different slot types. Further, the slot type vectors are decoded: the slot type corresponding to slot type vector 1 is "vehicle control (Device)", and the slot type corresponding to slot type vector 2 is "graphical user interface (GUI)".
Therefore, the semantic vector can be mapped to obtain the slot type vector, the slot type corresponding to the slot value is determined according to the slot type vector, the slot identification process of the nested slot is completed, and the accuracy of the slot identification process is improved.
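The fully connected mapping and decoding described above can be sketched as follows. The hidden size, label set, and randomly initialized weights are assumptions for illustration only, not the patent's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_DIM, NUM_LABEL = 8, 3          # assumed sizes
SLOT_TYPES = ["O", "Device", "GUI"]   # hypothetical label set; "O" = no slot

# A fully connected layer mapping a semantic vector into the label space.
W = rng.normal(size=(HIDDEN_DIM, NUM_LABEL))
b = np.zeros(NUM_LABEL)

def slot_type_vector(semantic_vec):
    """Map one span's semantic vector to its slot type vector (label space)."""
    return semantic_vec @ W + b

def decode_slot_type(type_vec):
    """Decode a slot type vector into a slot type name (argmax over labels)."""
    return SLOT_TYPES[int(np.argmax(type_vec))]

semantic_vec = rng.normal(size=HIDDEN_DIM)   # stand-in for one slot value's vector
tv = slot_type_vector(semantic_vec)
print(tv.shape, decode_slot_type(tv))
```

With trained weights, the two nested spans of "close the secondary leg rest" would decode to different labels ("Device" vs. "GUI"), which is exactly the nested-slot case discussed above.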
Referring to fig. 11, step 0312 includes:
03121: performing slot type mapping processing on each semantic vector to obtain the confidence coefficient of each slot type vector relative to the semantic vector;
03122: and determining a slot type vector corresponding to each semantic vector according to the confidence.
The processor is used for carrying out slot type mapping processing on each semantic vector to obtain the confidence coefficient of each slot type vector relative to the semantic vector, and determining the slot type vector corresponding to each semantic vector according to the confidence coefficient.
Specifically, in the coding sequence matrix, the semantic vectors of the slot values need to be mapped into a tag space through mapping processing, so as to obtain the slot type vector corresponding to each semantic vector. The different slot types are represented by their corresponding different slot type vectors.
In particular, when slot nesting exists in the coding sequence matrix, correspondences between multiple slot values and slot types may be extracted. In this case, the confidence of each slot type vector relative to the semantic vector needs to be introduced in order to determine the slot recognition result. For example, when slot nesting exists in the coding sequence matrix, the slot type vector with the higher confidence is determined to be the unique slot type vector. The criterion for determining the slot type vector according to the confidence may also be a comparison with a preset threshold; the specific criterion is not limited herein.
In one example, the user issues a voice request "close the secondary leg rest" and two slot values are obtained. In one possible case, the confidence of slot type vector 1 corresponding to "secondary driving" is 0.1 and the confidence of slot type vector 2 corresponding to "secondary leg rest" is 0.8; slot type vector 2, which has the higher confidence, is then determined to be the slot type vector corresponding to the semantic vector in the user's voice request.
Therefore, the semantic vector can be subjected to slot type mapping processing, and the corresponding slot type vector is determined according to the confidence level, so that the slot type is finally determined, the process of slot identification is completed, and the accuracy of the nested slot identification process is improved.
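The confidence-based disambiguation can be sketched as follows. The logits, the softmax scoring, and the 0.5 threshold are illustrative assumptions; as noted above, the patent leaves the exact criterion open.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pick_slot_by_confidence(candidates, threshold=0.5):
    """Among nested slot candidates, keep the one whose slot-type confidence
    is highest, provided it also clears a preset threshold.

    candidates: list of (slot_value, type_logits) pairs.
    Returns (slot_value, slot_type_index, confidence) or None.
    """
    best = None
    for value, logits in candidates:
        probs = softmax(np.asarray(logits, dtype=float))
        idx = int(np.argmax(probs))
        conf = float(probs[idx])
        if best is None or conf > best[2]:
            best = (value, idx, conf)
    if best is not None and best[2] < threshold:
        return None
    return best

# Hypothetical nested pair from "close the secondary leg rest":
cands = [
    ("secondary driving",  [0.2, 0.1, 0.0]),  # flat logits -> low confidence
    ("secondary leg rest", [3.0, 0.1, 0.0]),  # clear winner -> high confidence
]
print(pick_slot_by_confidence(cands))
```

Here the longer span wins with confidence ≈ 0.9, matching the 0.1 vs. 0.8 example above; a lone low-confidence candidate below the threshold yields no slot at all.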
Referring to fig. 12, step 05 includes:
051: determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
052: and selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot position identification and the target parameter, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
The processor is used for determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type, selecting the predicted application program interface to execute the application program interface parameter filling according to the slot recognition result and the target parameters, and outputting an execution result to be issued to the vehicle to complete voice interaction.
Specifically, the target parameter for slot filling may be determined according to the user's voice request, the slot recognition result, and the predicted application program interface and its type. The target parameter is the slot name corresponding to the slot recognition result. Finally, according to the slot recognition result and the target parameter, the predicted application program interface is selected, parameter filling is executed, and the output execution result is issued to the vehicle to complete the voice interaction.
For example, for the user voice request "help me turn off the fragrance": the parameters of application program interface 1 include "vehicle control", and the corresponding application program interface type is "vehicle control (Device)"; it is therefore determined that the target parameter of application program interface 1 to be filled from the slot recognition result is "vehicle control", and after the "fragrance" in the slot recognition result is filled into vehicle-control application program interface 1, the action of turning off the in-vehicle fragrance can be executed accordingly. Because the voice request also contains the slot [ "fragrance" - graphical user interface (GUI) ], further application program interface prediction needs to be performed, so that the parameters of application program interface 2 include "graphical user interface" and the corresponding application program interface type is "graphical user interface (GUI)"; it is further determined that the target parameter of application program interface 2 to be filled with the "fragrance" in the slot recognition result is "graphical user interface", and after the "fragrance" in the slot recognition result is filled into application program interface 2, the action of turning off the in-vehicle fragrance can be displayed accordingly on the user interaction interface of the in-vehicle system, finally completing the voice interaction process.
As another example, for the user voice request "navigate to Zhongguancun", the slot recognition result is obtained; the parameters of application program interface 2 include two parameters, "departure place" and "destination", and the corresponding application program interface type is the navigation type. It is further determined that the target parameter of application program interface 2 to be filled from the slot recognition result is "destination", so that after "Zhongguancun" in the slot recognition result is filled into navigation application program interface 2, the navigation task of navigating to Zhongguancun can be executed accordingly, completing the voice interaction.
Therefore, the method and the device can select the predicted application program interface to execute the application program interface parameter filling according to the result of the slot position identification and the target parameter, directly output the execution result and issue the execution result to the vehicle to complete the voice interaction, reduce the delay of the vehicle-mounted system and improve the response speed to the user instruction.
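Step 05's select-and-fill flow can be sketched with a toy API registry. The registry, the interface and parameter names, and the returned payload are hypothetical illustrations, not the patent's actual interfaces; a real system would invoke the selected interface and issue the execution result to the vehicle.

```python
# Hypothetical registry mapping predicted API types to interfaces and their
# parameter lists (names assumed for illustration).
API_REGISTRY = {
    "Device":     {"name": "vehicle_control_api", "params": ["device"]},
    "Navigation": {"name": "navigation_api", "params": ["departure", "destination"]},
}

def fill_and_execute(slot_result, api_type, target_param):
    """Fill the predicted interface's target parameter with the recognized
    slot value and return the payload that would be issued to the vehicle."""
    api = API_REGISTRY[api_type]
    if target_param not in api["params"]:
        raise ValueError(f"{target_param!r} is not a parameter of {api['name']}")
    # In a real system the interface would be invoked here and the execution
    # result sent down to the vehicle; this sketch returns the filled call.
    return {"api": api["name"], "args": {target_param: slot_result["slot_value"]}}

slot = {"slot_value": "Zhongguancun", "slot_type": "Navigation"}
print(fill_and_execute(slot, "Navigation", "destination"))
```

The guard on `target_param` reflects the step-051 check that the slot value is matched to the correct parameter of the predicted interface before filling.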
The computer readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the method described above.
In the description of the present specification, reference to the terms "above," "specifically," "particularly," "further," and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (8)

1. A method of voice interaction, comprising:
receiving a voice request forwarded by a vehicle;
performing text sequence coding processing on the voice request to obtain a first coding vector;
inputting the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence;
extracting a head matrix corresponding to a first code and a tail matrix corresponding to a last code in the output matrix;
coding the head matrix and the tail matrix according to a preset model to obtain a coding sequence matrix;
performing slot recognition on the voice request according to the coding sequence matrix;
performing application program interface prediction on the voice request;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position recognition result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
2. The voice interaction method according to claim 1, wherein the performing slot recognition on the voice request according to the coding sequence matrix includes:
and identifying the slot position value of the voice request and the slot position type corresponding to the slot position value according to the coding sequence matrix.
3. The voice interaction method of claim 2, wherein the identifying the slot value of the voice request and the slot type corresponding to the slot value according to the code sequence matrix comprises:
and identifying all semantic vectors in the voice request according to the coding sequence matrix to identify the slot position value of the voice request.
4. A method of voice interaction according to claim 3, wherein said identifying a slot value of the voice request and a slot type corresponding to the slot value from the code sequence matrix comprises:
performing slot type mapping processing on each semantic vector to determine a slot type vector corresponding to each semantic vector;
and determining the slot type corresponding to the slot value according to the slot type vector.
5. The voice interaction method of claim 4, wherein the performing a slot type mapping process on each of the semantic vectors to determine a slot type vector corresponding to each of the semantic vectors comprises:
performing slot type mapping processing on each semantic vector to obtain the confidence coefficient of each slot type vector relative to the semantic vector;
and determining a slot type vector corresponding to each semantic vector according to the confidence.
6. The voice interaction method according to claim 1, wherein the selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot recognition and the predicted application program interface, outputting the execution result and transmitting to a vehicle to complete voice interaction comprises:
determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the target parameter, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
7. A server comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the voice interaction method of any of claims 1-6.
8. A non-transitory computer readable storage medium containing a computer program, characterized in that the voice interaction method of any of claims 1-6 is implemented when the computer program is executed by one or more processors.
CN202310374373.4A 2023-04-07 2023-04-07 Voice interaction method, server and computer readable storage medium Active CN116092495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310374373.4A CN116092495B (en) 2023-04-07 2023-04-07 Voice interaction method, server and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN116092495A CN116092495A (en) 2023-05-09
CN116092495B true CN116092495B (en) 2023-08-29

Family

ID=86202957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310374373.4A Active CN116092495B (en) 2023-04-07 2023-04-07 Voice interaction method, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116092495B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666381A (en) * 2020-06-17 2020-09-15 中国电子科技集团公司第二十八研究所 Task type question-answer interaction system oriented to intelligent control
CN113486669A (en) * 2021-07-06 2021-10-08 上海市东方医院(同济大学附属东方医院) Semantic recognition method for emergency rescue input voice
CN113505591A (en) * 2020-03-23 2021-10-15 华为技术有限公司 Slot position identification method and electronic equipment
CN113887237A (en) * 2021-09-29 2022-01-04 平安普惠企业管理有限公司 Slot position prediction method and device for multi-intention text and computer equipment
CN115064166A (en) * 2022-08-17 2022-09-16 广州小鹏汽车科技有限公司 Vehicle voice interaction method, server and storage medium
CN115457959A (en) * 2022-11-08 2022-12-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium
US20230069049A1 (en) * 2021-08-23 2023-03-02 Robert Bosch Gmbh System and method for a natural language understanding system based on iterative intent detection and slot filling neural layers


Also Published As

Publication number Publication date
CN116092495A (en) 2023-05-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant