CN116665667A - Voice interaction method, voice interaction device, server and computer readable storage medium - Google Patents

Voice interaction method, voice interaction device, server and computer readable storage medium

Info

Publication number
CN116665667A
Authority
CN
China
Prior art keywords
matrix
slot
voice request
application program
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310599110.3A
Other languages
Chinese (zh)
Inventor
丁鹏傑
赵群
宁洪珂
樊骏锋
朱麒宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd filed Critical Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202310599110.3A
Publication of CN116665667A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/09 Arrangements for giving variable traffic instructions
    • G08G1/0962 Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0967 Systems involving transmission of highway information, e.g. weather, speed limits
    • G08G1/096708 Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control
    • G08G1/096725 Systems involving transmission of highway information, e.g. weather, speed limits where the received information might be used to generate an automatic action on the vehicle control where the received information generates an automatic action on the vehicle control
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/09 Arrangements for giving variable traffic instructions
    • G08G1/0962 Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968 Systems involving transmission of navigation instructions to the vehicle
    • G08G1/096877 Systems involving transmission of navigation instructions to the vehicle where the input to the navigation device is provided by a suitable I/O arrangement
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/09 Arrangements for giving variable traffic instructions
    • G08G1/0962 Arrangements for giving variable traffic instructions having an indicator mounted inside the vehicle, e.g. giving voice messages
    • G08G1/0968 Systems involving transmission of navigation instructions to the vehicle
    • G08G1/0969 Systems involving transmission of navigation instructions to the vehicle having a display in the form of a map
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Atmospheric Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice interaction method comprising: receiving a voice request forwarded by a vehicle; encoding the voice request according to a preset model to obtain a coding sequence matrix for slot recognition; dividing the coding sequence matrix; performing slot recognition on the voice request according to the divided coding sequence matrix; performing application program interface prediction on the voice request; and, according to the slot recognition result and the predicted application program interface, selecting the predicted application program interface to execute parameter filling, outputting the execution result and issuing it to the vehicle to complete the voice interaction. By encoding the voice request, performing slot recognition on the divided coding sequence matrix, filling application program interface parameters from the slot recognition result, and finally outputting the execution result and issuing it to the vehicle, the method effectively improves the accuracy of recognizing slots composed of discontinuous words in a voice request and thereby improves the user's voice interaction experience.

Description

Voice interaction method, voice interaction device, server and computer readable storage medium
Technical Field
The present application relates to the field of vehicle-mounted voice technologies, and in particular, to a voice interaction method, a voice interaction device, a server, and a computer readable storage medium.
Background
A current dialogue system uses a natural language understanding module to parse a user's utterance into semantic labels that a machine can understand, maintains an internal dialogue state as a compact representation of the whole dialogue history through a dialogue state tracking module, selects a suitable dialogue action with a dialogue policy module according to that state, and finally converts the dialogue action into a natural language reply through a natural language generation module. In a real dialogue scene, a user's voice request may not exactly hit the expected slot information; when the slot to be recognized appears discontinuously in the voice request, recognition in the related art may be wrong and the expected slot result cannot be extracted, so that voice interaction in the vehicle-mounted environment lacks fluency and can hardly meet the vehicle control requirements of the vehicle-mounted scene.
Disclosure of Invention
The embodiment of the application provides a voice interaction method, a voice interaction device, a server and a computer readable storage medium.
The voice interaction method of the embodiment of the application comprises the following steps:
receiving a voice request forwarded by a vehicle;
performing coding processing on the voice request according to a preset model to obtain a coding sequence matrix for carrying out slot recognition;
dividing the coding sequence matrix;
performing slot recognition on the voice request according to the coding sequence matrix after the division processing;
performing application program interface prediction on the voice request;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot recognition result and the predicted application program interface, outputting an execution result and issuing it to the vehicle to complete the voice interaction.
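The claimed steps can be pictured as one server-side pipeline. The sketch below is purely illustrative: every function name (`encode_request`, `predict_api`, etc.) is a hypothetical placeholder, and the bodies are trivial stand-ins for the patented models.

```python
# Illustrative server-side pipeline for the claimed method.
# All function names and bodies are hypothetical placeholders.

def encode_request(text):
    # Step 2: encode the request into a pairwise "coding sequence
    # matrix" (a trivial stand-in indexing every token pair).
    tokens = list(text)
    n = len(tokens)
    return tokens, [[(i, j) for j in range(n)] for i in range(n)]

def recognize_slots(tokens, lower):
    # Step 4 placeholder: a real model scores spans in the divided matrix.
    return {"slot": "".join(tokens)}

def predict_api(text):
    # Step 5 placeholder: action prediction.
    return "api_1"

def execute(api, slots):
    # Step 6 placeholder: parameter filling and execution.
    return {"api": api, "params": slots}

def handle_voice_request(text):
    tokens, matrix = encode_request(text)           # step 2: encoding
    lower = [[matrix[i][j] for j in range(i + 1)]   # step 3: divide along
             for i in range(len(matrix))]           #         the diagonal
    slots = recognize_slots(tokens, lower)          # step 4: slot recognition
    api = predict_api(text)                         # step 5: API prediction
    return execute(api, slots)                      # step 6: fill and execute
```

The result returned by `handle_voice_request` stands in for the execution result that would be issued back to the vehicle.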
Thus, the voice interaction method of the embodiment of the application encodes the voice request to obtain a coding sequence matrix, performs slot recognition on the voice request according to the divided coding sequence matrix, fills the parameters of the predicted application program interface according to the slot recognition result, and finally outputs the execution result and issues it to the vehicle to complete the voice interaction. In the embodiment of the application, because the voice request is encoded into a coding sequence matrix and slot recognition is performed on that matrix, the accuracy of recognizing slots composed of discontinuous words in a user's voice request is effectively improved, which improves the user's voice interaction experience.
In some embodiments, the encoding the voice request according to a preset model to obtain a coding sequence matrix for performing slot recognition includes:
performing text sequence coding processing on the voice request to obtain a first coding vector;
inputting the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence;
and obtaining the coding sequence matrix according to the output matrix and the preset model.
Thus, the voice request can be encoded according to the preset model to obtain the encoding sequence matrix for carrying out slot recognition.
In some embodiments, the obtaining the coding sequence matrix according to the output matrix and the preset model includes:
extracting the head matrix corresponding to the first code and the tail matrix corresponding to the last code in the output matrix;
and encoding the head matrix and the tail matrix according to the preset model to obtain the coding sequence matrix.
Thus, the matrices corresponding to the first code and the last code in the output matrix can be extracted and encoded to obtain the coding sequence matrix, with which slot recognition is performed on the voice request.
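One common token-pair formulation consistent with this step (one possible reading, not necessarily the patented implementation) projects each code's output vector into a head representation and a tail representation and takes their pairwise scores as the coding sequence matrix. A numpy sketch, with random matrices standing in for both the pre-training model and the preset model:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, proj = 6, 16, 8

# Stand-in for the pre-training model's output matrix:
# one hidden vector per code (token) in the text sequence.
output = rng.normal(size=(seq_len, hidden))

# Hypothetical "preset model": two linear projections.
W_head = rng.normal(size=(hidden, proj))
W_tail = rng.normal(size=(hidden, proj))

heads = output @ W_head   # head representation of every code
tails = output @ W_tail   # tail representation of every code

# Pairwise scores: entry (i, j) scores the span from code i to code j.
coding_sequence_matrix = heads @ tails.T
```

The resulting square matrix is the object that the later steps divide along the diagonal.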
In some embodiments, the dividing the coding sequence matrix includes:
dividing the coding sequence matrix along its diagonal to obtain a first sub-matrix and a second sub-matrix;
and the performing slot recognition on the voice request according to the divided coding sequence matrix includes:
performing slot recognition on the voice request according to the first sub-matrix and the second sub-matrix.
Thus, the coding sequence matrix can be divided, and slot recognition can be performed on the voice request with the resulting sub-matrices, making the slot recognition result more accurate.
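Dividing a square matrix along its diagonal into a lower and an upper triangular part can be sketched with numpy. This is a minimal illustration; which triangle serves as the "first" sub-matrix is a design choice not fixed here.

```python
import numpy as np

# A toy coding sequence matrix for a 5-code sequence.
m = np.arange(25).reshape(5, 5)

# Divide along the diagonal: one sub-matrix keeps the lower triangle
# (including the diagonal), the other the strict upper triangle.
first_sub = np.tril(m)        # lower triangular part
second_sub = np.triu(m, k=1)  # upper triangular part, diagonal excluded
```

Together the two sub-matrices contain every entry of the original matrix exactly once, so no pairwise information is lost by the division.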
In some embodiments, the performing slot recognition on the voice request according to the first sub-matrix and the second sub-matrix includes:
determining a target coding subsequence for slot identification according to the first submatrix;
and carrying out slot recognition on the voice request according to the target coding subsequence and the second submatrix.
Thus, the target coding subsequence can be obtained from the first sub-matrix, and the slot recognition process is completed in combination with the second sub-matrix, making the recognition of slots composed of discontinuous words more accurate.
In some embodiments, the determining a target coding subsequence for slot recognition according to the first sub-matrix includes:
recognizing all semantic vectors in the voice request;
marking head and tail position identifiers of a target slot value in the first sub-matrix according to a preset slot value table, the target slot value being determined by the correspondence between the semantic vectors and the slot value table;
and determining the target coding subsequence in which the target slot value is located according to the head and tail position identifiers.
Thus, using all the recognized semantic vectors and the slot value table, the head and tail position identifiers can mark, in the first sub-matrix, the target coding subsequence in which the target slot value is located, so that the target slot value can subsequently be determined. This completes the slot recognition of discontinuous words and improves both the accuracy of slot recognition and the fluency of the voice interaction.
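The head/tail marking can be illustrated with a toy slot value table. The table contents, token granularity, and matching logic below are invented for the example; the patent does not specify them.

```python
# Toy slot value table (invented for illustration): each target slot
# value lists the pieces that may appear, in order, in a request.
slot_values = {"refrigerator lock": ["refrigerator", "lock"]}

def mark_head_tail(tokens, slot_values):
    """Return (slot_value, head, tail) when every piece of a slot
    value occurs, in order, in the token sequence (possibly with
    other tokens in between)."""
    for value, pieces in slot_values.items():
        positions, start = [], 0
        for piece in pieces:
            try:
                idx = tokens.index(piece, start)
            except ValueError:
                break
            positions.append(idx)
            start = idx + 1
        else:
            # Head = first piece's position, tail = last piece's position.
            return value, positions[0], positions[-1]
    return None

tokens = ["please", "refrigerator", "un", "lock"]
result = mark_head_tail(tokens, slot_values)
# The target coding subsequence spans positions 1..3, skipping "un".
```

The returned head/tail pair delimits the target coding subsequence that the next step refines with the second sub-matrix.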
In some embodiments, the performing slot recognition on the voice request according to the target coding subsequence and the second sub-matrix includes:
marking, in the second sub-matrix, adjacency identifiers between the codes of the target coding subsequence according to the target slot value;
and splicing the codes that have an adjacency relation according to the adjacency identifiers to obtain the target slot value, i.e. the result of slot recognition on the voice request.
Thus, the adjacency relations among the codes of the target coding subsequence can be marked through the second sub-matrix, and the codes with adjacency relations can be spliced into the target slot value, completing the recognition of slots composed of discontinuous words and improving the accuracy of slot recognition and the fluency of the voice interaction.
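Splicing codes marked as adjacent can be sketched as follows. Here the adjacency pairs are hand-written; in the method they would be read off the second sub-matrix.

```python
def splice(tokens, adjacent_pairs):
    """Concatenate tokens along marked adjacency links, so that
    discontinuous pieces are joined into one slot value."""
    nxt = dict(adjacent_pairs)       # follow the chain of (i, j) links
    start = adjacent_pairs[0][0]
    out, i = [tokens[start]], start
    while i in nxt:
        i = nxt[i]
        out.append(tokens[i])
    return "".join(out)

# "unlock the refrigerator" in Chinese is five characters: 把 冰 箱 解 锁.
# The model would mark 冰-箱 and 箱-锁 as adjacent, skipping 解.
tokens = ["把", "冰", "箱", "解", "锁"]
pairs = [(1, 2), (2, 4)]             # hypothetical adjacency identifiers
slot_value = splice(tokens, pairs)   # → "冰箱锁" ("refrigerator lock")
```

The splice skips the intervening character, which is exactly the discontinuous-slot case that motivates the method.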
In some embodiments, the selecting the predicted application program interface to execute application program interface parameter filling according to the slot recognition result and the predicted application program interface, outputting the execution result, and issuing it to the vehicle to complete the voice interaction includes:
determining target parameters for slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
and selecting the predicted application program interface to execute parameter filling according to the slot recognition result and the target parameters, outputting the execution result and issuing it to the vehicle to complete the voice interaction.
Thus, in the embodiment of the application, the predicted application program interface is selected to execute parameter filling according to the slot recognition result and the target parameters, and the execution result is directly output and issued to the vehicle to complete the voice interaction, which reduces the latency of the vehicle-mounted system and improves the response speed to user instructions.
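Parameter filling can be pictured as mapping recognized slot values onto the parameters of the predicted API. The API registry and parameter names below are invented for illustration; the patent does not enumerate its interfaces.

```python
# Hypothetical API registry: each API lists the parameters it accepts.
APIS = {
    "vehicle_control": {"params": ["device", "action"]},
    "play_music":      {"params": ["song"]},
}

def fill_and_execute(api_name, slots):
    """Fill the predicted API's parameters from the slot recognition
    results and return an execution payload for the vehicle."""
    accepted = APIS[api_name]["params"]
    params = {k: v for k, v in slots.items() if k in accepted}
    return {"api": api_name, "params": params, "status": "executed"}

result = fill_and_execute(
    "vehicle_control",
    {"device": "refrigerator lock", "action": "unlock", "speaker": "driver"},
)
# Only parameters the chosen API accepts are filled; "speaker" is dropped.
```

The payload stands in for the execution result that would be issued to the vehicle.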
The voice interaction device of the embodiment of the application comprises:
the receiving module is used for receiving the voice request forwarded by the vehicle;
the coding module is used for carrying out coding processing on the voice request according to a preset model so as to obtain a coding sequence matrix for carrying out slot identification;
the processing module is used for dividing the coding sequence matrix;
the slot recognition module is used for performing slot recognition on the voice request according to the coding sequence matrix after the division processing;
the interface prediction module is used for performing application program interface prediction on the voice request;
and the parameter filling module is used for selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the predicted application program interface, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
The server of the embodiment of the application comprises a processor and a memory; the memory stores a computer program which, when executed by the processor, implements the voice interaction method described above.
The computer-readable storage medium of an embodiment of the present application stores a computer program that, when executed by one or more processors, implements the voice interaction method of any of the above embodiments.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a dialogue system in the related art;
FIG. 2 is a schematic diagram of the architecture of a dialog system of an end-to-end architecture of an embodiment of the present application;
FIG. 3 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a voice interaction method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a coding sequence matrix of a voice interaction method according to an embodiment of the present application;
FIG. 6 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a process flow of a voice interaction method according to an embodiment of the present application;
FIG. 8 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 9 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 10 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 11 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 12 is a flow chart of a voice interaction method according to an embodiment of the present application;
FIG. 13 is a flow chart of a voice interaction method according to an embodiment of the present application;
fig. 14 is a schematic diagram of a coding sequence matrix of a voice interaction method according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the embodiments of the present application and are not to be construed as limiting the embodiments of the present application.
Referring to fig. 1, the conventional vehicle-mounted voice architecture follows a modular strategy in which the entire dialogue flow (natural language understanding, state tracking, dialogue policy, natural language generation, etc.) is divided among components. These components are either hand-crafted according to rules or produced by training models on supervised datasets. Training each component requires a large amount of annotated data, which tends to be expensive and limits the scalability of the system. Meanwhile, a traditional vehicle-mounted voice system depends on a large number of rules and business logic to guarantee its accuracy and stability, which further limits its scale and functionality.
Over the whole processing link of the dialogue, the traditional vehicle-mounted voice architecture takes the user input, performs natural language understanding (domain classification, intention recognition and slot recognition), then, combining the dialogue state and the dialogue policy in the dialogue management module, selects and executes an application program interface (Application Programming Interface, API) that meets the user's request, and returns the system output that interacts with the user through the natural language generation module.
In view of this, referring to fig. 2, the dialogue system based on the end-to-end architecture of the embodiment of the present application includes three core algorithm modules: the slot recognition module extracts the slot information from the voice request input by the user; the action prediction (Action Prediction, AP) module predicts the application program interface that corresponds to the user input and realizes the user's current goal; and the parameter filling (AF) module maps the slot information in the user input onto the parameters of the application program interface obtained in the previous step.
The slot recognition module obtains the slot information of the action execution subject to be called in the application program interface, the action prediction module determines which application program interface should be called to realize the user's voice input, and the parameter filling module selects which vehicle components are used as parameters when the application program interface is executed.
However, for the diversified voice requests issued by users, the slot recognition process may fail to recognize a slot composed of discontinuous characters in a voice request, which is a problem of slot recognition accuracy. For example, for the target slot "refrigerator lock", a user may not use a standard expression such as "open/close the refrigerator lock" but instead say "unlock the refrigerator". Because the related art cannot skip over the intervening characters, the slot value "refrigerator lock" cannot be recognized accurately: slot recognition is wrong or fails entirely, and the extracted slot cannot meet the requirement. In the related art, to handle such expressions, "refrigerator unlock" is usually normalized to "refrigerator lock" in a normalization stage, but this adds business logic and increases development and maintenance costs.
Based on the above problems, referring to fig. 3, an embodiment of the present application provides a voice interaction method. The voice interaction method comprises the following steps:
01: receiving a voice request forwarded by a vehicle;
02: carrying out coding processing on the voice request according to a preset model to obtain a coding sequence matrix for carrying out slot recognition;
03: dividing the coding sequence matrix;
04: carrying out slot recognition on the voice request according to the coding sequence matrix after the division processing;
05: carrying out application program interface prediction on the voice request;
06: and selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot position identification and the predicted application program interface, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
Referring to fig. 4, an embodiment of the present application provides a voice interaction device 100. The voice interaction method of the embodiment of the present application may be implemented by the voice interaction device 100. Specifically, the voice interaction device 100 includes a receiving module 101, an encoding module 102, a processing module 103, a slot recognition module 104, an interface prediction module 105 and a parameter filling module 106. The receiving module 101 is configured to receive the voice request forwarded by the vehicle. The encoding module 102 is configured to encode the voice request according to the preset model to obtain a coding sequence matrix for slot recognition. The processing module 103 is configured to divide the coding sequence matrix. The slot recognition module 104 is configured to perform slot recognition on the voice request according to the divided coding sequence matrix. The interface prediction module 105 is configured to perform application program interface prediction on the voice request. The parameter filling module 106 is configured to select the predicted application program interface to execute parameter filling according to the slot recognition result and the predicted application program interface, output the execution result and issue it to the vehicle to complete the voice interaction.
The embodiment of the application also provides a server. The server includes a processor and a memory having a computer program stored thereon. The processor is used for receiving the voice request forwarded by the vehicle, carrying out coding processing on the voice request according to a preset model to obtain a coding sequence matrix for carrying out slot recognition, carrying out dividing processing on the coding sequence matrix, carrying out slot recognition on the voice request according to the coding sequence matrix after dividing processing, carrying out application program interface prediction on the voice request, selecting the predicted application program interface to execute application program interface parameter filling according to the result of slot recognition and the predicted application program interface, and outputting an execution result to be issued to the vehicle to complete voice interaction.
Specifically, the server first receives the user's voice request forwarded by the vehicle and encodes the voice request according to the preset model. After encoding, the slot information in the voice request can be distinguished from the other information, and a coding sequence matrix for slot recognition is obtained. Besides the text content, the coding sequence matrix includes marker characters such as "[CLS]" and "[SEP]". The [CLS] character marks the start of the text; for several consecutive voice requests, a [SEP] marker is also placed between requests to separate two sentences. Fig. 5 shows the coding sequence matrix corresponding to the voice request "unlock the refrigerator".
After the voice request is encoded into its coding sequence, the coding sequence matrix needs to be divided in order to locate continuous or discontinuous slot information in the voice request. In some embodiments, the coding sequence matrix is divided along the diagonal into an upper triangular matrix and a lower triangular matrix; the result of dividing the coding sequence matrix of the voice request "unlock the refrigerator" along its diagonal is shown in fig. 5. Different embodiments may divide the coding sequence matrix differently, and the specific method is not limited here.
In particular, when the slot information to be recognized is discontinuous in the user's voice request, the voice request can still be recognized from the information in the divided coding sequence matrix. After the coding sequence matrix is divided along the diagonal into upper and lower triangular matrices, each element of the lower triangular matrix can be regarded as the vector representation of one sub-word sequence of the voice request, and slot recognition can be performed on the user's voice request according to these sub-word sequence representations. As shown in fig. 5, for the voice request "unlock the refrigerator", the two discontinuous slot pieces "refrigerator" and "lock" can be recognized and spliced into the final slot recognition result "refrigerator lock". It can be understood that this result is the same as the slot recognition result for a similar voice request containing the complete target slot, such as "open the refrigerator lock".
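Reading each lower-triangle cell (i, j), i ≥ j, as the sub-word sequence from code j to code i, the discontinuous pieces can be located as follows. This is a toy illustration in which string matching stands in for the model's scoring of cells.

```python
# Toy stand-in: each lower-triangle cell (i, j) represents the
# sub-word sequence tokens[j..i] of the request.
tokens = ["把", "冰", "箱", "解", "锁"]   # "unlock the refrigerator"

def lower_triangle_spans(tokens):
    return {(i, j): "".join(tokens[j:i + 1])
            for i in range(len(tokens)) for j in range(i + 1)}

spans = lower_triangle_spans(tokens)

# A model would score the cells; here we simply look up the two
# pieces of the target slot among the represented sub-sequences.
pieces = [s for s in ("冰箱", "锁") if s in spans.values()]
slot = "".join(pieces)   # splice the discontinuous pieces
# slot == "冰箱锁" ("refrigerator lock"), skipping "解"
```

The example shows why the lower triangle suffices: every contiguous sub-word sequence of the request corresponds to exactly one of its cells.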
To avoid the excessive labor and data costs of designing a separate slot recognition model for each vertical domain, the slot recognition scheme of the embodiment of the application adopts an end-to-end structure: it does not distinguish vertical domains and needs no per-domain model training.
After slot recognition is completed, application program interface prediction can be performed on the voice request according to the slot recognition result. First, the application program interface (API) required by the voice request can be predicted by the action prediction (AP) module based on the slot recognition result. For example, for the user voice request "play song A", the predicted application program interface is application program interface 1, which plays music; for the user voice request "navigate to destination A", the predicted application program interface is application program interface 2, which performs navigation.
In addition, the parameter filling (AF) module fills the parameters of the application program interface with the selected slot recognition results, and the execution result is finally output and issued to the vehicle to complete the voice interaction.
The end-to-end architecture of the embodiment of the application can remove the intermediate modules of the traditional dialogue system architecture, such as the natural language understanding module, the dialogue management module, the vehicle instruction generation module and the natural language generation module, reduce the calls to multiple vertical-domain models, lower the latency of the vehicle-mounted system and improve the response speed to user instructions.
In summary, the voice interaction method of the embodiment of the application encodes the voice request into a coding sequence matrix, performs slot recognition on the voice request according to the divided coding sequence matrix, fills the parameters of the predicted application program interface according to the slot recognition result, and finally outputs the execution result and issues it to the vehicle to complete the voice interaction. Because the voice request is encoded into a coding sequence matrix and slot recognition is performed on that matrix, the accuracy of recognizing slots composed of discontinuous words in a user's voice request is effectively improved, improving the user's voice interaction experience. In addition, no special handling of the target slot is needed in a normalization stage, effectively saving development and maintenance costs.
Referring to fig. 6, in some embodiments, step 02 includes:
021: performing text sequence coding processing on the voice request to obtain a first coding vector;
022: inputting the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence;
023: and obtaining a coding sequence matrix according to the output matrix and a preset model.
In some embodiments, the encoding module 102 is configured to perform text sequence encoding processing on the voice request to obtain a first coding vector, input the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence, and obtain a coding sequence matrix according to the output matrix and the preset model.
In some embodiments, the processor is configured to perform text sequence encoding processing on the voice request to obtain a first coding vector, input the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence, and obtain a coding sequence matrix according to the output matrix and the preset model.
Specifically, after receiving a user voice request forwarded by the vehicle, the server first needs to perform text sequence encoding processing on the voice request to obtain the character sequence corresponding to the voice request, which is called the first coding vector. In one example, the voice request sent by the user is "unlock the refrigerator". As shown in fig. 7, text sequence encoding is performed on the voice request to obtain the character sequence ("Token") corresponding to the voice request, i.e. the first coding vector, whose content is "[CLS], handle, ice, box, unlock, lock, [SEP]".
After the text sequence encoding process yields the first coding vector, the first coding vector needs to be input into a pre-training model. As shown in fig. 7, the pre-training model may be BERT; the specific model is not limited herein. After the pre-training model processes the input, an output matrix corresponding to each code in the text sequence is obtained. That is, the output matrix includes all the text-sequence-encoded information of the voice request.
Finally, a coding sequence matrix can be obtained according to the output matrix and the preset model, as shown in figs. 5 and 7. The preset model may be a Biaffine model; the specific model is not limited herein. The output matrix passes through the preset model to obtain the coding sequence matrix, whose information is the basis for performing slot recognition on the user's voice request.
Thus, the voice request can be encoded according to the preset model to obtain the coding sequence matrix for carrying out slot recognition.
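This encoding step can be sketched as follows; the encoder here is a random stand-in for the pre-training model (BERT in fig. 7), and the token names, hidden size, and random seed are illustrative assumptions, not values from the text:

```python
import numpy as np

def encode_request(tokens, hidden=8, seed=0):
    """Stand-in for the pre-training model: wraps the request in [CLS]/[SEP]
    (the first coding vector) and maps each code to a hidden vector,
    yielding the output matrix of shape [seq_len, hidden]."""
    rng = np.random.default_rng(seed)
    seq = ["[CLS]"] + list(tokens) + ["[SEP]"]   # text sequence encoding
    return seq, rng.standard_normal((len(seq), hidden))

# Per-character tokens of "unlock the refrigerator" as rendered in the text.
seq, output_matrix = encode_request(["handle", "ice", "box", "unlock", "lock"])
print(seq[0], seq[-1], output_matrix.shape)  # [CLS] [SEP] (7, 8)
```

A real implementation would replace the random projection with a forward pass of the chosen pre-training model; only the shapes matter for the following steps.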
Referring to fig. 8, in some embodiments, step 023 includes:
0231: extracting a head matrix corresponding to a first code and a tail matrix corresponding to a last code in the output matrix;
0232: and carrying out coding processing on the head matrix and the tail matrix according to a preset model to obtain a coding sequence matrix.
In some embodiments, the encoding module 102 is configured to extract a head matrix corresponding to a first encoding and a tail matrix corresponding to a last encoding in the output matrix, and perform encoding processing on the head matrix and the tail matrix according to a preset model to obtain a coding sequence matrix.
In some embodiments, the processor is configured to extract a head matrix corresponding to a first code and a tail matrix corresponding to a last code in the output matrix, and perform a coding process on the head matrix and the tail matrix according to a preset model to obtain a coding sequence matrix.
Specifically, the first coding vector obtained after text sequence encoding of the voice request can be passed through the pre-training model to obtain an output matrix. The output matrix contains the information of each code in the first coding vector. The head matrix corresponding to the first code and the tail matrix corresponding to the last code in the output matrix are extracted; together they can express the relations between the different slots in the sentence more completely. For example, for the voice request "unlock the refrigerator", the corresponding first coding vector is "[CLS], handle, ice, box, unlock, lock, [SEP]". The head matrix corresponding to the first code "[CLS]" has size [batch_size, seq_len, hidden] and may contain the first full slot value "refrigerator" in the code sequence. Similarly, the tail matrix corresponding to the last code "[SEP]" has size [batch_size, hidden, seq_len] and may contain the last full slot value "refrigerator unlock" in the code sequence.
The encoding process may introduce a Biaffine matrix of size [hidden, hidden, num_label]. According to the preset model, which may be a Biaffine model, the head matrix and the tail matrix are encoded to obtain the coding sequence matrix. The coding sequence matrix contains the slot information of the voice request, and its shape is [batch_size, seq_len, num_label].
Thus, the matrices corresponding to the first code and the last code in the output matrix can be extracted and encoded to obtain the coding sequence matrix used for slot recognition of the voice request.
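The head/tail extraction and biaffine encoding can be sketched as follows, assuming the shapes named in the text: head matrix [batch_size, seq_len, hidden], tail matrix [batch_size, hidden, seq_len], and Biaffine matrix [hidden, hidden, num_label]. Because the diagonal division described later implies a square character-pair grid, this sketch assumes the result is a [batch_size, seq_len, seq_len, num_label] score grid:

```python
import numpy as np

def biaffine_scores(H, T, W):
    """Biaffine encoding: score[b, i, j, l] = H[b, i, :] @ W[:, :, l] @ T[b, :, j],
    a square grid scoring every (row i, column j) character pair per label."""
    return np.einsum('bih,hgl,bgj->bijl', H, W, T)

batch, seq_len, hidden, num_label = 1, 7, 8, 3
rng = np.random.default_rng(0)
H = rng.standard_normal((batch, seq_len, hidden))      # head matrix
T = rng.standard_normal((batch, hidden, seq_len))      # tail matrix
W = rng.standard_normal((hidden, hidden, num_label))   # Biaffine matrix
S = biaffine_scores(H, T, W)
print(S.shape)  # (1, 7, 7, 3)
```

In practice an argmax over the label dimension turns the score grid into the discrete coding sequence matrix used below.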
Referring to fig. 9, in some embodiments, step 03 includes:
031: dividing the coding sequence matrix according to diagonal lines of the coding sequence matrix to obtain a first submatrix and a second submatrix;
032: carrying out slot recognition on the voice request according to the coding sequence matrix after the division processing;
033: and carrying out slot recognition on the voice request according to the first sub-matrix and the second sub-matrix.
In some embodiments, the processing module 103 is configured to divide the coding sequence matrix according to a diagonal of the coding sequence matrix to obtain a first sub-matrix and a second sub-matrix, and to perform slot recognition on the voice request according to the divided coding sequence matrix, that is, according to the first sub-matrix and the second sub-matrix.
In some embodiments, the processor is configured to divide the coding sequence matrix according to a diagonal of the coding sequence matrix to obtain a first sub-matrix and a second sub-matrix, and to perform slot recognition on the voice request according to the divided coding sequence matrix, that is, according to the first sub-matrix and the second sub-matrix.
Specifically, the slot information of the voice request may be determined according to information in the coding sequence matrix having the shape of [ batch_size, seq_len, num_label ].
In one example, the user issues the voice request "unlock the refrigerator", whose corresponding first coding vector is "[CLS], handle, ice, box, unlock, lock, [SEP]", and the coding sequence matrix diagram corresponding to the voice request shown in fig. 5 is obtained. The coding sequence matrix may be divided along its diagonal into a first sub-matrix, the lower triangular matrix, and a second sub-matrix, the upper triangular matrix.
According to the information displayed in the first sub-matrix of fig. 5, the voice request "unlock the refrigerator" contains a slot value "refrigerator" starting with "ice" and ending with "box", and a single-character slot value "lock".
Therefore, the coding sequence matrix can be divided, and slot recognition can be performed on the voice request through the sub-matrices obtained by the division, making the slot recognition result more accurate.
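The diagonal division itself is simple; a sketch assuming a square seq_len × seq_len label grid (label dimension omitted for brevity):

```python
import numpy as np

def split_by_diagonal(grid):
    """Divide a square coding sequence matrix along its diagonal into the
    first sub-matrix (lower triangle, span marks) and the second
    sub-matrix (strict upper triangle, adjacency marks)."""
    return np.tril(grid), np.triu(grid, k=1)

grid = np.arange(16).reshape(4, 4)
first, second = split_by_diagonal(grid)
print(first[0, 1], second[1, 0])  # 0 0 — each half zeroes the other side
```

Whether the diagonal itself belongs to the lower or upper triangle is a design choice; here it is kept in the first sub-matrix so single-character slots (cells (i, i)) land in the span half.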
Referring to fig. 10, in some embodiments, step 033 includes:
0331: determining a target coding subsequence for slot identification according to the first submatrix;
0332: and carrying out slot identification on the voice request according to the target coding subsequence and the second submatrix.
In some embodiments, the processing module 103 is configured to determine a target coding sub-sequence for performing slot identification according to the first sub-matrix, and perform slot identification on the voice request according to the target coding sub-sequence and the second sub-matrix.
In some embodiments, the processor is configured to determine a target coding sub-sequence for slot identification based on the first sub-matrix, and to slot identify the voice request based on the target coding sub-sequence and the second sub-matrix.
Specifically, the first sub-matrix is the lower triangular matrix obtained by dividing the coding sequence matrix along its diagonal. In this lower triangular matrix, a specific identifier may be used to mark the target coding subsequence for slot recognition. The target coding subsequence is the coding subsequence in the coding sequence matrix that contains the slot recognition result. For example, for the voice request "unlock the refrigerator", the slot recognition result is "refrigerator lock", and the target coding subsequence beginning with "ice" and ending with "lock" can then be marked in the coding sequence matrix with a specific identifier according to the first sub-matrix.
Further, after the target coding subsequence for slot recognition is determined according to the first sub-matrix, i.e. the lower triangular matrix, slot recognition can be further performed on the voice request in combination with the second sub-matrix, i.e. the upper triangular matrix. In the process of recognizing slots through the upper triangular matrix, a specific identifier is used to indicate whether a semantic connection relation exists between two characters. For example, in the voice request "unlock the refrigerator", "ice" and "box" are in a connection relation, as are "box" and "lock", while the other characters are not in a connection relation included in the target slot. It should be noted that the connection relation referred to here includes not only sequential adjacency between two characters but also semantic continuity. For example, "box" and "lock" are not adjacent in the sentence; after the slot "refrigerator unlock" in the sentence is recognized, the slot can be converted to finally obtain the slot "refrigerator lock".
Thus, the target coding subsequence can be obtained according to the first submatrix, and the slot position recognition process is completed by combining the second submatrix, so that the recognition result of the slot position formed by discontinuous words is more accurate.
Referring to fig. 11, in some embodiments, step 0331 includes:
03311: identifying all semantic vectors in the voice request;
03312: marking head and tail position marks of target slot position values in the first sub-matrix according to a preset slot position value table;
03313: and determining a target coding subsequence where the target slot position value is located according to the head-tail position identification.
In some embodiments, the processing module 103 is configured to identify all semantic vectors in the voice request, mark the head-tail position identifier of the target slot value in the first sub-matrix according to a preset slot value table, where the target slot value is determined according to the correspondence between the semantic vectors and the slot value table, and determine the target coding subsequence in which the target slot value is located according to the head-tail position identifier.
In some embodiments, the processor is configured to identify all semantic vectors in the voice request, and mark, according to a preset slot value table, a head-tail position identifier of a target slot value in the first submatrix, where the target slot value is determined according to a correspondence between the semantic vectors and the slot value table, and determine, according to the head-tail position identifier, a target coding subsequence in which the target slot value is located.
Specifically, in the first sub-matrix obtained after dividing the coding sequence matrix, the semantic vector corresponding to the target coding subsequence starting with the character in column j and ending with the character in row i may be denoted by the coordinates (i, j). The target coding subsequence may contain a specific slot value.
In one example, the user sends the voice request "unlock the refrigerator", and the coding sequence matrix diagram corresponding to the voice request, containing 7 × 7 = 49 semantic vectors, is obtained as shown in fig. 5. The head-tail position identifier of the target slot value is marked in the first sub-matrix according to a preset slot value table. The preset slot value table may include all possible slot values in the current vehicle-mounted system, and the target slot value may be determined according to the correspondence between the semantic vectors and the slot value table and serve as the final slot recognition result. For example, in the current voice request "unlock the refrigerator", (5, 2) represents the target coding subsequence "refrigerator unlock" starting at position 2 and ending at position 5. A head-tail position identifier may be set at position (5, 2) in the first sub-matrix; the identifier may be THW (Tail-Head-Word). It can finally be determined that the subsequence corresponding to the position of the head-tail position identifier is the target coding subsequence, i.e. the target slot value can be obtained from the target coding subsequence. The text corresponding to the remaining semantic vectors does not contain actual semantics and therefore does not correspond to a specific target coding subsequence.
Therefore, through all the identified semantic vectors and the slot value table, the target coding subsequence in which the target slot value is located can be marked in the first sub-matrix with the head-tail position identifier, so that the target slot value can be determined subsequently, completing slot recognition for discontinuous words and improving the accuracy of slot recognition and the fluency of the voice interaction process.
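Reading the THW marks out of the first sub-matrix can be sketched as follows; the label id is an assumption, and the (5, 2) position matches the fig. 5 example:

```python
import numpy as np

THW = 1  # assumed label id for the Tail-Head-Word identifier

def find_target_subsequences(first_sub_matrix):
    """A THW mark at row i, column j of the lower triangle means a target
    coding subsequence that starts at position j and ends at position i."""
    rows, cols = np.nonzero(first_sub_matrix == THW)
    return [(j, i) for i, j in zip(rows, cols)]  # (start, end), inclusive

# "unlock the refrigerator": [CLS] handle ice box unlock lock [SEP]
first_sub_matrix = np.zeros((7, 7), dtype=int)
first_sub_matrix[5, 2] = THW  # span from "ice" (2) to "lock" (5)
print(find_target_subsequences(first_sub_matrix))  # [(2, 5)]
```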
Referring to fig. 12, in some embodiments, step 0332 includes:
03321: marking adjacent relation identifiers of codes of the target coding subsequences according to the target slot position value in the second submatrix;
03322: splicing codes with adjacent relations according to the adjacent relation identifiers to obtain target slot position values so as to obtain a result of carrying out slot position identification on the voice request.
In some embodiments, the processing module 103 is configured to mark the adjacency identifiers of the codes of the target coding subsequence in the second sub-matrix according to the target slot value, and splice the codes having an adjacency relation according to the adjacency identifiers to obtain the target slot value, so as to obtain the slot recognition result of the voice request.
In some embodiments, the processor is configured to mark adjacent relation identifiers of each code of the target code subsequence in the second submatrix according to the target slot position value, and splice codes having adjacent relations according to the adjacent relation identifiers, so as to obtain the target slot position value, so as to obtain a result of carrying out slot position recognition on the voice request.
Specifically, in the second sub-matrix obtained after the coding sequence matrix is divided, the adjacency identifiers of the codes of the corresponding target coding subsequence can be marked according to the target slot value. Here, the position (k, l) of an adjacency identifier indicates whether an "adjacency" relation exists between the character in row k and the character in column l of the coding sequence matrix. When such an adjacency relation exists, the corresponding adjacency identifier can be assigned.
The adjacency identifier may include an NNW (Next-Neighboring-Word) identifier; the specific identifier used is not limited herein. The NNW identifier is ultimately mapped to a binary value, e.g. 0 or 1, where 1 may represent "adjacent" and 0 "non-adjacent". When the value of the adjacency identifier is 1, the related characters need to be connected in the slot recognition result. The assignment of the adjacency identifier has no direct relation to whether the corresponding characters are contiguous in the sentence of the voice request, i.e. two characters marked by the adjacency identifier as "adjacent" may be discontinuous in the user's voice request.
Further, codes having an adjacency relation can be spliced according to the adjacency identifiers. The specific operation is to splice the characters whose adjacency identifier values indicate that the two corresponding characters are adjacent, finally obtaining the slot recognition result. In the foregoing example, the slot recognition result of the voice request "unlock the refrigerator" is "refrigerator lock": the two groups of adjacent characters are "ice-box" and "box-lock", and the target slot value "refrigerator lock" is finally obtained by splicing, completing the slot recognition process.
Therefore, the adjacency relations among the codes of the target coding subsequence can be identified through the second sub-matrix, and the codes having an adjacency relation can be spliced to obtain the target slot value, completing the recognition of slots formed by discontinuous words and improving the slot recognition accuracy and the fluency of the voice interaction process.
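Following the NNW links to splice a discontinuous slot can be sketched as follows; the label id is an assumption, and the per-character English glosses ("ice", "box", "lock") stand in for the original characters, so the spliced result corresponds to the slot "refrigerator lock":

```python
import numpy as np

NNW = 1  # assumed label id for the Next-Neighboring-Word identifier

def splice_slot(tokens, second_sub_matrix, start, end):
    """Follow NNW links in the upper triangle to splice the characters of a
    (possibly discontinuous) slot between positions start and end."""
    chars, k = [tokens[start]], start
    while k != end:
        links = [l for l in range(k + 1, end + 1)
                 if second_sub_matrix[k, l] == NNW]
        if not links:          # broken chain: no adjacency mark to follow
            break
        k = links[0]
        chars.append(tokens[k])
    return "-".join(chars)

tokens = ["[CLS]", "handle", "ice", "box", "unlock", "lock", "[SEP]"]
second = np.zeros((7, 7), dtype=int)
second[2, 3] = NNW   # "ice" -> "box"
second[3, 5] = NNW   # "box" -> "lock", skipping the character at position 4
print(splice_slot(tokens, second, 2, 5))  # ice-box-lock, i.e. "refrigerator lock"
```

The skip from position 3 to position 5 is exactly the discontinuous case: the spliced slot omits the character in between.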
Referring to fig. 13, in some embodiments, step 05 includes:
051: determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
052: and selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot position identification and the target parameter, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
In some embodiments, the interface prediction module 105 is configured to determine the target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type, select the predicted application program interface to execute application program interface parameter filling according to the slot recognition result and the target parameters, and output the execution result and issue it to the vehicle to complete the voice interaction.
In some embodiments, the processor is configured to determine a target parameter of slot filling according to the voice request, the result of slot recognition, the predicted application program interface, and the predicted application program interface type, select the predicted application program interface to execute the application program interface parameter filling according to the result of slot recognition and the target parameter, and output the execution result to the vehicle to complete voice interaction.
Specifically, the target parameters for slot filling may be determined according to the user voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type. The target parameter is the slot name corresponding to the slot recognition result. Finally, according to the slot recognition result and the target parameters, the predicted application program interface is selected to execute parameter filling, and the output execution result is issued to the vehicle to complete the voice interaction.
For example, for a user voice request of "unlock a refrigerator", the recognized slot value includes "refrigerator lock", after filling the "refrigerator lock" in the above slot recognition result into the corresponding application program interface, the output execution result, that is, the corresponding "unlock refrigerator lock" control instruction is issued to the vehicle, and the vehicle can execute the action of unlocking the refrigerator lock, so as to finally complete the voice interaction process.
For another example, for the user voice request "close music and navigation page", the identified slot values include "music page" and "navigation page", after the "music page" and "navigation page" in the above slot identification result are filled into the corresponding application program interfaces, the output execution results, that is, the corresponding "close music page" control instruction and "close navigation page" control instruction, are issued to the vehicle, and the vehicle can execute the action of closing the corresponding page of the vehicle-mounted system, so as to finally complete the voice interaction process.
Therefore, according to the embodiment of the application, the predicted application program interface is selected to execute the application program interface parameter filling according to the result of the slot position identification and the target parameter, the execution result is directly output and issued to the vehicle to complete the voice interaction, the delay of the vehicle-mounted system can be reduced, and the response speed to the user instruction is improved.
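The parameter-filling step can be sketched as follows; the interface table, action names and parameter names here are hypothetical, since the text does not specify the real interface definitions:

```python
# Hypothetical interface table: maps a predicted action to an application
# program interface and the parameter to fill from the slot result.
API_TABLE = {
    "unlock": {"api": "device_control_api", "param": "device"},
    "close_page": {"api": "page_control_api", "param": "page"},
}

def fill_and_issue(predicted_action, slot_value):
    """AF step: select the predicted interface, fill its parameter with the
    slot recognition result, and return the instruction issued to the car."""
    entry = API_TABLE[predicted_action]
    return {"api": entry["api"], "params": {entry["param"]: slot_value}}

print(fill_and_issue("unlock", "refrigerator lock"))
# {'api': 'device_control_api', 'params': {'device': 'refrigerator lock'}}
```

A request with several slot values (such as closing two pages) would simply produce one filled instruction per value.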
The following is an additional description, through a complete scene example, of recognizing a slot formed by discontinuous characters in a voice request. Fig. 14 shows the coding sequence matrix for the voice request "close the music and navigation pages". In the first sub-matrix, i.e. the lower triangular matrix, the semantic vector (9, 3) marks the target coding subsequence in which the target slot value "music page" is located. In the second sub-matrix, i.e. the upper triangular matrix, the adjacency identifiers at (3, 4), (4, 8) and (8, 9) all have the value 1, giving three groups of adjacent characters spanning "music" and "page". Splicing them finally yields the slot value "music page" as the final slot recognition result.
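This fig. 14 walk-through can be reproduced end to end under the same assumptions; the indices (9, 3), (3, 4), (4, 8) and (8, 9) come from the text, and the original Chinese characters are used so the spliced slot reads 音乐页面 ("music page"):

```python
import numpy as np

# Characters of "close the music and navigation pages" (关闭音乐和导航页面),
# wrapped in [CLS]/[SEP]; index layout assumed from the fig. 14 example.
tokens = ["[CLS]", "关", "闭", "音", "乐", "和", "导", "航", "页", "面", "[SEP]"]
grid = np.zeros((11, 11), dtype=int)
grid[9, 3] = 1                       # lower triangle: THW, slot spans 3..9
for i, j in [(3, 4), (4, 8), (8, 9)]:
    grid[i, j] = 1                   # upper triangle: NNW adjacency links

start, end = 3, 9
chars, k = [tokens[start]], start
while k != end:                      # follow the adjacency chain
    k = next(l for l in range(k + 1, end + 1) if grid[k, l] == 1)
    chars.append(tokens[k])
print("".join(chars))  # 音乐页面 — the slot value "music page"
```

Note how the link (4, 8) jumps over "和导航" ("and navigation"), which is exactly what makes the slot discontinuous in the original sentence.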
The computer-readable storage medium of an embodiment of the present application stores a computer program that, when executed by one or more processors, implements the method described above.
In the description of the present specification, reference to the terms "foregoing," "specifically," "particularly," "further," and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art to which the present application pertains.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that changes, modifications, substitutions and variations may be made to the above embodiments by those of ordinary skill in the art within the scope of the application.

Claims (11)

1. A method of voice interaction, comprising:
receiving a voice request forwarded by a vehicle;
performing coding processing on the voice request according to a preset model to obtain a coding sequence matrix for carrying out slot recognition;
dividing the coding sequence matrix;
performing slot recognition on the voice request according to the coding sequence matrix after the division processing;
performing application program interface prediction on the voice request;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position recognition result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
2. The voice interaction method according to claim 1, wherein the encoding the voice request according to a preset model to obtain a coding sequence matrix for performing slot recognition comprises:
performing text sequence coding processing on the voice request to obtain a first coding vector;
inputting the first coding vector into a pre-training model to obtain an output matrix corresponding to each code in the text sequence;
and obtaining the coding sequence matrix according to the output matrix and the preset model.
3. The method according to claim 2, wherein the obtaining the coding sequence matrix according to the output matrix and the preset model includes:
extracting a head matrix corresponding to a first code and a tail matrix corresponding to a last code in the output matrix;
and carrying out coding processing on the head matrix and the tail matrix according to the preset model so as to obtain the coding sequence matrix.
4. The voice interaction method according to claim 1, wherein the dividing the coding sequence matrix comprises:
dividing the coding sequence matrix according to diagonal lines of the coding sequence matrix to obtain a first sub-matrix and a second sub-matrix;
the performing slot recognition on the voice request according to the coding sequence matrix after the division processing comprises:
and carrying out slot recognition on the voice request according to the first sub-matrix and the second sub-matrix.
5. The voice interaction method of claim 4, wherein the performing slot recognition on the voice request according to the first sub-matrix and the second sub-matrix comprises:
determining a target coding subsequence for slot identification according to the first submatrix;
and carrying out slot recognition on the voice request according to the target coding subsequence and the second submatrix.
6. The voice interaction method of claim 5, wherein the determining a target coding sub-sequence for slot recognition according to the first sub-matrix comprises:
identifying all semantic vectors in the voice request;
marking head and tail position identifiers of target slot values in the first submatrix according to a preset slot value table, wherein the target slot values are determined according to the corresponding relation between the semantic vector and the slot value table;
and determining the target coding subsequence where the target slot position value is located according to the head-tail position identification.
7. The voice interaction method according to claim 6, wherein the performing slot recognition on the voice request according to the target coding sub-sequence and the second sub-matrix includes:
marking adjacent relation identifiers of codes of the target coding sub-sequence according to the target slot position value in the second sub-matrix;
splicing codes with the adjacent relation according to the adjacent relation identification to obtain the target slot position value so as to obtain a result of carrying out slot position identification on the voice request.
8. The voice interaction method according to claim 1, wherein the selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot recognition and the predicted application program interface, outputting the execution result and transmitting to a vehicle to complete voice interaction comprises:
determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the target parameter, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
9. A voice interaction device, comprising:
the receiving module is used for receiving the voice request forwarded by the vehicle;
the coding module is used for carrying out coding processing on the voice request according to a preset model so as to obtain a coding sequence matrix for carrying out slot identification;
the processing module is used for dividing the coding sequence matrix;
the slot position recognition module is used for performing slot recognition on the voice request according to the coding sequence matrix after the division processing;
the interface prediction module is used for performing application program interface prediction on the voice request;
and the parameter filling module is used for selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the predicted application program interface, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
10. A server comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the voice interaction method of any of claims 1-8.
11. A non-transitory computer readable storage medium containing a computer program, characterized in that the voice interaction method of any of claims 1-8 is implemented when the computer program is executed by one or more processors.
CN202310599110.3A 2023-05-24 2023-05-24 Voice interaction method, voice interaction device, server and computer readable storage medium Pending CN116665667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310599110.3A CN116665667A (en) 2023-05-24 2023-05-24 Voice interaction method, voice interaction device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310599110.3A CN116665667A (en) 2023-05-24 2023-05-24 Voice interaction method, voice interaction device, server and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116665667A true CN116665667A (en) 2023-08-29

Family

ID=87716504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310599110.3A Pending CN116665667A (en) 2023-05-24 2023-05-24 Voice interaction method, voice interaction device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116665667A (en)

Similar Documents

Publication Publication Date Title
CN110196894B (en) Language model training method and language model prediction method
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN115083413B (en) Voice interaction method, server and storage medium
CN115064166B (en) Vehicle voice interaction method, server and storage medium
CN110309170B (en) Complex intention recognition method in task-based multi-turn conversation
CN115064167B (en) Voice interaction method, server and storage medium
CN110263218A (en) Video presentation document creation method, device, equipment and medium
CN113821616A (en) Domain-adaptive slot filling method, device, equipment and storage medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN116092494B (en) Voice interaction method, server and computer readable storage medium
CN112580368B (en) Method, device, equipment and storage medium for identifying intention sequence of conversation text
CN112395880B (en) Error correction method and device for structured triples, computer equipment and storage medium
CN115294964B (en) Speech recognition method, server, speech recognition system, and readable storage medium
CN110210035B (en) Sequence labeling method and device and training method of sequence labeling model
CN116665667A (en) Voice interaction method, voice interaction device, server and computer readable storage medium
CN115906855A (en) Word information fused Chinese address named entity recognition method and device
CN116304014A (en) Method for training entity type recognition model, entity type recognition method and device
CN116092495B (en) Voice interaction method, server and computer readable storage medium
CN116110397B (en) Voice interaction method, server and computer readable storage medium
CN116092493B (en) Voice interaction method, server and computer readable storage medium
CN112966520B (en) Natural language generation method and device
CN115329755B (en) Entity link model processing method and device and entity link processing method and device
CN112966119B (en) Information acquisition method, equipment and medium
CN114090727A (en) Model distillation method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination