CN116110397B - Voice interaction method, server and computer readable storage medium - Google Patents

Voice interaction method, server and computer readable storage medium

Info

Publication number
CN116110397B
CN116110397B
Authority
CN
China
Prior art keywords
voice request
feature vector
slot
voice
application program
Prior art date
Legal status
Active
Application number
CN202310373009.6A
Other languages
Chinese (zh)
Other versions
CN116110397A (en)
Inventor
丁鹏傑
宁洪珂
赵群
樊骏锋
郭梦雪
Current Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd filed Critical Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202310373009.6A priority Critical patent/CN116110397B/en
Publication of CN116110397A publication Critical patent/CN116110397A/en
Application granted granted Critical
Publication of CN116110397B publication Critical patent/CN116110397B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice interaction method comprising the following steps: receiving a voice request forwarded by a vehicle; acquiring additional information for the voice request from a preset resource library; performing slot recognition on the voice request according to the voice request and the additional information; performing application program interface prediction on the voice request; and, according to the slot recognition result and the predicted application program interface, selecting the predicted application program interface, filling its parameters, outputting the execution result and issuing it to the vehicle to complete the voice interaction. Because the slot recognition model introduces the additional information about the voice request stored in the resource library, the method effectively improves the accuracy of slot recognition and thereby the user's voice interaction experience.

Description

Voice interaction method, server and computer readable storage medium
Technical Field
The present application relates to the field of vehicle-mounted voice technologies, and in particular, to a voice interaction method, a server, and a computer readable storage medium.
Background
The current dialogue system uses a natural language understanding module to parse the user's utterance into semantic labels that a machine can understand, maintains an internal dialogue state as a compact representation of the whole dialogue history through a dialogue state tracking module, uses a dialogue policy module to select a proper dialogue action according to that state, and finally converts the dialogue action into a natural language reply through a natural language generation module. However, a user voice request may contain complex named entities; in the related art the recognition result may then be wrong, the desired slot result cannot be extracted, and it is difficult to meet the vehicle control requirements of the vehicle-mounted scenario.
Disclosure of Invention
The application provides a voice interaction method, a server and a computer readable storage medium.
The voice interaction method of the application comprises the following steps:
receiving a voice request forwarded by a vehicle;
acquiring additional information of the voice request according to a preset resource library;
performing slot recognition on the voice request according to the voice request and the additional information;
performing application program interface prediction on the voice request;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position recognition result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
Therefore, the voice interaction method of the application can acquire additional information for the voice request from a preset resource library and perform slot recognition on the voice request accordingly. The predicted application program interface can then be filled with parameters according to the slot recognition result, and finally the execution result is output and issued to the vehicle to complete the voice interaction. Because the slot recognition model introduces the additional information about the voice request stored in the resource library during slot recognition, the accuracy of slot recognition is effectively improved, improving the user's voice interaction experience.
The resource library stores keywords meeting preset conditions, and the obtaining additional information of the voice request according to the preset resource library comprises the following steps:
matching the voice request with keywords in the resource library;
and determining the type information of the sub-fragments in the voice request according to the matching result so as to acquire the additional information.
Therefore, the voice request can be matched with the keywords stored in the resource library, and the additional information of the voice request is acquired according to the type information of the successfully matched sub-segments, so that the slot recognition can be carried out on the voice request of the user.
The determining the type information of the sub-fragment in the voice request according to the matching result to obtain the additional information includes:
determining type information of a first sub-segment in the voice request, wherein the first sub-segment is in a matching relation with keywords in the resource library, and the type information of the first sub-segment is determined according to the labeling type of the keywords;
preprocessing type information of a second sub-segment which does not form a matching relation with keywords in the resource library in the voice request;
and determining the additional information according to the type information of the first sub-segment and the type information of the second sub-segment.
Therefore, the voice request can be matched with the keywords stored in the resource library, and the additional information of the voice request is acquired according to the type information of the successfully matched sub-segments, so that the slot recognition can be carried out on the voice request of the user.
The method further comprises the steps of:
and determining an additional feature vector corresponding to the additional information according to the type information of the sub-segment.
Therefore, the additional feature vector corresponding to the additional information can be determined according to the type information of the sub-fragment in the user voice request, so that the additional feature vector can be utilized in the subsequent slot identification process, a more accurate slot identification result is obtained, and the interaction experience of the user is improved.
The step of performing slot recognition on the voice request according to the voice request and the additional information includes:
and carrying out slot recognition on the voice request according to the original feature vector corresponding to the original information of the voice request and the additional feature vector corresponding to the additional information.
Thus, the original feature vector of the voice request can be fused with the additional feature vector to perform slot recognition on the voice request. When the sub-segment successfully matched with the keyword stored in the resource library exists in the voice request of the user, the accuracy of the slot identification can be improved.
The step of performing slot recognition on the voice request according to the original feature vector corresponding to the original information of the voice request and the additional feature vector corresponding to the additional information includes:
performing text sequence coding on the voice request to obtain a first feature vector in the original feature vector;
determining a position vector of the voice request according to the character sequence of the voice request to obtain a second feature vector in the original feature vector;
and carrying out slot recognition on the voice request according to the first feature vector, the second feature vector and the additional feature vector.
Therefore, word embedding and encoding can be performed on the voice request to obtain its first and second feature vectors, and the first feature vector, the second feature vector and the additional feature vector can be fused to perform slot recognition on the voice request. When sub-segments of the user's voice request that match keywords stored in the resource library are identified, the accuracy of slot recognition is improved.
The performing slot recognition on the voice request according to the first feature vector, the second feature vector and the additional feature vector includes:
performing predetermined processing on the first feature vector, the second feature vector and the additional feature vector to obtain an input for performing the slot recognition;
and carrying out reasoning processing on the input by using a slot identification model to obtain a slot identification result, wherein the slot identification result comprises a slot value and a slot type corresponding to the slot value.
Therefore, the original feature vector of the user voice request and the additional feature vector obtained from external resource features can be preprocessed, the preprocessed result is used as the input of the slot recognition model, and the slot recognition result is finally obtained. The introduction of external resource features avoids misrecognition of certain special words in the voice request, and the accuracy of slot recognition is significantly improved.
And selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction, wherein the method comprises the following steps of:
determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the target parameter, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
Therefore, the method and the device can select the predicted application program interface to execute the application program interface parameter filling according to the result of the slot position identification and the target parameter, directly output the execution result and issue the execution result to the vehicle to complete the voice interaction, reduce the delay of the vehicle-mounted system and improve the response speed to the user instruction.
The server of the present application comprises a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, implements the method described above.
The computer readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the voice interaction method of any of the above embodiments.
Therefore, the storage medium of the application adopts the end-to-end architecture to reduce the delay of the vehicle-mounted system and improve the response speed to user commands; by integrating the slot recognition result of the user voice request with the additional features of the predicted application program interface, it effectively improves the precision of the application program interface parameter filling task and meets the vehicle control requirements.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a dialogue system in the related art;
FIG. 2 is a schematic diagram of the architecture of the dialog system of the end-to-end architecture of the present application;
FIG. 3 is a flow chart of a voice interaction method of the present application;
FIG. 4 is a second flowchart of the voice interaction method of the present application;
FIG. 5 is a third flowchart of the voice interaction method of the present application;
FIG. 6 is a fourth flowchart of the voice interaction method of the present application;
FIG. 7 is a fifth flowchart of the voice interaction method of the present application;
FIG. 8 is a sixth flowchart of the voice interaction method of the present application;
FIG. 9 is a seventh flowchart of the voice interaction method of the present application;
FIG. 10 is a schematic diagram of a slot filling model of the voice interaction method of the present application;
FIG. 11 is an eighth flowchart of the voice interaction method of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the embodiments of the present application and are not to be construed as limiting the embodiments of the present application.
Referring to fig. 1, the conventional vehicle-mounted voice architecture is based on a conventional modularized strategy: the entire dialogue flow, such as natural language understanding, state tracking, dialogue policy and natural language generation, is divided among cooperating components. These components are either built mainly by hand from rules or generated by training models on supervised datasets. Training each component requires a large amount of annotated data, which tends to be expensive and limits the scalability of the system. Meanwhile, the traditional vehicle-mounted voice system depends on a large number of rules and business logic to ensure its accuracy and stability, which further limits the scale and functionality of the system.
From the whole processing link of the dialogue, the traditional vehicle-mounted voice architecture takes user input, and needs to perform natural language understanding, namely domain classification, intention recognition and slot recognition, then select and execute an application program interface (Application Programming Interface, API) meeting the user input requirement in the dialogue management module in combination with the dialogue state and dialogue strategy, and return system output interacting with the user through the natural language generation module.
In view of this, referring to fig. 2, the end-to-end dialogue system of the present application includes three core algorithm modules: the slot recognition module recognizes entities in the voice request input by the user; the action prediction (Action Prediction, AP) module predicts the application program interface that corresponds to the user input and realizes the user's current goal; and the parameter filling (AF) module identifies which entities in the user input correspond to which parameters of the application program interface obtained in the previous step.
The slot recognition module acquires the entities that the application program interface call will need, the action prediction module determines which application program interface should subsequently be called to realize the user's voice input, and the parameter filling module selects which entities are used as the parameters with which the application program interface is executed.
However, slot recognition may be problematic when a user voice request contains complex named entities. For example, recognition of proper nouns that have a special meaning and are composed of several entity words, or that read like instruction sentences, may not be accurate enough. Taking the music-playing business scenario as an example, when a song name hits the "play xx (singer)'s xx (song)" format, slot recognition may go wrong, as shown in Table 1:
TABLE 1
In the user request "play The Shepherd of Koktokay", the song name "The Shepherd of Koktokay" hits the "play xx (singer)'s xx (song)" format, and slot recognition may be incorrect, identifying "Koktokay" as the singer slot and "The Shepherd" as the song name slot. Similarly, in the voice request "play Listen to Mom's Words", "Mom" may be identified as the singer and "Words" as the song name. Such slot recognition errors prevent voice interaction from proceeding normally in the vehicle-mounted environment.
Based on the above problems, referring to fig. 3, the present application provides a voice interaction method. The voice interaction method comprises the following steps:
01: receiving a voice request forwarded by a vehicle;
02: acquiring additional information of a voice request according to a preset resource library;
03: performing slot recognition on the voice request according to the voice request and the additional information;
04: carrying out application program interface prediction on the voice request;
05: and selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot position identification and the predicted application program interface, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
The application also provides a server. The server includes a processor and a memory having a computer program stored thereon. The processor is used for receiving the voice request forwarded by the vehicle and acquiring additional information of the voice request according to a preset resource library; and carrying out slot recognition on the voice request according to the voice request and the additional information, carrying out application program interface prediction on the voice request, selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot recognition and the predicted application program interface, and outputting an execution result to be issued to the vehicle to complete voice interaction.
Firstly, a user voice request forwarded by the vehicle is received, and additional information for the voice request is acquired from the preset resource library. The preset resource library is a database independent of the vehicle-mounted system, and stores keywords meeting predetermined conditions. A predetermined condition may be, for example, that a phrase is a proper noun with a special meaning composed of several entity words or reading like an instruction sentence, such as the song names "The Shepherd of Koktokay" or "Listen to Mom's Words". When the user's voice request hits a keyword stored in the database, additional information for the voice request can be obtained. For example, for the voice request "play The Shepherd of Koktokay", the sub-segment "The Shepherd of Koktokay" hits an entry in the preset resource library, and the obtained additional information may be ""The Shepherd of Koktokay" can be identified as a song name", indicating that a special slot exists in the voice request.
After the additional information of the voice request is obtained, the voice request and the additional information can be combined to perform slot recognition on the user's voice request. In one example, the user makes the voice request "play Zhou Jielun's Qilixiang", which hits the sentence format "play xx (singer)'s xx (song)"; slot recognition yields the singer slot "Zhou Jielun" and the song name slot "Qilixiang".
In particular, in the music vertical domain, when a user's voice request hits the sentence format "play xx (singer)'s xx (song)", the slot recognition process must consult the additional information obtained for the request in addition to analyzing the request text itself. In one example, a user sends the voice request "play The Shepherd of Koktokay". If slot recognition is performed directly without reference to the additional information, the possible result is ["Koktokay" - singer] and ["The Shepherd" - song name (song)], a slot recognition error that affects the whole voice interaction process. With the additional information, "The Shepherd of Koktokay" as a whole can be identified as the song name, and the correct slot result is finally obtained: the singer slot is absent, and the result is ["The Shepherd of Koktokay" - song name (song)].
In order to solve the problem that the manpower cost and the data cost are too high because each vertical domain needs to be designed independently in the slot position identification, the slot position identification scheme adopts an end-to-end structure, does not distinguish the vertical domains, and does not need to train a model in the vertical domain.
Application program interface prediction for the voice request can be based on the entities obtained by slot recognition of the user's voice request. First, the Action Prediction (AP) module predicts, from the entity words contained in the slot recognition result, the application program interface (API) required by the voice request. For example, the API predicted for the user voice request "play song A" is application program interface 1 for playing music, while the API predicted for the user voice request "navigate to destination A" is application program interface 2 for navigation.
In addition, the parameter filling (AF) module can select entities to fill the parameters of the application program interface, and finally the execution result is output and issued to the vehicle to complete the voice interaction.
The end-to-end architecture of the application can simplify intermediate modules of the traditional dialogue system architecture, such as a natural language understanding module, a dialogue management module, a car machine instruction generation module, a natural language generation module and the like, reduce the call of a plurality of models with different vertical domains, reduce the delay of a vehicle-mounted system and improve the response speed to user instructions.
In summary, the voice interaction method of the application can acquire additional information for the voice request from a preset resource library and perform slot recognition on the voice request accordingly. The predicted application program interface can then be filled with parameters according to the slot recognition result, and finally the execution result is output and issued to the vehicle to complete the voice interaction. Because the slot recognition model introduces the additional information about the voice request stored in the resource library during slot recognition, the accuracy of slot recognition is effectively improved, improving the user's voice interaction experience.
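The overall five-step flow can be sketched in a few lines of Python. Everything below, including the function names, the toy resource library and the rule-based stand-ins for the learned slot recognition and action prediction modules, is an illustrative assumption rather than the implementation of the present application:

REPOSITORY = {"Listen to Mom's Words": "song"}  # preset resource library (toy)

def get_additional_info(text):
    # Step 02: match the request against keywords in the resource library.
    return {kw: kw_type for kw, kw_type in REPOSITORY.items() if kw in text}

def recognize_slots(text, additional_info):
    # Step 03: slot recognition; here the additional information alone
    # decides the slots (a toy stand-in for the slot recognition model).
    return [(kw, kw_type) for kw, kw_type in additional_info.items()]

def predict_api(text):
    # Step 04: action prediction; a toy rule stands in for the AP model.
    return "api_1_play_music" if text.startswith("play") else "api_2_navigate"

def fill_and_execute(api, slots):
    # Step 05: fill the predicted API's parameters and "execute" it.
    params = {slot_type: value for value, slot_type in slots}
    return f"{api}({params})"  # execution result issued back to the vehicle

request = "play Listen to Mom's Words"  # step 01: forwarded by the vehicle
info = get_additional_info(request)
print(fill_and_execute(predict_api(request), recognize_slots(request, info)))
# api_1_play_music({'song': "Listen to Mom's Words"})

A request such as "navigate to destination A" would follow the same path, with the toy rule selecting the navigation interface instead.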
Referring to fig. 4, the repository stores keywords meeting predetermined conditions, and step 02 includes:
021: matching the voice request with keywords in a resource library;
022: and determining the type information of the sub-fragments in the voice request according to the matching result so as to acquire the additional information.
The processor is used for matching the voice request with the keywords in the resource library, and determining the type information of the sub-fragments in the voice request according to the matching result so as to acquire the additional information.
Specifically, after receiving a user voice request forwarded by a vehicle, the voice assistant needs to search a keyword in a resource library to match with the voice request. Wherein, the resource library stores keywords meeting preset conditions. For example, in a business scenario where music is played, a user issues a voice request for playing a certain song by a certain singer. The stored keywords meeting the predetermined conditions can be found in the repository to obtain the singer name and song name.
Further, the voice request can be divided into a plurality of sub-segments according to whether the voice request is successfully matched with the keywords in the resource library, and additional information contained in the voice request can be obtained by distinguishing the sub-segments of the voice request, for example, proper nouns such as a name of a person, a place name, a song name and the like in the voice request can be matched with the keywords in the resource library.
Therefore, the voice request can be matched with the keywords stored in the resource library, and the additional information of the voice request is acquired according to the type information of the successfully matched sub-segments, so that the slot recognition can be carried out on the voice request of the user.
Referring to fig. 5, step 022 includes:
0221: the method comprises the steps of forming a first sub-segment of a matching relation with keywords in a resource library in a voice request, and determining type information of the first sub-segment according to the labeling type of the keywords;
0222: preprocessing the type information of a second sub-segment which does not form a matching relation with the keywords in the resource library in the voice request;
0223: and determining the additional information according to the type information of the first sub-segment and the type information of the second sub-segment.
The processor is used for determining type information of a first sub-segment in the voice request which forms a matching relation with keywords in the resource library, wherein the type information of the first sub-segment is determined according to the labeling type of the keywords, preprocessing type information of a second sub-segment in the voice request which does not form a matching relation with keywords in the resource library, and determining the additional information according to the type information of the first sub-segment and the type information of the second sub-segment.
Specifically, the voice request is matched against the keywords in the resource library. The matching result divides the user voice request into sub-segments of different types: first sub-segments, which form a matching relation with keywords in the resource library, and second sub-segments, which do not.
For the first sub-segment which can form a matching relation with the keywords in the resource library in the user voice request, the type information of the first sub-segment can be determined according to the labeling type of the keywords.
For example, in one case a user issues the voice request "play Listen to Mom's Words". Since "Listen to Mom's Words" is a keyword already present in the resource library, "Listen to Mom's Words" is determined to be the first sub-segment, and its type is determined to be "song name" according to the labeling type of the keyword.
And preprocessing the type information of the second sub-segment which cannot form a matching relation with the keywords in the resource library in the voice request of the user.
In the above example, in the user voice request "play Listen to Mom's Words", "play" is determined to be the second sub-segment, and its type information needs to be preprocessed.
Finally, the additional information of the voice request can be determined from the type information of the first sub-segment and the second sub-segment. For example, the additional information of the voice request "play Listen to Mom's Words" may be: "Listen to Mom's Words" is a song name.
Therefore, the voice request can be matched with the keywords stored in the resource library, and the additional information of the voice request is acquired according to the type information of the successfully matched sub-segments, so that the slot recognition can be carried out on the voice request of the user.
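As a concrete illustration of this matching step, the following sketch splits a request into typed sub-segments by greedy longest-keyword matching against the resource library; the repository contents, the longest-match strategy and all names are assumptions made for illustration only:

REPOSITORY = {"听妈妈的话": "song"}  # the Chinese text of "Listen to Mom's Words"

def split_sub_segments(text, repository):
    # Repository hits become first sub-segments carrying the keyword's
    # labeled type; all other characters accumulate into second
    # sub-segments typed None, to be preprocessed later.
    segments, cursor = [], 0
    while cursor < len(text):
        for kw, kw_type in sorted(repository.items(), key=lambda i: -len(i[0])):
            if text.startswith(kw, cursor):
                segments.append((kw, kw_type))         # first sub-segment
                cursor += len(kw)
                break
        else:
            if segments and segments[-1][1] is None:
                segments[-1] = (segments[-1][0] + text[cursor], None)
            else:
                segments.append((text[cursor], None))  # second sub-segment
            cursor += 1
    return segments

print(split_sub_segments("播放听妈妈的话", REPOSITORY))
# [('播放', None), ('听妈妈的话', 'song')]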
Referring to fig. 6, the method further includes:
06: and determining an additional feature vector corresponding to the additional information according to the type information of the sub-segment.
The processor is used for determining an additional feature vector corresponding to the additional information according to the type information of the sub-segment.
Specifically, when slot recognition is performed on a voice request issued by a user, there may be cases where the slot information cannot be extracted from the request text alone. Additional information can be used to describe such cases. The additional feature vector corresponding to the additional information can be determined from the type information of each sub-segment in the user's voice request, and serves as an identification of the additional information of the voice request.
When the user's voice request contains a sub-segment that successfully matches a keyword stored in the resource library, an additional feature value can be set for each word in the voice request, together constituting an additional feature vector that distinguishes the first sub-segments from the second sub-segments in the sentence. For example, the additional feature value of each word in a first sub-segment that hits a keyword in the resource library may be assigned 1, while that of each word in a second sub-segment that misses may be assigned 0. The choice of specific values in the assignment of the additional feature vector is not limited here.
In one example, the user makes the voice request "navigate to Sister's Pickles", where "Sister" in the store name "Sister's Pickles" may be identified on its own as a person name, with the result that "Sister's Pickles" is not identified as a place name or store name and the user's navigation intent cannot be executed. If the keyword "Sister's Pickles" exists in the resource library, the sub-segment "Sister's Pickles" in the user's voice request can be identified as a store name according to its matching relation with the keyword, and the additional feature value of each word in that sub-segment is assigned 1; the sub-segment "navigate to", which misses the keywords in the resource library, has the additional feature value of each word assigned 0.
And determining an additional feature vector corresponding to the additional information according to the type information of the sub-segment, so that the additional feature vector can be utilized in the subsequent slot identification process to obtain a more accurate slot identification result.
Therefore, the additional feature vector corresponding to the additional information can be determined according to the type information of the sub-fragment in the user voice request, so that the additional feature vector can be utilized in the subsequent slot identification process, a more accurate slot identification result is obtained, and the interaction experience of the user is improved.
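A minimal sketch of this assignment, under the assumption that each character receives a 0/1 additional feature value and that the leading position is reserved for the [CLS] marker (the concrete values, as noted above, are not limited):

def additional_feature_vector(sub_segments):
    values = ["[CLS]"]                       # position reserved for [CLS]
    for segment, seg_type in sub_segments:
        hit = 1 if seg_type is not None else 0
        values.extend([hit] * len(segment))  # one value per character
    return values

# "播放" / "听妈妈的话" is the Chinese text of "play" / "Listen to Mom's Words".
print(additional_feature_vector([("播放", None), ("听妈妈的话", "song")]))
# ['[CLS]', 0, 0, 1, 1, 1, 1, 1], matching the example in the text below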
Referring to fig. 7, step 03 includes:
031: and carrying out slot recognition on the voice request according to the original feature vector corresponding to the original information of the voice request and the additional feature vector corresponding to the additional information.
The processor is used for carrying out slot recognition on the voice request according to the original feature vector corresponding to the original information of the voice request and the additional feature vector corresponding to the additional information.
Specifically, after a user sends a voice request, the original feature vector corresponding to the original information is obtained from the pretrained text encoding of the request together with original information such as the order of each word in the voice request.
When there is a sub-segment in the user's voice request that successfully matches the keyword stored in the repository, an additional feature value may be set for each word in the voice request, constituting an additional feature vector to distinguish the first sub-segment from the second sub-segment in the sentence by the additional feature vector.
In one example, a user issues the voice request "play Listen to Mom's Words", where "Listen to Mom's Words" hits a keyword stored in the resource library. An additional feature value is then set for each word in the voice request: for example, the value for each word in the first sub-segment "Listen to Mom's Words", which hits the keyword in the resource library, is 1, and the value for each word in the second sub-segment "play", which misses, is 0. The additional feature vector of the voice request "play Listen to Mom's Words" is thus "[CLS], 0, 0, 1, 1, 1, 1, 1", and the corresponding additional information may be that "Listen to Mom's Words" in the sentence successfully matches a keyword stored in the resource library.
In the voice interaction process, after the additional information of the voice request is obtained from the preset resource library, slot recognition can be performed on the voice request according to the original feature vector corresponding to its original information and the additional feature vector corresponding to the additional information. In the above example, the additional feature vector of the voice request "play Listen to Mom's Words" is "[CLS], 0, 0, 1, 1, 1, 1, 1", indicating that "Listen to Mom's Words" in the sentence successfully matches a keyword stored in the resource library. The slot recognition result is then ["Listen to Mom's Words" - song name (song)], rather than "Mom" and "Words" being recognized as separate slots, which improves the accuracy of slot recognition during voice interaction.
Thus, the original feature vector of the voice request can be fused with the additional feature vector to perform slot recognition on the voice request. When the sub-segment successfully matched with the keyword stored in the resource library exists in the voice request of the user, the accuracy of the slot identification can be improved.
Referring to fig. 8, step 031 includes:
0311: performing text sequence coding on the voice request to obtain a first feature vector in the original feature vector;
0312: determining a position vector of the voice request according to the character sequence of the voice request to obtain a second feature vector in the original feature vector;
0313: and carrying out slot recognition on the voice request according to the first feature vector, the second feature vector and the additional feature vector.
The processor is used for carrying out text sequence coding on the voice request to obtain a first feature vector in the original feature vector, determining a position vector of the voice request according to the character sequence of the voice request to obtain a second feature vector in the original feature vector, and carrying out slot recognition on the voice request according to the first feature vector, the second feature vector and the additional feature vector.
Specifically, the user voice request, concatenated with any adjacent requests, can be text-sequence encoded, i.e., a word embedding matrix is used to obtain the first feature vector in the original feature vector. For example, the first feature vector corresponding to the voice request "play Listen to Mom's Words" is "[CLS] play Listen to Mom's Words". The [CLS] character is used for text classification and marks the start of the text. The first feature vector of several consecutive voice requests also includes a [SEP] identifier between requests to separate two sentences.
The position vector of the voice request, i.e., the second feature vector, can be determined from the character order of each word in the voice request. The value of the position vector is the sequence number of the current text character's position in the voice request: the classification identifier [CLS] preceding the sentence is numbered 0, the first character of the sentence is numbered 1, and the numbers of the remaining characters (including any pause identifiers [SEP]) increase sequentially. Finally the second feature vector is formed from the character sequence.
In one example, the voice request "play Listen to Mom's Words" corresponds to the first feature vector "[CLS] play Listen to Mom's Words" and the second feature vector "0, 1, 2, 3, 4, 5, 6, 7".
In particular, across several consecutive voice requests, the numbering of the pause identifier [SEP] between two requests follows the same sequentially increasing principle. For example, the voice requests "Play another one. Play a song by Zhou Jielun." have the first feature vector "[CLS] Play another one [SEP] Play a song by Zhou Jielun" and the second feature vector "0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13".
After the first and second feature vectors of the user voice request are obtained, slot recognition can be performed on the voice request according to the first feature vector, the second feature vector and the additional feature vector. In the above example, the slot recognition result of the voice request "play Listen to Mom's Words" is ["Listen to Mom's Words" - song name (song)].
Therefore, word embedding and encoding can be performed on the voice request to obtain its first and second feature vectors, and the first feature vector, the second feature vector and the additional feature vector can be fused to perform slot recognition on the voice request. When sub-segments of the user's voice request that match keywords stored in the resource library are identified, the accuracy of slot recognition is improved.
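The construction of the two original feature sequences can be sketched as follows, using character-level tokens to mirror the Chinese examples above; the function name and token handling are illustrative assumptions:

def encode_requests(requests):
    # First feature sequence: [CLS], then the characters of each request,
    # with a [SEP] pause identifier between consecutive requests.
    tokens = ["[CLS]"]
    for i, request in enumerate(requests):
        if i > 0:
            tokens.append("[SEP]")
        tokens.extend(list(request))         # one token per character
    # Second feature sequence: position numbers increasing from 0 at [CLS].
    positions = list(range(len(tokens)))
    return tokens, positions

# The Chinese text of "Play another one." and "Play a song by Zhou Jielun."
tokens, positions = encode_requests(["再来一首", "播放周杰伦的歌"])
print(tokens)     # ['[CLS]', '再', '来', '一', '首', '[SEP]', '播', ...]
print(positions)  # [0, 1, 2, 3, 4, 5, 6, ...]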
Referring to fig. 9, step 0313 includes:
03131: performing predetermined processing on the first feature vector, the second feature vector and the additional feature vector to obtain an input for performing slot recognition;
03132: and carrying out reasoning processing on the input by using the slot identification model to obtain a slot identification result, wherein the slot identification result comprises a slot value and a slot type corresponding to the slot value.
The processor is used for carrying out preset processing on the first feature vector, the second feature vector and the additional feature vector to obtain input used for carrying out slot identification, carrying out reasoning processing on the input by utilizing a slot identification model to obtain a slot identification result, wherein the slot identification result comprises a slot value and a slot type corresponding to the slot value.
Specifically, the first feature vector, the second feature vector and the additional feature vector of the user voice request are first preprocessed. The preprocessing can use a BERT-style embedding scheme to map the three vectors into a common embedding space and sum them; the summed result serves as the overall text feature of the user voice request and as the input for slot recognition, on which the model is trained and performs inference to obtain the slot recognition result. The slot recognition result comprises a slot value and the slot type corresponding to the slot value. The slot recognition step may use a linear-chain conditional random field (Linear-CRF) model; the specific model is chosen according to the accuracy requirements of slot recognition and is not limited here.
In one example, the first, second and additional feature vectors of the user voice request "play Listen to Mom's Words", as shown in FIG. 10, enter the slot recognition model through the unified vector embedding for slot recognition, yielding the slot recognition result ["Listen to Mom's Words" - song name (song)], comprising the slot value "Listen to Mom's Words" and the slot type corresponding to that value, namely "song name (song)".
The slot recognition model of the application not only utilizes the original text features of the user voice request, namely the first feature vector and the second feature vector, but also introduces external resource features, namely additional feature vectors obtained based on the matching degree of the user voice request text and keywords in a resource library. Compared with the original model, the accuracy of the slot identification of the model utilizing the fusion of the external resource features can be remarkably improved.
Therefore, the original feature vector of the user voice request and the additional feature vector obtained from external resource features can be preprocessed, the preprocessed result is used as the input of the slot recognition model, and the slot recognition result is finally obtained. The introduction of external resource features avoids misrecognition of certain special words in the voice request, and the accuracy of slot recognition is significantly improved.
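A minimal sketch of this fusion, assuming PyTorch, toy dimensions and a small tag set; a linear layer with argmax decoding stands in for the Linear-CRF decoder named above, so this is an illustrative simplification rather than the model of the present application:

import torch
import torch.nn as nn

class SlotTagger(nn.Module):
    def __init__(self, vocab=5000, max_len=128, dim=64, num_tags=5):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)    # first feature vector
        self.pos = nn.Embedding(max_len, dim)  # second feature vector
        self.extra = nn.Embedding(2, dim)      # additional feature (0/1)
        self.out = nn.Linear(dim, num_tags)    # e.g. B-song, I-song, O, ...

    def forward(self, token_ids, position_ids, extra_ids):
        # BERT-style fusion: the three embeddings are summed per token.
        x = self.tok(token_ids) + self.pos(position_ids) + self.extra(extra_ids)
        return self.out(x)                     # per-token tag logits

model = SlotTagger()
n = 8  # [CLS] 播 放 听 妈 妈 的 话
token_ids = torch.randint(0, 5000, (1, n))     # stand-in for real token ids
position_ids = torch.arange(n).unsqueeze(0)    # 0, 1, ..., 7
extra_ids = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1]])  # [CLS] assumed 0
tags = model(token_ids, position_ids, extra_ids).argmax(-1)  # CRF in the patent
print(tags.shape)  # torch.Size([1, 8]), one tag per token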
Referring to fig. 11, step 05 includes:
051: determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
052: and selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot position identification and the target parameter, outputting an execution result and transmitting the execution result to the vehicle to complete voice interaction.
The processor is used for determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type, selecting the predicted application program interface to execute the application program interface parameter filling according to the slot recognition result and the target parameters, and outputting an execution result to be issued to the vehicle to complete voice interaction.
Specifically, the target parameter for slot filling can be determined from the user voice request, the slot recognition result, and the predicted application program interface and its interface type. The target parameter is the slot name corresponding to the slot recognition result. Finally, according to the slot recognition result and the target parameter, the predicted application program interface is selected and executed with the filled target parameter, and the output execution result is issued to the vehicle to complete the voice interaction.
For example, for the user voice request "play Listen to Mom's Words", the slot recognition result is ["Listen to Mom's Words" - song name (song)]. The parameters of application program interface 1 comprise the two parameters singer and song, and the corresponding interface type is a music-playing type; the target parameter from the slot recognition result to be filled into application program interface 1 is therefore the song name. After the song name from the slot recognition result is filled into the music-playing application program interface 1, the action of opening the music player and playing the corresponding song can be executed, completing the voice interaction.
For example, for a user voice request "navigate to Zhongguancun", the result of slot recognition: the parameters of the application program interface 2 include 2 parameters of a departure Place and a destination, the corresponding application program interface type is a navigation type, and further the target parameter which is required to be filled into the application program interface 2 in the result of the slot identification is judged to be the destination, so that the navigation task for navigating to the middle-Guanyu can be correspondingly executed after the middle-Guanyu in the result of the slot identification is filled into the navigation application program interface 2, and the voice interaction is completed.
Therefore, the method and the device can select the predicted application program interface to execute the application program interface parameter filling according to the result of the slot position identification and the target parameter, directly output the execution result and issue the execution result to the vehicle to complete the voice interaction, reduce the delay of the vehicle-mounted system and improve the response speed to the user instruction.
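The selection-and-filling step can be sketched with an assumed API registry; the parameter names, registry layout and execution stub are illustrative assumptions rather than the interfaces of the present application:

API_REGISTRY = {
    "api_1": {"type": "music_play", "params": ["singer", "song"]},
    "api_2": {"type": "navigation", "params": ["departure", "destination"]},
}

def fill_and_execute(api_name, slot_result, target_param):
    api = API_REGISTRY[api_name]
    arguments = {p: None for p in api["params"]}
    slot_value, _slot_type = slot_result
    arguments[target_param] = slot_value   # fill only the target parameter
    return f"execute {api['type']} with {arguments}"  # issued to the vehicle

# "play Listen to Mom's Words": the slot value fills the song parameter.
print(fill_and_execute("api_1", ("Listen to Mom's Words", "song"), "song"))
# "navigate to Zhongguancun": the slot value fills the destination parameter.
print(fill_and_execute("api_2", ("Zhongguancun", "place"), "destination"))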
The computer readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the method described above.
In the description of the present specification, reference to the terms "above," "specifically," "particularly," "further," and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method descriptions in flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. Further implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (8)

1. A method of voice interaction, comprising:
receiving a voice request forwarded by a vehicle;
matching the voice request with keywords in a preset resource library, wherein the resource library stores keywords meeting preset conditions;
determining type information of a first sub-segment in the voice request, wherein the first sub-segment is in a matching relation with keywords in the resource library, and the type information of the first sub-segment is determined according to the labeling type of the keywords;
preprocessing type information of a second sub-segment which does not form a matching relation with keywords in the resource library in the voice request;
determining additional information of the voice request according to the type information of the first sub-segment and the type information of the second sub-segment;
performing slot recognition on the voice request according to the voice request and the additional information;
performing application program interface prediction on the voice request;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position recognition result and the predicted application program interface, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
2. The voice interaction method of claim 1, wherein the method further comprises:
and determining an additional feature vector corresponding to the additional information according to the type information of the sub-fragment in the voice request.
3. The voice interaction method according to claim 2, wherein the performing slot recognition on the voice request according to the voice request and the additional information includes:
and carrying out slot recognition on the voice request according to the original feature vector corresponding to the original information of the voice request and the additional feature vector corresponding to the additional information.
4. The voice interaction method according to claim 3, wherein the performing slot recognition on the voice request according to the original feature vector corresponding to the original information of the voice request and the additional feature vector corresponding to the additional information includes:
performing text sequence coding on the voice request to obtain a first feature vector in the original feature vector;
determining a position vector of the voice request according to the character sequence of the voice request to obtain a second feature vector in the original feature vector;
and carrying out slot recognition on the voice request according to the first feature vector, the second feature vector and the additional feature vector.
5. The voice interaction method of claim 4, wherein the performing slot recognition on the voice request according to the first feature vector, the second feature vector and the additional feature vector comprises:
performing predetermined processing on the first feature vector, the second feature vector and the additional feature vector to obtain an input for performing the slot recognition;
and carrying out reasoning processing on the input by using a slot identification model to obtain a slot identification result, wherein the slot identification result comprises a slot value and a slot type corresponding to the slot value.
6. The voice interaction method according to claim 1, wherein the selecting the predicted application program interface to execute application program interface parameter filling according to the result of the slot recognition and the predicted application program interface, outputting the execution result and transmitting to a vehicle to complete voice interaction comprises:
determining target parameters of slot filling according to the voice request, the slot recognition result, the predicted application program interface and the predicted application program interface type;
and selecting the predicted application program interface to execute application program interface parameter filling according to the slot position identification result and the target parameter, outputting an execution result and transmitting the execution result to a vehicle to complete voice interaction.
7. A server comprising a processor and a memory, the memory having stored thereon a computer program which, when executed by the processor, implements the voice interaction method of any of claims 1-6.
8. A non-transitory computer readable storage medium containing a computer program, characterized in that the voice interaction method of any of claims 1-6 is implemented when the computer program is executed by one or more processors.
CN202310373009.6A 2023-04-07 2023-04-07 Voice interaction method, server and computer readable storage medium Active CN116110397B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310373009.6A CN116110397B (en) 2023-04-07 2023-04-07 Voice interaction method, server and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN116110397A CN116110397A (en) 2023-05-12
CN116110397B (en) 2023-08-25

Family

ID=86267591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310373009.6A Active CN116110397B (en) 2023-04-07 2023-04-07 Voice interaction method, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116110397B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004045900A (en) * 2002-07-12 2004-02-12 Toyota Central Res & Dev Lab Inc Voice interaction device and program
CN108133707A (en) * 2017-11-30 2018-06-08 百度在线网络技术(北京)有限公司 A kind of content share method and system
CN111061840A (en) * 2019-12-18 2020-04-24 腾讯音乐娱乐科技(深圳)有限公司 Data identification method and device and computer readable storage medium
CN112562658A (en) * 2020-12-04 2021-03-26 广州橙行智动汽车科技有限公司 Slot filling method and device
CN113326702A (en) * 2021-06-11 2021-08-31 北京猎户星空科技有限公司 Semantic recognition method and device, electronic equipment and storage medium
CN115064166A (en) * 2022-08-17 2022-09-16 广州小鹏汽车科技有限公司 Vehicle voice interaction method, server and storage medium

Also Published As

Publication number Publication date
CN116110397A (en) 2023-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant