CN113990300B - Voice interaction method, vehicle, server and computer-readable storage medium - Google Patents

Voice interaction method, vehicle, server and computer-readable storage medium

Info

Publication number
CN113990300B
CN113990300B (application CN202111606975.5A)
Authority
CN
China
Prior art keywords
feature
audio
text
confidence
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111606975.5A
Other languages
Chinese (zh)
Other versions
CN113990300A (en)
Inventor
韩传宇
易晖
翁志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd filed Critical Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202111606975.5A priority Critical patent/CN113990300B/en
Publication of CN113990300A publication Critical patent/CN113990300A/en
Application granted granted Critical
Publication of CN113990300B publication Critical patent/CN113990300B/en
Priority to PCT/CN2022/138595 priority patent/WO2023124960A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 50/00 - Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W 50/08 - Interaction between the driver and the control system
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 2540/00 - Input parameters relating to occupants
    • B60W 2540/21 - Voice

Abstract

The invention discloses a voice interaction method, a vehicle, a server and a storage medium. The voice interaction method comprises the following steps: receiving a user voice request forwarded by a vehicle; acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round; acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition; acquiring a first text feature of a current round and a second text feature of a previous round; and performing rejection processing according to the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic. In the voice interaction method, the vehicle, the server and the storage medium, the rejection processing is performed by combining the audio features, the confidence features and the text features of the current round and the previous round, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.

Description

Voice interaction method, vehicle, server and computer-readable storage medium
Technical Field
The present invention relates to voice technology, and in particular, to a voice interaction method, a vehicle, a server, and a computer-readable storage medium.
Background
In the related art, noisy speech input often occurs during voice interaction and causes false responses from the voice system. The voice system can reject such speech to improve its recognition rate. The error rate of this rejection directly affects whether the final instruction is correctly understood and executed, so reducing the rejection error rate has become an urgent problem to be solved.
Disclosure of Invention
The invention provides a voice interaction method, a vehicle, a server and a computer-readable storage medium.
The voice interaction method comprises the following steps: receiving a user voice request forwarded by a vehicle; acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round; acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition; acquiring a first text feature of a current round and a second text feature of a previous round; and performing rejection processing according to the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
In the voice interaction method, the rejection processing is performed by combining the audio features, confidence features and text features of the current round and the previous round, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
The voice interaction method comprises the following steps: combining the first audio feature and the second audio feature to obtain an audio combination feature; combining the first confidence feature and the second confidence feature to obtain a confidence combined feature; combining the first text feature and the second text feature to obtain a text combination feature; the rejecting process according to the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature and the second text feature includes: and performing rejection processing according to the audio combined feature, the confidence combined feature and the text combined feature.
In this way, the audio features of the context, the confidence features of the context, and the text features of the context may be combined to obtain the audio combined features, the confidence combined features, and the text combined features, respectively, so that the rejection processing can be performed according to the audio combined features, the confidence combined features, and the text combined features.
The acquiring of the first audio feature of the voice request of the user in the current round and the second audio feature of the voice request of the user in the previous round includes: generating a digital feature matrix according to the user voice request of the current round and the user voice request of the previous round; reducing the dimensionality of the digital feature matrix to obtain a reduced dimensionality feature matrix; processing the context relation in the dimension reduction feature matrix to obtain a feature matrix to be processed; reinforcing key features of the feature matrix to be processed to obtain output audio features, the output audio features including the first audio feature and the second audio feature.
Therefore, the audio features can be accurately obtained according to the voice requests of the users in the current round and the voice requests of the users in the previous round.
The obtaining of the first text feature of the current round and the second text feature of the previous round includes: encoding the text information of the current round and the text information of the previous round to obtain digital encoding information; extracting a deep feature matrix from the digitally encoded information, the deep feature matrix including the first textual feature and the second textual feature.
Therefore, the text features can be accurately obtained according to the text information of the current round and the text information of the previous round.
The voice interaction method comprises the following steps: concatenating the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature, and the second text feature to obtain a concatenation matrix; the rejecting process according to the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature and the second text feature includes: and performing rejection processing according to the splicing matrix.
Therefore, by splicing the features, rejection processing can be conveniently carried out according to the spliced feature matrix.
The rejection processing according to the splicing matrix includes: recognizing the speaking object according to the splicing matrix to obtain a speaking object category; and determining that the result is rejection when the speaking object category is a first preset category; wherein the speaking object categories include: speaking to the voice assistant, not speaking to the voice assistant, unable to judge, and no speaker; the first preset category includes: not speaking to the voice assistant, unable to judge, and no speaker.
In this way, the speaking object can be recognized, and whether to reject can be determined according to the speaking object category.
The rejection processing according to the splicing matrix includes: performing intention strength recognition according to the splicing matrix to obtain an intention strength category; and determining that the result is rejection when the intention strength category is a second preset category; the intention strength categories include: strong intent, weak intent, no intent, and unable to judge; the second preset category includes: no intent and unable to judge.
In this manner, intention strength recognition can be performed, and whether to reject can be determined according to the intention strength category.
The voice interaction method comprises the following steps: inputting the training set into a semantic rejection model for speaking object recognition and intention strength recognition to obtain a predicted speaking object category and a predicted intention strength category; calculating a first loss according to the predicted speaking object category and the real speaking object category marked in the training set, and calculating a second loss according to the predicted intention intensity category and the real intention intensity category marked in the training set; training a semantic rejection model according to the first loss and the second loss; and performing rejection processing by using the trained semantic rejection model, the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
Thus, after training, a semantic rejection model capable of recognizing the speaking object and recognizing the intention strength can be obtained.
The vehicle of the present invention includes a memory and a processor, the memory stores a computer program, and the processor implements the voice interaction method according to any one of the above embodiments when executing the computer program.
In the vehicle, the rejection processing is performed by combining the audio features, confidence features and text features of the current round and the previous round, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
The server of the present invention includes a memory and a processor, where the memory stores a computer program, and the processor implements the voice interaction method according to any one of the above embodiments when executing the computer program.
In the server, the rejection processing is performed by combining the audio features, confidence features and text features of the current round and the previous round, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
The computer-readable storage medium of the present invention stores thereon a computer program, which when executed by a processor implements the voice interaction method of any one of the above embodiments.
In the computer-readable storage medium, the audio features, confidence features and text features of the current round and the previous round are combined to perform rejection processing, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a first flow chart of the voice interaction method of the present invention;
FIG. 2 is a second flow chart of the voice interaction method of the present invention;
FIG. 3 is a schematic illustration of the vehicle of the present invention;
FIG. 4 is a schematic diagram of the speech system of the present invention;
FIG. 5 is a third flow chart of the voice interaction method of the present invention;
FIG. 6 is a fourth flow chart of the voice interaction method of the present invention;
FIG. 7 is a fifth flow chart of the voice interaction method of the present invention;
FIG. 8 is a sixth flow chart of the voice interaction method of the present invention;
FIG. 9 is a seventh flow chart of the voice interaction method of the present invention;
FIG. 10 is an eighth flow chart of the voice interaction method of the present invention;
FIG. 11 is a ninth flow chart of the voice interaction method of the present invention;
FIG. 12 is a schematic illustration of a vehicle of the present invention interfacing with a computer readable storage medium.
Description of the main element symbols:
a speech system 100, a vehicle 10, a server 20, a processor 110, a memory 120, a computer-readable storage medium 300.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention, and are not to be construed as limiting the invention.
In the related art, noisy speech input often occurs during voice interaction and causes false responses from the voice system. The voice system can reject such speech to improve its recognition rate. The error rate of this rejection directly affects whether the final instruction is correctly understood and executed, so reducing the rejection error rate has become an urgent problem to be solved.
Referring to fig. 1 and fig. 2, a voice interaction method according to an embodiment of the present invention includes:
012: receiving a user voice request forwarded by the vehicle 10;
014: acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round;
016: acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition;
018: acquiring a first text feature of a current round and a second text feature of a previous round;
022: and performing rejection processing according to the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
Referring to fig. 3 and 4, the voice interaction method according to the embodiment of the present invention may be applied to the vehicle 10 or the server 20 of the embodiment of the present invention. When the voice interaction method is applied to the vehicle 10, a microphone of the vehicle 10 may receive a user voice request, which is then forwarded to the processor 110 of the vehicle 10 for processing. When the voice interaction method is applied to the server 20, the microphone of the vehicle 10 may receive the user voice request, which is then forwarded to the processor 110 of the server 20 for processing, and the server 20 may return the processing result of the user voice request to the vehicle 10. The vehicle 10 and the server 20 form the voice system 100.
The user voice request (audio) may be, for example, an original audio file. Text information is output after the user voice request passes through Automatic Speech Recognition (ASR). The user voice requests include the user voice request of the previous round (pre_flat) and the user voice request of the current round (cur_flat).
The audio features may be features contained in the user voice request, including, for example, loudness, pitch, timbre, and the like. The audio features are closely related to the real intention of the user, so using them makes the optimization effect more pronounced.
The confidence feature can be a confidence measure produced during the ASR process; it directly influences whether the user voice request is correctly understood and executed, so it facilitates rejection processing and helps guarantee continuous-listening capability. For example, if the ASR result of a user voice request is "what is played" while the actual audio is entirely noise, the ASR output is a misrecognition and the confidence of the erroneous part is low:
[{"conf": 0.355, "end": 900, "pinyin": "bo fang", "start": 700, "word": "play"},
{"conf": 0.222, "end": 1050, "pinyin": "de", "start": 1000, "word": "of"},
{"conf": 0.486, "end": 1100, "pinyin": "shi", "start": 1050, "word": "is"},
{"conf": 0.619, "end": 1200, "pinyin": "shen me", "start": 1100, "word": "what"}]
Here conf is the confidence of each word; the confidence is lower for misrecognized words. The confidence features include the confidence feature of the previous round (pre_conf) and the confidence feature of the current round (cur_conf).
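As an illustrative sketch (not the patented implementation), the per-word conf values above could be collected into a fixed-length confidence feature vector for each round; the padding length of 16 values per round and the zero padding are assumptions made for illustration.

```python
# Hedged sketch: turn the per-word ASR confidences of one round into a
# fixed-length confidence feature (length 16 per round is an assumed value).

def confidence_feature(asr_words, max_len=16):
    """Collect the 'conf' value of each recognized word and pad to a fixed length."""
    confs = [w["conf"] for w in asr_words]
    return confs[:max_len] + [0.0] * max(0, max_len - len(confs))

cur_asr = [
    {"conf": 0.355, "word": "play"},
    {"conf": 0.222, "word": "of"},
    {"conf": 0.486, "word": "is"},
    {"conf": 0.619, "word": "what"},
]
cur_conf = confidence_feature(cur_asr)   # 1 x 16 vector for the current round
```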
The text feature may be a feature included in the text information, and Natural Language Understanding (NLU) may be performed on the text information before the text feature is obtained. The text features of the context (current and previous rounds) can effectively clarify the true intent of the user session. For example, if the previous round is "I want to listen to a song" and the current round is a singer's name, the request can be executed after recognition by playing a song by that singer; if the previous round is "who is your idol" and the current round is the same singer's name, the request may be rejected and receive no response. The text information includes the text information of the previous round (pre_q) and the text information of the current round (cur_q).
In the voice interaction method, the rejection processing is performed by combining the audio features, confidence features and text features of the current round and the previous round, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
Referring to fig. 2 and 5, the voice interaction method includes:
024: combining the first audio feature and the second audio feature to obtain an audio combination feature;
026: combining the first confidence feature and the second confidence feature to obtain a confidence combined feature;
028: combining the first text feature and the second text feature to obtain a text combination feature;
step 022 (performing rejection processing based on the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature, and the second text feature), comprising:
0222: and performing rejection processing according to the audio combination characteristic, the confidence combination characteristic and the text combination characteristic.
In this way, the audio features of the context, the confidence features of the context, and the text features of the context may be combined to obtain the audio combined feature, the confidence combined feature, and the text combined feature, respectively, so that the rejection processing can be performed according to these combined features. When the audio features of the context are combined, the two rounds may be separated by a separator ([SEP]); similarly, separators may be used when the confidence features of the context are combined and when the text features of the context are combined. The value 0.0 may be used as the separator.
The context features can be generated from the raw context inputs (user voice requests or text information) and then combined and separated; alternatively, the raw context inputs can first be combined and separated, and the context features then generated from them. It should be noted that confidence feature acquisition does not involve generating features from raw inputs, so the context confidence features are simply combined and separated directly.
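As a hedged illustration of how the previous-round and current-round confidence features might be combined and separated directly, the sketch below concatenates the two rounds with a 0.0 separator; the assumed 16 values per round then yields the 1 x 33 combined confidence feature that appears in the splicing example later in this description.

```python
import numpy as np

# Hedged sketch: concatenate the previous-round and current-round confidence
# features with a 0.0 separator (16 values per round is an assumption).
def combine_with_separator(pre_feat, cur_feat, sep=0.0):
    return np.concatenate([np.asarray(pre_feat),
                           np.asarray([sep]),
                           np.asarray(cur_feat)])

pre_conf = np.full(16, 0.99)           # previous round, padded to 16 values
cur_conf = np.full(16, 0.90)           # current round, padded to 16 values
conf_combined = combine_with_separator(pre_conf, cur_conf)   # shape (33,)
```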
Referring to fig. 2 and 6, step 012 (obtaining a first audio characteristic of the voice request of the user in the current round and a second audio characteristic of the voice request of the user in the previous round) includes:
0122: generating a digital feature matrix according to the user voice request of the current round and the user voice request of the previous round;
0124: reducing the dimensionality of the digital feature matrix to obtain a reduced dimensionality feature matrix;
0126: processing the context relation in the dimension reduction feature matrix to obtain a feature matrix to be processed;
0128: and reinforcing the key features of the feature matrix to be processed to obtain output audio features, wherein the output audio features comprise first audio features and second audio features.
Therefore, the audio features can be accurately obtained from the user voice request of the current round and the user voice request of the previous round. The audio features may be acquired with a speech-encoder model. Specifically, audio vector features may be extracted from the current-round and previous-round user voice requests using Mel-Frequency Cepstral Coefficients (MFCCs) to generate a digital feature matrix (a feature matrix in digital format, mfcc_extract); the matrix is then reduced in dimensionality by a 4-layer Convolutional Neural Network (CNN) to obtain a dimension-reduced feature matrix (cnn_model_fn); features are then associated and extracted according to context by a 1-layer bidirectional Long Short-Term Memory network (LSTM) to obtain a feature matrix to be processed (lstm_model_fn); finally, key features are reinforced by self-attention to obtain the output audio features (attention_fn), which include the first audio feature and the second audio feature.
The inputs to the MFCC are, for example, the "open the window" audio (previous round) and the "wind speed to gear two" audio (current round). The MFCC processing uses, for example, the python_speech_features audio processing package with parameters samplerate=16000, winlen=0.025, winstep=0.01, numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97, ceplifter=22, appendEnergy=True, and first- and second-order differences are taken of the result; the output of the MFCC is, for example, two 512 x 39 dimensional feature matrices. The input to the CNN is, for example, the output of the MFCC; the CNN consists of 4 one-dimensional convolutional layers and 4 max-pooling layers with parameters speech_embedding_size=512, speech_filters_num=32, speech_kernel_size=2, speech_strides_len=2, speech_pool_size=1; the output of the CNN is, for example, a 2 x 512 feature matrix. The input to the LSTM is, for example, the output of the CNN; the LSTM is a single-layer bidirectional LSTM model with parameter embedding_size=512; its output is, for example, a 2 x 1024 feature matrix. The input to the self-attention is, for example, the output of the LSTM; the self-attention model has parameter embedding_size=512, and its output is, for example, a 2 x 512 feature matrix.
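The following is a minimal PyTorch sketch of the speech-encoder pipeline described above (MFCC, 4-layer one-dimensional CNN with max pooling, single-layer bidirectional LSTM, self-attention). Layer sizes follow the parameters quoted above; the choice of PyTorch, the mean pooling of CNN frames into one vector per utterance, the single attention head and the final projection are assumptions made for illustration rather than the patented implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from python_speech_features import mfcc, delta

def mfcc_extract(signal, samplerate=16000):
    """13 MFCCs plus first- and second-order deltas -> (frames, 39) matrix."""
    feat = mfcc(signal, samplerate=samplerate, winlen=0.025, winstep=0.01,
                numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None,
                preemph=0.97, ceplifter=22, appendEnergy=True)
    d1 = delta(feat, 2)
    d2 = delta(d1, 2)
    return np.concatenate([feat, d1, d2], axis=1)

class SpeechEncoder(nn.Module):
    """Encode each utterance to a 512-dim vector, then relate the previous-round
    and current-round vectors with a bidirectional LSTM and self-attention."""
    def __init__(self, embed=512):
        super().__init__()
        blocks, ch = [], 39
        for _ in range(4):                                    # 4 conv + max-pool blocks
            blocks += [nn.Conv1d(ch, 32, kernel_size=2, stride=2), nn.ReLU(),
                       nn.MaxPool1d(kernel_size=1)]
            ch = 32
        self.cnn = nn.Sequential(*blocks)
        self.to_embed = nn.Linear(32, embed)
        self.lstm = nn.LSTM(embed, embed, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * embed, num_heads=1, batch_first=True)
        self.out = nn.Linear(2 * embed, embed)

    def encode_utterance(self, frames):                       # frames: (T, 39)
        x = torch.as_tensor(frames, dtype=torch.float32).T.unsqueeze(0)  # (1, 39, T)
        x = self.cnn(x)                                        # (1, 32, T')
        return self.to_embed(x.mean(dim=2))                    # (1, 512)

    def forward(self, pre_frames, cur_frames):
        seq = torch.stack([self.encode_utterance(pre_frames),
                           self.encode_utterance(cur_frames)], dim=1)  # (1, 2, 512)
        seq, _ = self.lstm(seq)                                # (1, 2, 1024)
        seq, _ = self.attn(seq, seq, seq)                      # self-attention
        return self.out(seq).squeeze(0)                        # (2, 512): [pre, cur]
```

With 512 MFCC frames per utterance this roughly reproduces the shapes quoted above: two 512 x 39 matrices, a 2 x 512 matrix after the CNN, 2 x 1024 after the bidirectional LSTM, and 2 x 512 output audio features.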
Referring to fig. 2 and 7, step 018 (obtaining the first text feature of the current round and the second text feature of the previous round) includes:
0182: encoding the text information of the current round and the text information of the previous round to obtain digital encoding information;
0184: a deep feature matrix is extracted from the digitally encoded information, the deep feature matrix including a first text feature and a second text feature.
Therefore, the text features can be accurately obtained from the text information of the current round and the text information of the previous round. The text features may be obtained based on a pre-trained language model (bert-encoder). Specifically, the text information may be encoded with bert-embedding to obtain context encoding information as the digital encoding information (bert_encoding_fn), and a deep feature matrix (transformer_model_fn) including the first text feature and the second text feature is then extracted from the digital encoding information using a transformer model.
For example, the input to bert-embedding is the text "open the window" (previous round) and the text "wind speed to gear two" (current round), and the output of bert-embedding is a 1 x 768 dimensional feature matrix. The input to the transformer model is, for example, the output of bert-embedding, and the output of the transformer model is, for example, a 1 x 768 dimensional feature matrix.
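A hedged sketch of the text branch follows. The use of the Hugging Face transformers library, the bert-base-uncased checkpoint, the sentence-pair encoding and the pooled output are assumptions for illustration; the description above only specifies a BERT-style encoder followed by a transformer model producing a 1 x 768 feature matrix.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def text_feature(pre_q: str, cur_q: str) -> torch.Tensor:
    # Feed the two rounds as a sentence pair so BERT inserts a [SEP] between them.
    inputs = tokenizer(pre_q, cur_q, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.pooler_output            # 1 x 768 matrix covering both rounds

feat = text_feature("open the window", "wind speed to gear two")
print(feat.shape)                       # torch.Size([1, 768])
```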
Referring to fig. 2 and 8, the voice interaction method includes:
032: splicing the first audio features, the second audio features, the first confidence coefficient features, the second confidence coefficient features, the first text features and the second text features to obtain a splicing matrix;
step 022 (performing rejection processing based on the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature, and the second text feature), comprising:
0224: and performing rejection processing according to the splicing matrix.
Thus, by splicing (concatenating) the features, rejection processing can be conveniently performed according to the spliced feature matrix. For example, the context confidence feature is a 1 x 33 dimensional feature matrix such as ... 0.99, 0.99, 0.85, 0.85, 0.0, 0.90, 0.90, 0.95, 0.95 ..., where 0.0 is the separator; the context text feature is a 1 x 768 dimensional feature matrix; and the context audio feature is a 2 x 512 dimensional feature matrix. The 2 x 512 context audio feature can first be reshaped into a 1 x 1024 feature matrix, and the three features are then spliced end to end to form a splicing matrix of dimension 1 x 1825.
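A small sketch of the splicing step, with the ordering of the three blocks (confidence, text, audio) assumed for illustration:

```python
import numpy as np

# Hedged sketch: reshape the 2 x 512 contextual audio features to 1 x 1024 and
# concatenate confidence (1 x 33), text (1 x 768) and audio (1 x 1024) features
# end to end into a 1 x 1825 splicing matrix.
conf_feat = np.zeros((1, 33))
text_feat = np.zeros((1, 768))
audio_feat = np.zeros((2, 512))

splice = np.concatenate([conf_feat,
                         text_feat,
                         audio_feat.reshape(1, 1024)], axis=1)
assert splice.shape == (1, 1825)
```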
Referring to fig. 2 and 9, step 0224 (reject processing according to the concatenation matrix) includes:
02242: recognizing the speaking object according to the splicing matrix to obtain the category of the speaking object;
02244: determining that the result is rejection under the condition that the speaking object type is a first preset type;
the voice interaction method is used for a voice assistant, and the speaking object types comprise: speaking to the voice assistant, not speaking to the voice assistant, being unable to judge and having no speaker; the first preset category includes: no speech assistant is speaking, no judgment is made, and no speaker is present.
In this way, speaking object recognition (DD) can be performed, and whether to reject can be determined according to the speaking object category. Here, speaking to the voice assistant may include definitely speaking to the voice assistant and probably speaking to the voice assistant, and not speaking to the voice assistant includes definitely not speaking to the voice assistant and probably not speaking to the voice assistant.
Referring to fig. 2 and 10, step 0224 (performing rejection processing according to the concatenation matrix) comprises:
02246: performing intention strength identification according to the splicing matrix to obtain an intention strength category;
02248: determining that the result is rejection under the condition that the intention strength category is a second preset category;
The intention strength categories include: strong intent, weak intent, no intent, and unable to judge; the second preset category includes: no intent and unable to judge.
In this way, intention strength (intensity) recognition can be performed, and whether to reject can be determined according to the intention strength category.
Multi-task classification can be realized by combining DD and intensity, making the rejection result more accurate. The splicing matrix can be passed through a fully connected (dense) layer into the semantic rejection model to classify DD and intensity.
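A hedged sketch of this multi-task head is shown below; the hidden size of 512 and the use of PyTorch are assumptions, while the input width of 1825, the 6 speaking-object classes and the 4 intention-strength classes follow the description above.

```python
import torch
import torch.nn as nn

class RejectionHeads(nn.Module):
    """Splicing matrix -> dense layer -> two classification heads (DD, intensity)."""
    def __init__(self, in_dim=1825, hidden=512, n_dd=6, n_intensity=4):
        super().__init__()
        self.dense = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.dd_head = nn.Linear(hidden, n_dd)                 # speaking-object classes
        self.intensity_head = nn.Linear(hidden, n_intensity)   # intention-strength classes

    def forward(self, splice):
        h = self.dense(splice)
        # Logits; softmax is applied by the loss during training or at inference.
        return self.dd_head(h), self.intensity_head(h)
```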
Referring to fig. 2 and fig. 11, the voice interaction method includes:
034: inputting the training set into a semantic rejection model for speaking object recognition and intention strength recognition to obtain a predicted speaking object category and a predicted intention strength category;
036: calculating a first loss according to the predicted speaking object category and the real speaking object category marked in the training set, and calculating a second loss according to the predicted intention intensity category and the real intention intensity category marked in the training set;
038: and training the semantic rejection model according to the first loss and the second loss.
Thus, after training, a semantic rejection model capable of recognizing the speaking object and the intention strength can be obtained. Specifically, the semantic rejection model may include a classifier that performs speaking object recognition and intention strength recognition. The predicted speaking object category and the predicted intention strength category are obtained by inputting a training set, pre-labeled with the real speaking object and the real intention strength, into the semantic rejection model. A first loss between the predicted and real speaking object categories and a second loss between the predicted and real intention strength categories may be calculated using a cross-entropy loss function, and the semantic rejection model is trained according to the first loss and the second loss to complete a training iteration. After iteration over the training set and validation set is complete, the semantic rejection model is output.
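A minimal training-step sketch under the same assumptions follows; it reuses the RejectionHeads module from the sketch above, and the optimizer, learning rate and equal weighting of the two cross-entropy losses are assumed for illustration.

```python
import torch
import torch.nn as nn

model = RejectionHeads()                          # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

def train_step(splice_batch, dd_labels, intensity_labels):
    dd_logits, intensity_logits = model(splice_batch)
    # First loss: speaking object; second loss: intention strength.
    loss = ce(dd_logits, dd_labels) + ce(intensity_logits, intensity_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```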
The step 022 of performing rejection processing according to the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature and the second text feature includes:
0226: and performing rejection processing by using the trained semantic rejection model, the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
The semantic rejection model may include a classifier that performs speaking object recognition and intention strength recognition. Specifically, the number of speaking object classes is n, for example n = 6 (definitely speaking to the voice assistant, probably speaking to the voice assistant, definitely not speaking to the voice assistant, probably not speaking to the voice assistant, unable to judge, and no speaker), and the number of intention strength classes is m, for example m = 4 (strong intent, weak intent, no intent, and unable to judge). The speaking object category and the intention strength category may be obtained by prediction using the softmax of the classifier.
Inputting the audio feature, confidence feature and text feature of "open the window" (the previous round) and of "wind speed to gear two" (the current round) into the semantic rejection model, DD can be predicted as "probably speaking to the voice assistant" and intensity as "strong intent".
TABLE 1
[Table 1, reproduced as an image in the original publication, compares the miss count, false-reject count and error ratio of semantic rejection models 1 to 3 on the same test set.]
Referring to table 1, the voice interaction method of the present invention uses semantic rejection model 3. When semantic rejection models 1, 2 and 3 are trained with the same training set and validation set and tested with the same test set, the error ratio of semantic rejection model 3 is reduced relative to semantic rejection model 1 by 1 - 6.67%/7.99% = 16.52%. The test set includes an invalid set and a valid set: the invalid set contains invalid instructions such as "haha" and "I don't know", and the valid set contains valid instructions such as "open the window" and "navigate to Daximen north". Miss count: the number of irrelevant instructions in the invalid set that are not rejected. False-reject count: the number of valid instructions in the valid set that are not passed through. Error count = miss count + false-reject count.
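A quick arithmetic check of the relative error reduction quoted above:

```python
# Values as cited in the text around Table 1: model 1 error ratio 7.99%,
# model 3 error ratio 6.67%.
model1_error, model3_error = 0.0799, 0.0667
relative_reduction = 1 - model3_error / model1_error
print(f"{relative_reduction:.2%}")   # ~16.52%
```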
Referring to fig. 3, the vehicle 10 according to the embodiment of the present invention includes a memory 120 and a processor 110, the memory 120 stores a computer program, and the processor 110 executes the computer program to implement the voice interaction method according to any one of the above embodiments.
For example, the computer program, when executed by the processor 110, may implement:
012: receiving a user voice request forwarded by the vehicle 10;
014: acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round;
016: acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition;
018: acquiring a first text feature of a current round and a second text feature of a previous round;
022: and performing rejection processing according to the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
In the vehicle 10, the rejection processing is performed by combining the audio features, confidence features and text features of the current round and the previous round, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
Referring to fig. 4, the server 20 according to the embodiment of the present invention includes a memory 120 and a processor 110, where the memory 120 stores a computer program, and the processor 110 executes the computer program to implement the voice interaction method according to any one of the above embodiments.
For example, the computer program, when executed by the processor 110, may implement:
012: receiving a user voice request forwarded by the vehicle 10;
014: acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round;
016: acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition;
018: acquiring a first text feature of a current round and a second text feature of a previous round;
022: and performing rejection processing according to the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
In the server 20, the audio features, confidence features and text features of the current round and the previous round are combined to perform rejection processing, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
Referring to fig. 12, a computer readable storage medium 300 according to an embodiment of the present invention stores a computer program, and the computer program is executed by the processor 110 to implement the voice interaction method according to any one of the above embodiments.
For example, the computer program when executed by the processor 110 may implement:
012: receiving a user voice request forwarded by the vehicle 10;
014: acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round;
016: acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition;
018: acquiring a first text feature of a current round and a second text feature of a previous round;
022: and performing rejection processing according to the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
In the computer-readable storage medium 300, the audio features, confidence features and text features of the current round and the previous round are combined to perform rejection processing, so that the rejection result is more accurate and both the missed-rejection rate and the false-rejection rate can be reduced, thereby lowering the overall rejection error rate.
In the present invention, the computer program comprises computer program code. The computer program code may be in the form of source code, object code, an executable file, some intermediate form, etc. The memory 120 may include high-speed random access memory and may also include non-volatile memory such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. The processor 110 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A voice interaction method, characterized in that the voice interaction method comprises:
receiving a user voice request forwarded by a vehicle;
acquiring a first audio characteristic of a voice request of a user in a current round and a second audio characteristic of a voice request of the user in a previous round;
acquiring a first confidence characteristic of the current round of voice recognition and a second confidence characteristic of the previous round of voice recognition;
acquiring a first text feature of a current round and a second text feature of a previous round;
performing rejection processing according to the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature and the second text feature;
the acquiring of the first audio feature of the voice request of the user in the current round and the second audio feature of the voice request of the user in the previous round includes:
generating a digital feature matrix according to the user voice request of the current round and the user voice request of the previous round;
reducing the dimensionality of the digital feature matrix to obtain a reduced dimensionality feature matrix;
processing the context relation in the dimension reduction feature matrix to obtain a feature matrix to be processed;
reinforcing key features of the feature matrix to be processed to obtain output audio features, the output audio features including the first audio feature and the second audio feature.
2. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
combining the first audio feature and the second audio feature to obtain an audio combination feature;
combining the first confidence feature and the second confidence feature to obtain a confidence combined feature;
combining the first text feature and the second text feature to obtain a text combination feature;
the rejecting process according to the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature and the second text feature includes:
and performing rejection processing according to the audio combined feature, the confidence combined feature and the text combined feature.
3. The method of claim 1, wherein the obtaining a first text feature of a current turn and a second text feature of a previous turn comprises:
encoding the text information of the current round and the text information of the previous round to obtain digital encoding information;
extracting a deep feature matrix from the digitally encoded information, the deep feature matrix including the first textual feature and the second textual feature.
4. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
concatenating the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature, and the second text feature to obtain a concatenation matrix;
the rejecting process according to the first audio feature, the second audio feature, the first confidence feature, the second confidence feature, the first text feature and the second text feature includes:
and performing rejection processing according to the splicing matrix.
5. The voice interaction method according to claim 4, wherein the performing the rejection processing according to the concatenation matrix comprises:
recognizing the speaking object according to the splicing matrix to obtain the category of the speaking object;
determining that the result is rejection under the condition that the speaking object type is a first preset type;
wherein the speaking object categories include: speaking to the voice assistant, not speaking to the voice assistant, unable to judge, and no speaker; the first preset category includes: not speaking to the voice assistant, unable to judge, and no speaker.
6. The voice interaction method according to claim 4, wherein the performing rejection processing according to the concatenation matrix comprises:
performing intention strength identification according to the splicing matrix to obtain an intention strength category;
determining that a result is rejection if the intention strength category is a second preset category;
the intention strength categories include: strong intent, weak intent, no intent, and unable to judge; the second preset category includes: no intent and unable to judge.
7. The voice interaction method according to claim 1, wherein the voice interaction method comprises:
inputting the training set into a semantic rejection model for speaking object recognition and intention strength recognition to obtain a predicted speaking object category and a predicted intention strength category;
calculating a first loss according to the predicted speaking object category and the real speaking object category marked in the training set, and calculating a second loss according to the predicted intention intensity category and the real intention intensity category marked in the training set;
training a semantic rejection model according to the first loss and the second loss;
and performing rejection processing by using the trained semantic rejection model, the first audio characteristic, the second audio characteristic, the first confidence characteristic, the second confidence characteristic, the first text characteristic and the second text characteristic.
8. A vehicle comprising a memory storing a computer program and a processor implementing the voice interaction method of any one of claims 1 to 7 when the computer program is executed by the processor.
9. A server, characterized in that the server comprises a memory and a processor, the memory storing a computer program, the processor implementing the voice interaction method according to any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method for voice interaction according to any one of claims 1 to 7.
CN202111606975.5A 2021-12-27 2021-12-27 Voice interaction method, vehicle, server and computer-readable storage medium Active CN113990300B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111606975.5A CN113990300B (en) 2021-12-27 2021-12-27 Voice interaction method, vehicle, server and computer-readable storage medium
PCT/CN2022/138595 WO2023124960A1 (en) 2021-12-27 2022-12-13 Speech interaction method, vehicle, server, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111606975.5A CN113990300B (en) 2021-12-27 2021-12-27 Voice interaction method, vehicle, server and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113990300A CN113990300A (en) 2022-01-28
CN113990300B true CN113990300B (en) 2022-05-10

Family

ID=79734314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111606975.5A Active CN113990300B (en) 2021-12-27 2021-12-27 Voice interaction method, vehicle, server and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN113990300B (en)
WO (1) WO2023124960A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113990300B (en) * 2021-12-27 2022-05-10 广州小鹏汽车科技有限公司 Voice interaction method, vehicle, server and computer-readable storage medium
CN115376513B (en) * 2022-10-19 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN115457945B (en) * 2022-11-10 2023-03-31 广州小鹏汽车科技有限公司 Voice interaction method, server and storage medium
CN116741151B (en) * 2023-08-14 2023-11-07 成都筑猎科技有限公司 User call real-time monitoring system based on call center
CN116959421B (en) * 2023-09-21 2023-12-19 湖北星纪魅族集团有限公司 Method and device for processing audio data, audio data processing equipment and medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8359629B2 (en) * 2009-09-25 2013-01-22 Intel Corporation Method and device for controlling use of context information of a user
US9443522B2 (en) * 2013-11-18 2016-09-13 Beijing Lenovo Software Ltd. Voice recognition method, voice controlling method, information processing method, and electronic apparatus
CN105529030B (en) * 2015-12-29 2020-03-03 百度在线网络技术(北京)有限公司 Voice recognition processing method and device
US10468032B2 (en) * 2017-04-10 2019-11-05 Intel Corporation Method and system of speaker recognition using context aware confidence modeling
CN108509619B (en) * 2018-04-04 2021-05-04 科大讯飞股份有限公司 Voice interaction method and device
CN111583919B (en) * 2020-04-15 2023-10-13 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111583907B (en) * 2020-04-15 2023-08-15 北京小米松果电子有限公司 Information processing method, device and storage medium
CN113221580B (en) * 2021-07-08 2021-10-12 广州小鹏汽车科技有限公司 Semantic rejection method, semantic rejection device, vehicle and medium
CN113990300B (en) * 2021-12-27 2022-05-10 广州小鹏汽车科技有限公司 Voice interaction method, vehicle, server and computer-readable storage medium

Also Published As

Publication number Publication date
WO2023124960A1 (en) 2023-07-06
CN113990300A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN113990300B (en) Voice interaction method, vehicle, server and computer-readable storage medium
US11127416B2 (en) Method and apparatus for voice activity detection
US10446150B2 (en) In-vehicle voice command recognition method and apparatus, and storage medium
WO2018149209A1 (en) Voice recognition method, electronic device, and computer storage medium
CN111179975A (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
Ferrer et al. A prosody-based approach to end-of-utterance detection that does not require speech recognition
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN108899033B (en) Method and device for determining speaker characteristics
CN109036471A (en) Sound end detecting method and equipment
US6850885B2 (en) Method for recognizing speech
CN108932944A (en) Coding/decoding method and device
CN113330511A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US20230368796A1 (en) Speech processing
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN115148211A (en) Audio sensitive content detection method, computer device and computer program product
Iqbal et al. Stacked convolutional neural networks for general-purpose audio tagging
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
CN114627868A (en) Intention recognition method and device, model and electronic equipment
JP2996019B2 (en) Voice recognition device
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN113658596A (en) Semantic identification method and semantic identification device
CN111145748A (en) Audio recognition confidence determining method, device, equipment and storage medium
CN114141271B (en) Psychological state detection method and system
CN114582373A (en) Method and device for recognizing user emotion in man-machine conversation
US11551666B1 (en) Natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant