CN115512691A - Method for judging echo based on semantic level in man-machine continuous conversation - Google Patents


Info

Publication number
CN115512691A
CN115512691A
Authority
CN
China
Prior art keywords
echo
phoneme
sequence pair
similarity
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211240000.XA
Other languages
Chinese (zh)
Inventor
刘光毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Homwee Technology Co ltd
Original Assignee
Homwee Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Homwee Technology Co ltd filed Critical Homwee Technology Co ltd
Priority to CN202211240000.XA priority Critical patent/CN115512691A/en
Publication of CN115512691A publication Critical patent/CN115512691A/en
Pending legal-status Critical Current

Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units


Abstract

The invention discloses a method for judging echo at the semantic level in man-machine continuous conversation. A reply request and a secondary request are collected and the data are processed to obtain an integer sequence pair and a phoneme sequence pair. The phoneme sequence pair is passed through a word embedding layer, LSTM encoding and L2 normalization to obtain two vectors, whose dot product gives the phoneme similarity. The integer sequence pair is encoded by a word embedding layer and the LSTM, the resulting sentence vectors are spliced together and combined with the phoneme similarity to form a new feature vector, which is passed through two Dense layers and finally activated by a softmax function to obtain an echo prediction result. If the phoneme similarity is greater than the phoneme similarity threshold and the prediction result is echo, the secondary request is judged to be an echo and recognition is refused. Judging echo at the semantic level reduces the voice confusion and misrecognition caused by echo and improves the user experience; at the same time, the strictly limited phoneme similarity threshold reduces misidentification of normal requests.

Description

Method for judging echo based on semantic level in man-machine continuous conversation
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for judging echo at the semantic level in man-machine continuous conversation.
Background
With the development of AI technology, more and more intelligent voice devices are deployed in markets, homes and other settings to provide guidance, question answering and similar services. In a service-type man-machine conversation, the user issues a voice request to the machine; the machine collects the voice through a microphone, encodes it, passes it to a processor for data processing and speech recognition, and, after semantic understanding, replies through a loudspeaker. Normally the voice request and the voice response do not overlap, but in full-duplex operation the microphone is not turned off while the loudspeaker plays the reply. In that case the loudspeaker's playback is easily picked up again and mistaken by the downstream semantic processing module for new voice input, causing voice confusion and misrecognition. If, instead, sound collection is disabled during playback, two problems arise: first, because replies vary in length, no suitable blanking time can be set; second, real-time follow-up voice and instructions from the user cannot be collected during playback, which greatly degrades the user's experience and satisfaction.
Disclosure of Invention
The invention aims to provide a method for judging echo at the semantic level in man-machine continuous conversation, to solve the prior-art problem that sound played by the loudspeaker is picked up by the microphone, mistaken for a new voice request, and misrecognized.
The invention solves this problem through the following technical scheme:
a method for judging echo based on semantic level in man-machine continuous conversation includes:
Step S100: collect the reply request and a secondary request whose time interval from the reply request is within a set time; after data processing of the two collected pieces of voice data, perform two kinds of processing: 1) direct sequence conversion to obtain an integer sequence pair; 2) pinyin conversion and phoneme conversion followed by sequence conversion to obtain a phoneme sequence pair;
Step S200: pass the phoneme sequence pair through a word embedding layer, LSTM encoding and L2 normalization to obtain two vectors, and take the dot product of the two vectors to obtain the phoneme similarity;
encode the integer sequence pair with a word embedding layer and the LSTM to obtain two sentence vectors, combine the two sentence vectors by vector splicing, splice the phoneme similarity with them through a fully connected layer to obtain a new feature vector, and pass the feature vector through two Dense layers and finally a softmax activation function to obtain the echo prediction result;
Step S300: compare the phoneme similarity with a set similarity threshold; if the phoneme similarity is greater than the threshold and the echo prediction result is echo, judge the secondary request to be an echo of the reply request and refuse recognition; otherwise recognize the secondary request normally.
When the integer sequence pair is encoded by the word embedding layer, a mask is added: in the forward pass, the 0s in the integer sequence pair are masked directly; in the backward pass, a meaningful reversal is performed first and the 0s are then masked, with the 0 padding left unchanged. A sketch of this reversal follows.
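As a minimal sketch of the meaningful reversal (a hypothetical helper, not code from the patent, assuming tokens are positive integers with trailing 0 padding):

    def reverse_keep_padding(seq):
        """Reverse the meaningful prefix of a 0-padded integer sequence,
        leaving the trailing 0 padding in place."""
        content = [t for t in seq if t != 0]       # meaningful tokens
        padding = [0] * (len(seq) - len(content))  # trailing 0 padding
        return content[::-1] + padding

    print(reverse_keep_padding([1, 2, 3, 4, 5, 0, 0, 0, 0, 0]))
    # [5, 4, 3, 2, 1, 0, 0, 0, 0, 0], not [0, 0, 0, 0, 0, 5, 4, 3, 2, 1]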
Here, data processing of the two collected pieces of voice data means cleaning the text, that is, removing punctuation marks and all characters other than numbers, Chinese characters and English letters.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the echo is judged at the semantic level, and the result and judgment are mutually corrected based on the phoneme similarity of two sentences and the analysis and classification of the semantics of the two sentences. By not performing subsequent voice response on the result judged to be the echo, the situations of voice confusion and misrecognition caused by the echo are reduced, and the use experience of the user under continuous conversation is improved; meanwhile, the threshold value of the phoneme similarity is strictly limited, and the misidentification of the normal request can be reduced.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
Referring to FIG. 1, a method for judging echo at the semantic level in man-machine continuous conversation includes:
1. Collecting and sorting echo data and analyzing data characteristics to obtain the training corpus
Echo-like data are screened out of user data from the existing continuous-conversation service to form sentence pairs (the previous round's reply and the current round's request), and the data characteristics are analyzed from the text alone, leaving aside the time difference between the two requests. Inspection of the data shows the basic characteristic of echo data: the request is highly similar in pronunciation to the previous round's reply. Non-echo data are then removed from the pronunciation-similar pairs, namely pairs in which the previous reply and the request stand in a relation such as "synonymy", "dialogue" or "opposition", for example:
a. Question and answer with the previous utterance: "call brother" - "brother", "do you like" - "I like you"
b. Expressing the opposite sense of the previous utterance: "He is not a fool" - "He is a fool"
c. Synonymous sentences: "help me open this" - "help me open this"
d. Other dialogue cases: "I don't like this" - "I don't like this either"
2. Data pre-processing
The sentence pairs are uniformly normalized (full-width characters converted to half-width, lower case converted to upper case, and punctuation marks and special characters other than Chinese characters, English letters and numbers removed), then converted to phonemes (using the International Phonetic Alphabet as the standard). The sentence pairs and their phonemes are both converted to sequences, so each sentence yields two positive-integer sequences: a phoneme sequence formed from its phonemes and an integer sequence formed from its characters, giving the phoneme sequence pair and the integer sequence pair.
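As a minimal sketch of this normalization step (the exact character ranges are an assumption, not taken from the patent):

    import re

    def normalize(text):
        """Full-width to half-width, lower to upper case, then keep only
        numbers, English letters and Chinese characters (assumed ranges)."""
        # full-width ASCII (U+FF01-U+FF5E) differs from ASCII by an offset of 0xFEE0
        text = ''.join(chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c
                       for c in text)
        text = text.upper()
        return re.sub(r'[^0-9A-Z\u4e00-\u9fff]', '', text)

    print(normalize('Ｈｅｌｌｏ，我想看电影！'))  # HELLO我想看电影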
3. Constructing the model, comprising a phoneme similarity part and a semantic analysis part, inputting the training corpus and training the model
The model has two parts. The first part judges phoneme similarity, i.e. whether the pronunciations are alike: the input phoneme sequence pair passes through an embedding layer (word embedding), an LSTM, L2 normalization and a dot product (Dot), and the similarity between the pair is output; this similarity serves as an additional feature used to correct the analysis result of the semantic relevance part. The second part analyzes semantic relevance: the sentence sequence pair passes through an embedding layer, an LSTM, fully-connected-layer splicing (concatenate) and a linear transformation (Dense) to obtain a semantic relevance feature vector; this vector is spliced with the phoneme similarity result, and the final prediction is output through a fully connected layer. The two outputs are combined: if the phoneme similarity is high while the relevance between the sentences is weak, the pair is treated as a positive echo sample.
For example:
phoneme similarity part:
for the phoneme sequence pair: a = [ a ] 0 ,a 1 ,a 2 …a n ],B=[b 0 ,b 1 ,b 2 …b n ]N is a limited sequence length, and after the first layer embedding layer, LSTM encoding and L2 normalization, two vectors representing the sentence phonemes can be obtained:
X A =(x a0 ,x a1 ,x a2 …x am ),X B =(x b0 ,x b1 ,x b2 …x bm )
and because it is L2 normalized, then there is X A ·X B =cos<X A ,X B >Because the subsequent labels are classified into two categories, i.e. similar is 1, and dissimilar is 0, after the two vectors are point-multiplied, during training, the two vectors are amplified by a set threshold value:
Figure BDA0003884898900000041
wherein the content of the first and second substances,
Figure BDA0003884898900000042
the similarity result of one phoneme is finally obtained for the set similarity threshold (0.85 taken in the implementation of the invention).
Semantic relevance part:
for integer sequence pairs: a '= [ a' 0 ,a′ 1 ,q′ 2 …a′ n ],B′=[b′ 0 ,b′ 1 ,b′ 2 …b′ n ]N is a limited sequence length, and similarly, after the embedding layer and LSTM coding, two sentence vectors are obtained, and mask masking is added during embedding, forward masking is directly applied to 0, and backward propagation needs to be processed before masking, meaningful inversion is carried out, and 0 is filled unchanged, such as 'I want to see a movie' [1 2 3 4 5 00 00 00]When the cover is propagated backward, the sequence should be changed to [5 4 3 2 100 00 0]Instead of [ 00 00 5 4 3 2 1]And ensuring that 0 represents specific semantic features during bidirectional propagation training:
X′ A =(x′ a0 ,x′ a1 ,x′ a2 …x′ am ),X′ B =(x′ b0 ,x′ b1 ,x′ b2 …x′ bm )
The two sentence vectors are combined by vector splicing:

$X = (x'_{a0}, x'_{a1}, x'_{a2}, \dots, x'_{am}, x'_{b0}, x'_{b1}, x'_{b2}, \dots, x'_{bm})$
with dimension $1 \times 2m$. A preliminary classification result is obtained through the subsequent fully connected layer; the phoneme similarity result is then spliced, as an additional feature, with the vector output by the fully connected layer to obtain a new feature vector, which passes through two Dense layers and is finally activated by a softmax function to give the final echo prediction. The classification model contains a mask layer for masking the meaningless padding integer 0; after the sentence pair is passed through a fully connected layer, the phoneme similarity feature is added as input to correct the result; and the final echo judgment combines several conditions (request time difference, pronunciation similarity, relevance and so on).
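Continuing the sketch above, the semantic relevance part and the feature splicing might look as follows; a bidirectional LSTM with mask_zero=True stands in for the masked forward/backward passes described above, and CHAR_VOCAB and the Dense sizes are assumptions:

    CHAR_VOCAB = 5000  # hypothetical character vocabulary size

    ca = layers.Input(shape=(SEQ_LEN,), name='chars_a')
    cb = layers.Input(shape=(SEQ_LEN,), name='chars_b')

    char_emb = layers.Embedding(CHAR_VOCAB, EMB_DIM, mask_zero=True)
    char_lstm = layers.Bidirectional(layers.LSTM(LSTM_UNITS))  # forward + backward passes

    sa = char_lstm(char_emb(ca))
    sb = char_lstm(char_emb(cb))

    sentence = layers.Concatenate()([sa, sb])               # vector splicing
    prelim = layers.Dense(64, activation='relu')(sentence)  # preliminary classification feature

    # splice the phoneme similarity in as an extra feature, then two Dense layers
    fused = layers.Concatenate()([prelim, phoneme_sim])
    hidden = layers.Dense(32, activation='relu')(fused)
    echo_pred = layers.Dense(2, activation='softmax', name='echo')(hidden)

    model = Model([pa, pb, ca, cb], [phoneme_sim, echo_pred])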
4. Time limit between the two requests
Because the prediction side has no hardware sound-pickup data, the time between the two requests must be limited. When a machine echo occurs, the time at which the pickup device captures the audio differs little from the time at which the loudspeaker plays the reply; setting aside network transmission delay and cloud semantic-analysis time, this is essentially the time difference between the cloud receiving the two requests. Data analysis shows that the network transmission delay plus the cloud interpretation time for a single request is about 600 ms, and this value is used as the time-difference threshold: if the time difference is within the set threshold, the echo judgment proceeds to the next step; otherwise the request is not considered an echo.
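A trivial gating sketch (the 600 ms figure comes from the analysis above; the timestamp source is an assumption):

    ECHO_WINDOW_MS = 600  # network delay plus cloud interpretation time for one request

    def within_echo_window(reply_time_ms, request_time_ms):
        """Only requests arriving within the window are candidate echoes."""
        return 0 <= request_time_ms - reply_time_ms <= ECHO_WINDOW_MS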
5. Echo determination using the trained model
Step S100: collect the reply request and a secondary request whose time interval from the reply request is within the set time (for example, 600 ms); after data processing of the two collected pieces of voice data, perform two kinds of processing (a conversion sketch follows the list):
1) Direct sequence conversion to obtain an integer sequence pair. For example, "I want to watch a movie" is converted character by character into an integer sequence such as [1, 2, 3, 4, 5, 0, 0, 0, 0, 0].
2) Pinyin conversion (implemented, for example, with the pypinyin package; the pronunciations of some common polyphonic characters are specified in dictionary form), after which the pinyin is converted phoneme by phoneme according to the international phoneme table (the pinyin being split by phoneme). For example, "I want to see a movie" is converted into "uo2 x iang3 k an4 d ian4 ing3", which is then converted by phoneme into a phoneme sequence such as [1, 2, 3, 4, 5, 6, 7, 8, 0, 0].
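A sketch of both conversions; pypinyin and its lazy_pinyin/Style.TONE3 API are real, while the character index table and the fixed length are hypothetical stand-ins for the ones the method would define:

    from pypinyin import Style, lazy_pinyin

    def pad_to(seq, length=10):
        """0-pad (or truncate) a sequence to a fixed length."""
        return (seq + [0] * length)[:length]

    # 1) character-level integer sequence (hypothetical vocabulary)
    char_index = {'我': 1, '想': 2, '看': 3, '电': 4, '影': 5}
    print(pad_to([char_index[c] for c in '我想看电影']))
    # [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]

    # 2) tone-numbered pinyin, to be split into phonemes and indexed
    print(lazy_pinyin('我想看电影', style=Style.TONE3))
    # ['wo3', 'xiang3', 'kan4', 'dian4', 'ying3']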
Step S200: the phoneme sequence pair is input to the phoneme similarity part of the model, which, after the word embedding layer, LSTM encoding and L2 normalization, produces two vectors whose dot product gives the phoneme similarity;
the integer sequence pair is input to the semantic relevance part of the model, which, after word-embedding-layer encoding and LSTM encoding, produces two sentence vectors; the two sentence vectors are combined by vector splicing, the phoneme similarity is spliced with them through a fully connected layer to obtain a new feature vector, and this vector passes through two Dense layers and is finally activated by a softmax function to give the echo prediction result;
Step S300: compare the phoneme similarity with the set similarity threshold; if the phoneme similarity is greater than the threshold and the echo prediction result is echo, judge the secondary request to be an echo of the reply request and refuse recognition; otherwise recognize the secondary request normally.
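A sketch of this final decision, assuming the model sketched earlier returns (phoneme_sim, echo_probs) and that class 1 of the softmax output means echo:

    import numpy as np

    SIM_THRESHOLD = 0.85  # the similarity threshold used in this implementation

    def reject_as_echo(phoneme_sim, echo_probs):
        """Refuse recognition only when both signals agree (step S300)."""
        predicted_echo = int(np.argmax(echo_probs)) == 1  # class 1 = echo (assumed)
        return phoneme_sim > SIM_THRESHOLD and predicted_echo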
Although the present invention has been described herein with reference to the illustrated embodiments, which are its preferred embodiments, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.

Claims (3)

1. A method for judging echo based on the semantic level in man-machine continuous conversation, characterized by comprising the following steps:
Step S100: collecting the reply request and a secondary request whose time interval from the reply request is within a set time, and, after data processing of the two collected pieces of voice data, performing two kinds of processing: 1) direct sequence conversion to obtain an integer sequence pair; 2) pinyin conversion and phoneme conversion followed by sequence conversion to obtain a phoneme sequence pair;
Step S200: passing the phoneme sequence pair through a word embedding layer, LSTM encoding and L2 normalization to obtain two vectors, and taking the dot product of the two vectors to obtain the phoneme similarity;
encoding the integer sequence pair with a word embedding layer and the LSTM to obtain two sentence vectors, combining the two sentence vectors by vector splicing, splicing the phoneme similarity with them through a fully connected layer to obtain a new feature vector, and passing the feature vector through two Dense layers and finally a softmax activation function to obtain an echo prediction result;
Step S300: comparing the phoneme similarity with a set similarity threshold; if the phoneme similarity is greater than the threshold and the echo prediction result is echo, judging the secondary request to be an echo of the reply request and refusing recognition; otherwise recognizing the secondary request normally.
2. The method of claim 1, wherein when the integer sequence pair is encoded by the word embedding layer, a mask is added: in the forward pass, the 0s in the integer sequence pair are masked directly; in the backward pass, a meaningful reversal is performed first and the 0s in the integer sequence pair are then masked, with the 0 padding left unchanged.
3. The method of claim 1, wherein the data processing of the two collected pieces of voice data removes punctuation marks and all characters other than numbers, Chinese characters and English letters.
CN202211240000.XA 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation Pending CN115512691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211240000.XA CN115512691A (en) 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211240000.XA CN115512691A (en) 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation

Publications (1)

Publication Number Publication Date
CN115512691A (en) 2022-12-23

Family

ID=84509174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211240000.XA Pending CN115512691A (en) 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation

Country Status (1)

Country Link
CN (1) CN115512691A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116110393A (en) * 2023-02-01 2023-05-12 镁佳(北京)科技有限公司 Voice similarity-based refusing method, device, computer and medium
CN116110393B (en) * 2023-02-01 2024-01-23 镁佳(北京)科技有限公司 Voice similarity-based refusing method, device, computer and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination