CN115512691A - Method for judging echo based on semantic level in man-machine continuous conversation - Google Patents


Info

Publication number
CN115512691A
CN115512691A
Authority
CN
China
Prior art keywords
echo
phoneme
sequence pair
similarity
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211240000.XA
Other languages
Chinese (zh)
Inventor
刘光毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Homwee Technology Co ltd
Original Assignee
Homwee Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Homwee Technology Co ltd filed Critical Homwee Technology Co ltd
Priority to CN202211240000.XA priority Critical patent/CN115512691A/en
Publication of CN115512691A publication Critical patent/CN115512691A/en
Pending legal-status Critical Current

Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units


Abstract

The invention discloses a method for judging echo at the semantic level in man-machine continuous conversation. A reply request and a secondary request are collected and the data are processed to obtain an integer sequence pair and a phoneme sequence pair. The phoneme sequence pair is passed through a word embedding layer, LSTM encoding and L2 normalization to obtain two vectors, whose dot product gives the phoneme similarity. The integer sequence pair is encoded by a word embedding layer and the LSTM, the resulting sentence vectors are spliced together and combined with the phoneme similarity to form a new feature vector, which is passed through two Dense layers and finally activated by a softmax function to obtain an echo prediction result. If the phoneme similarity is greater than the phoneme similarity threshold and the prediction result is echo, the secondary request is judged to be an echo and recognition is refused. Judging echo at the semantic level reduces the voice confusion and misrecognition caused by echo and improves the user experience; at the same time, the strictly limited phoneme similarity threshold reduces misidentification of normal requests.

Description

Method for judging echo based on semantic level in man-machine continuous conversation
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for judging echo at the semantic level in man-machine continuous conversation.
Background
With the development of AI technology, more and more intelligent voice devices are deployed in markets, homes and other settings to provide guidance, question answering and similar services. In a service-type man-machine conversation, the user issues a voice request to the machine; the machine collects the voice through a microphone, encodes it, passes it to a processor for data processing and speech recognition, and, after semantic understanding, replies through a loudspeaker. Normally the voice request and the voice response do not overlap, but in full-duplex operation the microphone is not turned off while the loudspeaker plays the reply. In that case the loudspeaker's playback is easily picked up again and mistaken by the downstream semantic processing module for new voice input, causing voice confusion and misrecognition. If, instead, sound collection is disabled during playback, two problems arise: first, because replies vary in length, no suitable blanking time can be set; second, real-time follow-up voice and instructions from the user cannot be collected during playback, which greatly degrades the user's experience and satisfaction.
Disclosure of Invention
The invention aims to provide a method for judging echo at the semantic level in man-machine continuous conversation, to solve the prior-art problem that sound played by the loudspeaker is picked up by the microphone, mistaken for a new voice request, and misrecognized.
The invention solves this problem through the following technical scheme:
a method for judging echo based on semantic level in man-machine continuous conversation includes:
Step S100: collect the reply request and a secondary request whose time interval from the reply request is within a set time; after data processing of the two collected pieces of voice data, perform two kinds of processing: 1) direct sequence conversion to obtain an integer sequence pair; 2) pinyin conversion and phoneme conversion followed by sequence conversion to obtain a phoneme sequence pair;
Step S200: pass the phoneme sequence pair through a word embedding layer, LSTM encoding and L2 normalization to obtain two vectors, and take the dot product of the two vectors to obtain the phoneme similarity;
encode the integer sequence pair with a word embedding layer and the LSTM to obtain two sentence vectors, combine the two sentence vectors by vector splicing, splice the phoneme similarity with them through a fully connected layer to obtain a new feature vector, and pass the feature vector through two Dense layers and finally a softmax activation function to obtain the echo prediction result;
Step S300: compare the phoneme similarity with a set similarity threshold; if the phoneme similarity is greater than the threshold and the echo prediction result is echo, judge the secondary request to be an echo of the reply request and refuse recognition; otherwise recognize the secondary request normally.
When the integer sequence pair is encoded by the word embedding layer, a mask is added: in the forward pass, the 0s in the integer sequence pair are masked directly; in the backward pass, a meaningful reversal is performed first and the 0s are then masked, with the 0 padding left unchanged. A sketch of this reversal follows.
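As a minimal sketch of the meaningful reversal (a hypothetical helper, not code from the patent, assuming tokens are positive integers with trailing 0 padding):

    def reverse_keep_padding(seq):
        """Reverse the meaningful prefix of a 0-padded integer sequence,
        leaving the trailing 0 padding in place."""
        content = [t for t in seq if t != 0]       # meaningful tokens
        padding = [0] * (len(seq) - len(content))  # trailing 0 padding
        return content[::-1] + padding

    print(reverse_keep_padding([1, 2, 3, 4, 5, 0, 0, 0, 0, 0]))
    # [5, 4, 3, 2, 1, 0, 0, 0, 0, 0], not [0, 0, 0, 0, 0, 5, 4, 3, 2, 1]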
Here, data processing of the two collected pieces of voice data means cleaning the text, that is, removing punctuation marks and all characters other than numbers, Chinese characters and English letters.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the echo is judged at the semantic level, and the result and judgment are mutually corrected based on the phoneme similarity of two sentences and the analysis and classification of the semantics of the two sentences. By not performing subsequent voice response on the result judged to be the echo, the situations of voice confusion and misrecognition caused by the echo are reduced, and the use experience of the user under continuous conversation is improved; meanwhile, the threshold value of the phoneme similarity is strictly limited, and the misidentification of the normal request can be reduced.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
Referring to FIG. 1, a method for judging echo at the semantic level in man-machine continuous conversation includes:
1. Collecting and sorting echo data and analyzing data characteristics to obtain the training corpus
Echo-like data are screened out of user data from the existing continuous-conversation service to form sentence pairs (the previous round's reply and the current round's request), and the data characteristics are analyzed from the text alone, leaving aside the time difference between the two requests. Inspection of the data shows the basic characteristic of echo data: the request is highly similar in pronunciation to the previous round's reply. Non-echo data are then removed from the pronunciation-similar pairs, namely pairs in which the previous reply and the request stand in a relation such as "synonymy", "dialogue" or "opposition", for example:
a. Question and answer with the previous utterance: "call brother" - "brother", "do you like" - "I like you"
b. Expressing the opposite sense of the previous utterance: "He is not a fool" - "He is a fool"
c. Synonymous sentences: "help me open this" - "help me open this"
d. Other dialogue cases: "I don't like this" - "I don't like this either"
2. Data pre-processing
The sentence pairs are uniformly normalized (full-width characters converted to half-width, lower case converted to upper case, and punctuation marks and special characters other than Chinese characters, English letters and numbers removed), then converted to phonemes (using the International Phonetic Alphabet as the standard). The sentence pairs and their phonemes are both converted to sequences, so each sentence yields two positive-integer sequences: a phoneme sequence formed from its phonemes and an integer sequence formed from its characters, giving the phoneme sequence pair and the integer sequence pair.
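As a minimal sketch of this normalization step (the exact character ranges are an assumption, not taken from the patent):

    import re

    def normalize(text):
        """Full-width to half-width, lower to upper case, then keep only
        numbers, English letters and Chinese characters (assumed ranges)."""
        # full-width ASCII (U+FF01-U+FF5E) differs from ASCII by an offset of 0xFEE0
        text = ''.join(chr(ord(c) - 0xFEE0) if 0xFF01 <= ord(c) <= 0xFF5E else c
                       for c in text)
        text = text.upper()
        return re.sub(r'[^0-9A-Z\u4e00-\u9fff]', '', text)

    print(normalize('Ｈｅｌｌｏ，我想看电影！'))  # HELLO我想看电影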
3. Constructing the model, comprising a phoneme similarity part and a semantic analysis part, inputting the training corpus and training the model
The model has two parts. The first part judges phoneme similarity, i.e. whether the pronunciations are alike: the input phoneme sequence pair passes through an embedding layer (word embedding), an LSTM, L2 normalization and a dot product (Dot), and the similarity between the pair is output; this similarity serves as an additional feature used to correct the analysis result of the semantic relevance part. The second part analyzes semantic relevance: the sentence sequence pair passes through an embedding layer, an LSTM, fully-connected-layer splicing (concatenate) and a linear transformation (Dense) to obtain a semantic relevance feature vector; this vector is spliced with the phoneme similarity result, and the final prediction is output through a fully connected layer. The two outputs are combined: if the phoneme similarity is high while the relevance between the sentences is weak, the pair is treated as a positive echo sample.
For example:
phoneme similarity part:
for the phoneme sequence pair: a = [ a ] 0 ,a 1 ,a 2 …a n ],B=[b 0 ,b 1 ,b 2 …b n ]N is a limited sequence length, and after the first layer embedding layer, LSTM encoding and L2 normalization, two vectors representing the sentence phonemes can be obtained:
X A =(x a0 ,x a1 ,x a2 …x am ),X B =(x b0 ,x b1 ,x b2 …x bm )
and because it is L2 normalized, then there is X A ·X B =cos<X A ,X B >Because the subsequent labels are classified into two categories, i.e. similar is 1, and dissimilar is 0, after the two vectors are point-multiplied, during training, the two vectors are amplified by a set threshold value:
Figure BDA0003884898900000041
wherein the content of the first and second substances,
Figure BDA0003884898900000042
the similarity result of one phoneme is finally obtained for the set similarity threshold (0.85 taken in the implementation of the invention).
Semantic relevance part:
for integer sequence pairs: a '= [ a' 0 ,a′ 1 ,q′ 2 …a′ n ],B′=[b′ 0 ,b′ 1 ,b′ 2 …b′ n ]N is a limited sequence length, and similarly, after the embedding layer and LSTM coding, two sentence vectors are obtained, and mask masking is added during embedding, forward masking is directly applied to 0, and backward propagation needs to be processed before masking, meaningful inversion is carried out, and 0 is filled unchanged, such as 'I want to see a movie' [1 2 3 4 5 00 00 00]When the cover is propagated backward, the sequence should be changed to [5 4 3 2 100 00 0]Instead of [ 00 00 5 4 3 2 1]And ensuring that 0 represents specific semantic features during bidirectional propagation training:
X′ A =(x′ a0 ,x′ a1 ,x′ a2 …x′ am ),X′ B =(x′ b0 ,x′ b1 ,x′ b2 …x′ bm )
The two sentence vectors are combined by vector splicing:

$X = (x'_{a0}, x'_{a1}, x'_{a2}, \dots, x'_{am}, x'_{b0}, x'_{b1}, x'_{b2}, \dots, x'_{bm})$
with dimension $1 \times 2m$. A preliminary classification result is obtained through the subsequent fully connected layer; the phoneme similarity result is then spliced, as an additional feature, with the vector output by the fully connected layer to obtain a new feature vector, which passes through two Dense layers and is finally activated by a softmax function to give the final echo prediction. The classification model contains a mask layer for masking the meaningless padding integer 0; after the sentence pair is passed through a fully connected layer, the phoneme similarity feature is added as input to correct the result; and the final echo judgment combines several conditions (request time difference, pronunciation similarity, relevance and so on).
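Continuing the sketch above, the semantic relevance part and the feature splicing might look as follows; a bidirectional LSTM with mask_zero=True stands in for the masked forward/backward passes described above, and CHAR_VOCAB and the Dense sizes are assumptions:

    CHAR_VOCAB = 5000  # hypothetical character vocabulary size

    ca = layers.Input(shape=(SEQ_LEN,), name='chars_a')
    cb = layers.Input(shape=(SEQ_LEN,), name='chars_b')

    char_emb = layers.Embedding(CHAR_VOCAB, EMB_DIM, mask_zero=True)
    char_lstm = layers.Bidirectional(layers.LSTM(LSTM_UNITS))  # forward + backward passes

    sa = char_lstm(char_emb(ca))
    sb = char_lstm(char_emb(cb))

    sentence = layers.Concatenate()([sa, sb])               # vector splicing
    prelim = layers.Dense(64, activation='relu')(sentence)  # preliminary classification feature

    # splice the phoneme similarity in as an extra feature, then two Dense layers
    fused = layers.Concatenate()([prelim, phoneme_sim])
    hidden = layers.Dense(32, activation='relu')(fused)
    echo_pred = layers.Dense(2, activation='softmax', name='echo')(hidden)

    model = Model([pa, pb, ca, cb], [phoneme_sim, echo_pred])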
4. Time limit between the two requests
Because the prediction side has no hardware sound-pickup data, the time between the two requests must be limited. When a machine echo occurs, the time at which the pickup device captures the audio differs little from the time at which the loudspeaker plays the reply; setting aside network transmission delay and cloud semantic-analysis time, this is essentially the time difference between the cloud receiving the two requests. Data analysis shows that the network transmission delay plus the cloud interpretation time for a single request is about 600 ms, and this value is used as the time-difference threshold: if the time difference is within the set threshold, the echo judgment proceeds to the next step; otherwise the request is not considered an echo.
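A trivial gating sketch (the 600 ms figure comes from the analysis above; the timestamp source is an assumption):

    ECHO_WINDOW_MS = 600  # network delay plus cloud interpretation time for one request

    def within_echo_window(reply_time_ms, request_time_ms):
        """Only requests arriving within the window are candidate echoes."""
        return 0 <= request_time_ms - reply_time_ms <= ECHO_WINDOW_MS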
5. Echo determination using the trained model
Step S100: collect the reply request and a secondary request whose time interval from the reply request is within the set time (for example, 600 ms); after data processing of the two collected pieces of voice data, perform two kinds of processing (a conversion sketch follows the list):
1) Direct sequence conversion to obtain an integer sequence pair. For example, "I want to watch a movie" is converted character by character into an integer sequence such as [1, 2, 3, 4, 5, 0, 0, 0, 0, 0].
2) Pinyin conversion (implemented, for example, with the pypinyin package; the pronunciations of some common polyphonic characters are specified in dictionary form), after which the pinyin is converted phoneme by phoneme according to the international phoneme table (the pinyin being split by phoneme). For example, "I want to see a movie" is converted into "uo2 x iang3 k an4 d ian4 ing3", which is then converted by phoneme into a phoneme sequence such as [1, 2, 3, 4, 5, 6, 7, 8, 0, 0].
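A sketch of both conversions; pypinyin and its lazy_pinyin/Style.TONE3 API are real, while the character index table and the fixed length are hypothetical stand-ins for the ones the method would define:

    from pypinyin import Style, lazy_pinyin

    def pad_to(seq, length=10):
        """0-pad (or truncate) a sequence to a fixed length."""
        return (seq + [0] * length)[:length]

    # 1) character-level integer sequence (hypothetical vocabulary)
    char_index = {'我': 1, '想': 2, '看': 3, '电': 4, '影': 5}
    print(pad_to([char_index[c] for c in '我想看电影']))
    # [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]

    # 2) tone-numbered pinyin, to be split into phonemes and indexed
    print(lazy_pinyin('我想看电影', style=Style.TONE3))
    # ['wo3', 'xiang3', 'kan4', 'dian4', 'ying3']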
Step S200: the phoneme sequence pair is input to the phoneme similarity part of the model, which, after the word embedding layer, LSTM encoding and L2 normalization, produces two vectors whose dot product gives the phoneme similarity;
the integer sequence pair is input to the semantic relevance part of the model, which, after word-embedding-layer encoding and LSTM encoding, produces two sentence vectors; the two sentence vectors are combined by vector splicing, the phoneme similarity is spliced with them through a fully connected layer to obtain a new feature vector, and this vector passes through two Dense layers and is finally activated by a softmax function to give the echo prediction result;
Step S300: compare the phoneme similarity with the set similarity threshold; if the phoneme similarity is greater than the threshold and the echo prediction result is echo, judge the secondary request to be an echo of the reply request and refuse recognition; otherwise recognize the secondary request normally.
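A sketch of this final decision, assuming the model sketched earlier returns (phoneme_sim, echo_probs) and that class 1 of the softmax output means echo:

    import numpy as np

    SIM_THRESHOLD = 0.85  # the similarity threshold used in this implementation

    def reject_as_echo(phoneme_sim, echo_probs):
        """Refuse recognition only when both signals agree (step S300)."""
        predicted_echo = int(np.argmax(echo_probs)) == 1  # class 1 = echo (assumed)
        return phoneme_sim > SIM_THRESHOLD and predicted_echo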
Although the present invention has been described herein with reference to the illustrated embodiments, which are its preferred embodiments, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.

Claims (3)

1. A method for judging echo based on the semantic level in man-machine continuous conversation, characterized by comprising the following steps:
Step S100: collecting the reply request and a secondary request whose time interval from the reply request is within a set time, and, after data processing of the two collected pieces of voice data, performing two kinds of processing: 1) direct sequence conversion to obtain an integer sequence pair; 2) pinyin conversion and phoneme conversion followed by sequence conversion to obtain a phoneme sequence pair;
Step S200: passing the phoneme sequence pair through a word embedding layer, LSTM encoding and L2 normalization to obtain two vectors, and taking the dot product of the two vectors to obtain the phoneme similarity;
encoding the integer sequence pair with a word embedding layer and the LSTM to obtain two sentence vectors, combining the two sentence vectors by vector splicing, splicing the phoneme similarity with them through a fully connected layer to obtain a new feature vector, and passing the feature vector through two Dense layers and finally a softmax activation function to obtain an echo prediction result;
Step S300: comparing the phoneme similarity with a set similarity threshold; if the phoneme similarity is greater than the threshold and the echo prediction result is echo, judging the secondary request to be an echo of the reply request and refusing recognition; otherwise recognizing the secondary request normally.
2. The method of claim 1, wherein when the integer sequence pair is encoded by the word embedding layer, a mask is added: in the forward pass, the 0s in the integer sequence pair are masked directly; in the backward pass, a meaningful reversal is performed first and the 0s in the integer sequence pair are then masked, with the 0 padding left unchanged.
3. The method of claim 1, wherein the data processing of the two collected pieces of voice data removes punctuation marks and all characters other than numbers, Chinese characters and English letters.
CN202211240000.XA 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation Pending CN115512691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211240000.XA CN115512691A (en) 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211240000.XA CN115512691A (en) 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation

Publications (1)

Publication Number Publication Date
CN115512691A (en) 2022-12-23

Family

ID=84509174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211240000.XA Pending CN115512691A (en) 2022-10-11 2022-10-11 Method for judging echo based on semantic level in man-machine continuous conversation

Country Status (1)

Country Link
CN (1) CN115512691A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116110393A (en) * 2023-02-01 2023-05-12 镁佳(北京)科技有限公司 Voice similarity-based refusing method, device, computer and medium
CN116110393B (en) * 2023-02-01 2024-01-23 镁佳(北京)科技有限公司 Voice similarity-based refusing method, device, computer and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination