CN115641850A - Method and device for recognizing the end of conversation turns, storage medium and computer equipment

Info

Publication number
CN115641850A
Authority
CN
China
Prior art keywords
utterance
turn
recognition
recognized
probability value
Prior art date
Legal status
Pending
Application number
CN202211212816.1A
Other languages
Chinese (zh)
Inventor
辛逸男
黄明星
王福钋
张航飞
徐华韫
曹富康
郭立钊
范野
沈鹏
Current Assignee
Beijing Absolute Health Ltd
Original Assignee
Beijing Absolute Health Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Absolute Health Ltd filed Critical Beijing Absolute Health Ltd
Priority to CN202211212816.1A
Publication of CN115641850A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for recognizing the end of a conversation turn, a storage medium and computer equipment, relates to the technical field of artificial intelligence, and mainly aims to improve the accuracy of recognizing the end of a conversation turn. The method comprises the following steps: acquiring an utterance text and a speech signal corresponding to an utterance to be recognized; inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized; inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized; and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended. The invention is suitable for recognizing the end of conversation turns.

Description

Method and device for recognizing the end of conversation turns, storage medium and computer equipment
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for recognizing the end of a conversation turn, a storage medium and computer equipment.
Background
In the insurance field, with the development of artificial intelligence, voice dialogue robot technology is maturing day by day. Voice robots can screen out users with stronger intent and thus improve service efficiency. After a user finishes speaking in any turn, the voice robot needs to reply effectively to what the user has said. On this basis, in order for the voice robot and the user to hold an effective conversation, improve service quality and increase user satisfaction, judging whether the user's speaking turn has ended in a multi-turn dialogue scenario is an urgent problem to be solved.
Currently, whether the user's speaking turn has ended is usually determined by analyzing the text content of the user's utterance. However, this method cannot determine the user's speaking tone: for a word such as "okay", whether it is spoken in an affirmative tone or a questioning tone changes the judgment of whether the corresponding speaking turn has ended, so the accuracy of recognizing the end of a speaking turn is low.
Disclosure of Invention
The invention provides a method, a device, a storage medium and computer equipment for recognizing the end of a conversation turn, which mainly aim to improve the accuracy of recognizing the end of a conversation turn.
According to a first aspect of the present invention, there is provided a method for recognizing the end of a conversation turn, comprising:
acquiring an utterance text and a speech signal corresponding to an utterance to be recognized;
inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized;
inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized;
and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
According to a second aspect of the present invention, there is provided an apparatus for recognizing the end of a conversation turn, comprising:
an acquiring unit, configured to acquire an utterance text and a speech signal corresponding to an utterance to be recognized;
a first recognition unit, configured to input the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized;
a second recognition unit, configured to input the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized;
and a judging unit, configured to judge, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring an utterance text and a speech signal corresponding to an utterance to be recognized;
inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized;
inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized;
and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:
acquiring an utterance text and a speech signal corresponding to an utterance to be recognized;
inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized;
inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized;
and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
Compared with the current approach of determining whether the user's speaking turn has ended by analyzing only the text content of the user's utterance, the method, device, storage medium and computer equipment for recognizing the end of a conversation turn provided by the invention acquire an utterance text and a speech signal corresponding to the utterance to be recognized; input the utterance text into a first preset conversation turn recognition model for turn-end recognition to obtain a first recognition result corresponding to the utterance to be recognized; then input the speech signal into a second preset conversation turn recognition model for turn-end recognition to obtain a second recognition result corresponding to the utterance to be recognized; and finally judge, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended. Because turn-end recognition is performed separately on the utterance text and the speech signal and the final judgment combines both recognition results, the situation where the speaker's tone cannot be judged because only text content is analyzed is avoided, and the accuracy of recognizing the end of a conversation turn is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a flowchart of a method for recognizing the end of a conversation turn according to an embodiment of the present invention;
fig. 2 is a flowchart of another method for recognizing the end of a conversation turn according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an apparatus for recognizing the end of a conversation turn according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another apparatus for recognizing the end of a conversation turn according to an embodiment of the present invention;
fig. 5 shows a physical structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
At present, whether the user's speaking turn has ended is determined by analyzing only the text content of the user's utterance; because text content alone cannot convey the speaker's tone, the accuracy of recognizing the end of a speaking turn is low.
In order to solve the above problem, an embodiment of the present invention provides a method for recognizing the end of a conversation turn. As shown in fig. 1, the method includes:
101. Acquire an utterance text and a speech signal corresponding to the utterance to be recognized.
The utterance to be recognized is the customer's utterance acquired by the outbound robot during a conversation with the customer.
In the embodiment of the present invention, in order to solve the problem of low accuracy in recognizing the end of a conversation turn in the prior art, turn-end recognition is performed separately on the utterance text and the speech signal corresponding to the utterance to be recognized, yielding a first recognition result and a second recognition result, and whether the conversation turn corresponding to the utterance to be recognized has ended is finally judged based on both results. This avoids the situation where the speaker's tone cannot be judged because only text content is analyzed, and thus improves the accuracy of recognizing the end of a conversation turn. The embodiment of the present invention is mainly applied to scenarios of recognizing whether a conversation turn has ended; the execution subject of the embodiment is a device or equipment capable of recognizing whether a conversation turn has ended, which may be arranged on the client side or the server side.
Specifically, when the outbound robot communicates with the customer, a recording device records the communication content. When the customer has stopped speaking for a preset length of time without speaking again, that speech segment of the customer is determined as the utterance to be recognized. The utterance to be recognized is fed into a preset ASR (Automatic Speech Recognition) model to obtain the corresponding utterance text, and at the same time the audio data corresponding to the utterance to be recognized is extracted from the recording device to obtain the corresponding speech signal. The utterance text and the speech signal are then recognized separately to obtain a first recognition result and a second recognition result, and finally whether the conversation turn corresponding to the utterance to be recognized has ended is judged based on the two results, thereby improving the accuracy of turn-end recognition.
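The following runnable Python sketch (not code from the patent) illustrates this overall flow; the stub functions are placeholders standing in for the ASR model and the two preset conversation turn recognition models described in this document, and all numeric values are arbitrary illustrations.

```python
def asr_transcribe(audio: bytes) -> str:
    # Placeholder: a real system would invoke a preset ASR model here.
    return "okay"

def text_turn_model(text: str) -> tuple[float, float]:
    # Placeholder for the first preset conversation turn recognition model:
    # returns (P(turn ended), P(turn not ended)) from the utterance text.
    return 0.8, 0.2

def audio_turn_model(audio: bytes) -> tuple[float, float]:
    # Placeholder for the second preset conversation turn recognition model:
    # returns (P(turn ended), P(turn not ended)) from the speech signal.
    return 0.2, 0.8

def detect_turn_end(audio: bytes, w_text: float = 0.75, w_audio: float = 0.25) -> bool:
    """Judge whether the conversation turn corresponding to the utterance has ended."""
    p1, p2 = text_turn_model(asr_transcribe(audio))  # first recognition result
    p3, p4 = audio_turn_model(audio)                 # second recognition result
    # Weighted fusion of both results (detailed in step 206 below).
    return w_text * p1 + w_audio * p3 > w_text * p2 + w_audio * p4
```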
102. Input the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized.
The first recognition result is the recognition result of whether the conversation turn has ended. In the embodiment of the present invention, after the utterance text corresponding to the utterance to be recognized is obtained, a semantic information vector corresponding to the utterance text is first obtained using a pre-trained BERT-Base model; the semantic information vector is then input into a preset multilayer perceptron, which outputs the first recognition result corresponding to the utterance to be recognized. Meanwhile, the speech signal is input into a second preset conversation turn recognition model for turn-end recognition to obtain a second recognition result, and finally whether the conversation turn corresponding to the utterance to be recognized has ended is judged based on the first and second recognition results. This prevents the robot from interrupting the customer and thereby missing the customer's real intention, improves the intelligence of the robot, and makes the dialogue more fluent.
103. Input the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized.
The second preset conversation turn recognition model may specifically be a preset classifier, and the second recognition result is likewise a recognition result of whether the conversation turn has ended.
In the embodiment of the present invention, after the speech signal corresponding to the utterance to be recognized is obtained, the spectrogram corresponding to the speech signal is determined and input into the preset classifier for classification, yielding the second recognition result corresponding to the utterance to be recognized. Finally, whether the conversation turn corresponding to the utterance to be recognized has ended is judged based on the first and second recognition results, which avoids the situation where the speaker's tone cannot be judged because only text content is analyzed, and improves the accuracy of turn-end recognition.
104. Judge, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
The end of a conversation turn means that the customer has finished describing one topic. In the embodiment of the present invention, after the first recognition result is determined from the utterance text and the second recognition result is determined from the speech signal, both results are considered together to judge whether the conversation turn corresponding to the utterance to be recognized has ended. Because turn-end recognition is performed separately on the utterance text and the speech signal and the final judgment combines both recognition results, the situation where the speaker's tone cannot be judged because only text content is analyzed is avoided, and the accuracy of recognizing the end of a conversation turn is improved.
Compared with the current approach of determining whether the user's speaking turn has ended by analyzing only the text content of the user's utterance, the method for recognizing the end of a conversation turn provided by this embodiment acquires the utterance text and speech signal corresponding to the utterance to be recognized, obtains a first recognition result from the text and a second recognition result from the speech signal, and judges whether the turn has ended based on both. This avoids the situation where the speaker's tone cannot be judged because only text content is analyzed, and improves the accuracy of recognizing the end of a conversation turn.
Further, in order to better describe the process of recognizing the end of a conversation turn, as a refinement and extension of the foregoing embodiment, an embodiment of the present invention provides another method for recognizing the end of a conversation turn. As shown in fig. 2, the method includes:
201. Acquire an utterance text and a speech signal corresponding to the utterance to be recognized.
Specifically, during a call between the outbound robot and a customer, the dialogue is recorded in real time by the recording device, and the silence duration after the customer stops speaking is tracked. When the silence duration exceeds a preset time, the speech signal corresponding to the customer's speech segment is obtained from the recording device, and the recorded audio corresponding to the utterance to be recognized is converted by the ASR model into text content, namely the utterance text.
202. Determine each character contained in the utterance text, and determine the embedding vector corresponding to each character.
In the embodiment of the present invention, in order to determine the first recognition result corresponding to the utterance to be recognized, each character contained in the utterance text must first be determined. For example, if the utterance text is "medical insurance application requirement", it is segmented character by character, e.g., medical / insurance / application / requirement. Each character of the utterance text is then converted into an embedding vector by a word-embedding method such as Word2Vec; the embedding vectors are input into a preset natural language model for semantic information recognition to obtain a semantic information vector corresponding to the utterance to be recognized, and the first recognition result is finally determined from that semantic information vector.
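As an illustration only (the vocabulary, embedding table and dimensions below are invented for the example; the patent names Word2Vec as one possible embedding method), step 202 can be sketched as:

```python
import numpy as np

EMBED_DIM = 128
rng = np.random.default_rng(0)

# Toy character vocabulary; in practice the table comes from Word2Vec-style training.
vocab = {ch: i for i, ch in enumerate(sorted(set("medical insurance application requirement")))}
embedding_table = rng.normal(size=(len(vocab), EMBED_DIM))

def embed_characters(utterance_text: str) -> np.ndarray:
    """Map each character of the utterance text to its embedding vector."""
    ids = [vocab[ch] for ch in utterance_text if ch in vocab]
    return embedding_table[ids]                # shape: (num_chars, EMBED_DIM)

X = embed_characters("medical insurance")      # one embedding vector per character
```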
203. Input the embedding vectors into a preset natural language model for semantic information recognition to obtain a semantic information vector corresponding to the utterance to be recognized.
The preset natural language model is a BERT model. The BERT model comprises a plurality of encoders connected end to end, so that the output of one encoder serves as the input of the next; each encoder comprises a self-attention layer and a feedforward neural network layer.
Specifically, in order to determine the first recognition result corresponding to the utterance to be recognized, the semantic information vector corresponding to the utterance must first be extracted. On this basis, step 203 specifically includes: inputting the embedding vectors into different attention subspaces of the self-attention layer for feature extraction to obtain first feature vectors of the utterance text under the different attention subspaces; multiplying the first feature vectors under the different attention subspaces by the weights corresponding to those subspaces and summing the products to obtain a self-attention layer output vector corresponding to the utterance text; adding the self-attention layer output vector to the embedding vectors to obtain a second feature vector corresponding to the utterance text; and inputting the second feature vector into the feedforward neural network layer for feature extraction to obtain the semantic information vector corresponding to the utterance to be recognized.
The first feature vectors are output vectors of the self-attention layer, and the semantic information vector corresponding to the utterance to be recognized is the output vector of the feedforward neural network layer of the last encoder.
Specifically, in the process of extracting the semantic information vector corresponding to the utterance to be recognized with the BERT model, the embedding vector corresponding to each character is first input into the self-attention layer of the first encoder for feature extraction, yielding the output vector of the self-attention layer, that is, the first feature vector corresponding to each character. The feature extraction in the self-attention layer proceeds as follows: determine the query vector, key vector and value vector corresponding to each character from that character's embedding vector; multiply the query vector of a target character by the key vector of every character to obtain each character's attention score for the target character; and multiply each character's attention score by its value vector and sum the products to obtain the first feature vector corresponding to the target character.
In the embodiment of the present invention, in the process of obtaining the first feature vector for each character, the embedding vector of each character in the utterance text may be multiplied by the weight matrices of the self-attention layer in the BERT model to obtain that character's query, key and value vectors. The attention score for each character must then be calculated. When calculating the attention score for any character (the target character), every character in the utterance text scores the target character: the query vector of the target character is multiplied by the key vector of each character to obtain each character's score for the target, i.e., the attention score. The attention scores are then multiplied by the corresponding value vectors and summed, finally yielding the self-attention layer output vector for the target character, i.e., its first feature vector. In this way the first feature vector of every character can be determined, from which the semantic information vector corresponding to the utterance to be recognized is obtained.
Further, in order to obtain the semantic information vector corresponding to the utterance to be recognized, after the embedding vectors of the characters in the utterance text are input into the self-attention layer of the first encoder and the first feature vector of each character is extracted, the first feature vector and the embedding vector of each character must be added to obtain the second feature vector of each character; the second feature vector is input into the feedforward neural network layer of the first encoder for feature extraction, yielding the output vector of the first encoder.
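A compact numpy sketch of this encoder computation follows. It is an illustration under simplifying assumptions: a single attention head with random stand-in weights, no layer normalization, and the conventional 1/sqrt(d) scaling of attention scores (which the text above does not spell out); a real BERT encoder uses trained parameters and multiple heads.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                           # embedding dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W_ff = rng.normal(size=(d, d)) * 0.1

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(X: np.ndarray) -> np.ndarray:
    """X: (num_chars, d) embedding vectors -> (num_chars, d) output features."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # query/key/value per character
    scores = softmax(Q @ K.T / np.sqrt(d))       # each character's attention score for each target
    attn_out = scores @ V                        # self-attention output (first feature vectors)
    X2 = attn_out + X                            # residual addition -> second feature vector
    return np.maximum(0.0, X2 @ W_ff)            # feedforward layer -> features for the next encoder

H = encoder_block(rng.normal(size=(7, d)))       # e.g. a 7-character utterance text
```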
204. Input the semantic information vector into the first preset conversation turn recognition model to perform turn-end recognition, so as to obtain the first recognition result corresponding to the utterance to be recognized.
The first preset conversation turn recognition model may specifically be a multilayer perceptron, which is a neural network model comprising an input layer, a hidden layer and an output layer.
In the embodiment of the present invention, after the semantic information vector corresponding to the utterance to be recognized is determined, the first recognition result must be determined from it. On this basis, step 204 specifically includes: inputting the semantic information vector into the multilayer perceptron and extracting the features output by the last fully connected layer in the multilayer perceptron; and inputting the features output by the last fully connected layer into the softmax layer of the multilayer perceptron to obtain a first probability value that the conversation turn corresponding to the utterance to be recognized has ended and a second probability value that it has not ended.
Specifically, the semantic information vector corresponding to the utterance to be recognized is passed through the input layer of the multilayer perceptron into the hidden layer, and the output of the hidden layer is:
f(W1·x + b1)
where x is the input to the multilayer perceptron (the semantic information vector obtained in step 203), W1 is the weight matrix of the hidden layer, which is also the connection coefficient of the multilayer perceptron, and b1 is the bias of the hidden layer. The activation function f may generally be the sigmoid function or the tanh function, as shown below:
sigmoid(x) = 1 / (1 + e^(-x))
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Further, after the input vector has passed through the input layer into the hidden layer and the hidden-layer output has been obtained, the result is input into the output layer, namely the softmax layer of the multilayer perceptron, which performs turn-end recognition. The obtained recognition result is:
softmax(W2·f(W1·x + b1) + b2)
where W2 is the weight matrix of the output layer and b2 is the bias of the output layer. The output layer of the multilayer perceptron outputs the first recognition result corresponding to the utterance to be recognized, namely the classification probabilities of whether the conversation turn has ended: a first probability value that the turn has ended and a second probability value that it has not ended.
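The formulas above correspond directly to the following runnable sketch (the weights here are random placeholders; a trained model would supply W1, b1, W2, b2, and the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 64, 32
W1, b1 = rng.normal(size=(d_hidden, d_in)) * 0.1, np.zeros(d_hidden)
W2, b2 = rng.normal(size=(2, d_hidden)) * 0.1, np.zeros(2)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp_turn_end(x: np.ndarray) -> tuple[float, float]:
    """Semantic information vector -> (first probability value, second probability value)."""
    hidden = sigmoid(W1 @ x + b1)       # f(W1·x + b1); tanh is the stated alternative
    probs = softmax(W2 @ hidden + b2)   # softmax(W2·f(W1·x + b1) + b2)
    return float(probs[0]), float(probs[1])

p_end, p_not_end = mlp_turn_end(rng.normal(size=d_in))
```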
205. Input the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized.
The second preset conversation turn recognition model may specifically be a preset classifier. The preset classifier is a preset neural network model, which may be a multilayer model such as a CNN-LSTM model.
In the embodiment of the present invention, in order to improve the recognition accuracy of the second preset conversation turn recognition model, the spectrogram corresponding to the speech signal must be determined first. On this basis, step 205 specifically includes: performing overlapping framing on the speech signal to obtain a framed speech signal; windowing the framed speech signal to obtain a windowed speech signal; performing Fourier transform on the windowed speech signal to obtain the spectral vectors corresponding to the windowed speech signal; arranging the spectral vectors in parallel along the time axis of a preset coordinate system to obtain the spectrogram corresponding to the speech signal; and inputting the spectrogram into the second preset conversation turn recognition model for turn-end recognition to obtain the second recognition result corresponding to the utterance to be recognized.
Specifically, the voice stream bytes of the call are recorded to collect the raw speech data. The speech signal is first split into overlapping frames and then windowed, with a window size of 25 ms and a window shift of 10 ms. A short-time Fourier transform is applied to the signal in each window to obtain the mel filter bank features of the speech signal, i.e., the spectral vectors, which are 160-dimensional. A preset coordinate system is established with time as the horizontal axis, and the spectral vectors are arranged in parallel along the time axis to obtain the spectrogram corresponding to the speech signal. The second recognition result corresponding to the utterance to be recognized is then obtained from the spectrogram using the preset classifier, which includes the following steps: determining the speech feature vector corresponding to the spectrogram; and inputting the speech feature vector into the classifier for classification to obtain a third probability value that the conversation turn corresponding to the utterance to be recognized has ended and a fourth probability value that it has not ended.
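A numpy sketch of this spectrogram construction follows. The 16 kHz sample rate and Hamming window are assumptions made for the example, and the mel filter bank stage that produces the 160-dimensional features is omitted for brevity:

```python
import numpy as np

def build_spectrogram(signal: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    frame_len = int(0.025 * sample_rate)            # 25 ms window
    hop = int(0.010 * sample_rate)                  # 10 ms window shift
    window = np.hamming(frame_len)
    frames = [
        signal[start:start + frame_len] * window    # overlapping framing + windowing
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
    spectra = [np.abs(np.fft.rfft(f)) for f in frames]  # Fourier transform per frame
    return np.stack(spectra, axis=1)                # spectral vectors side by side along time

speech = np.random.default_rng(0).normal(size=16000)    # 1 s of toy audio
spec = build_spectrogram(speech)                    # shape: (freq_bins, num_frames)
```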
Specifically, the local features and global features of the spectrogram are extracted using the convolutional layers of a preset CNN (convolutional neural network) model, and the local and global features are fused to obtain the speech feature vector corresponding to the spectrogram. The speech feature vector is then input into the preset classifier for classification, yielding the classification probabilities of whether the conversation turn corresponding to the utterance to be recognized has ended, i.e., the second recognition result: a third probability value that the turn has ended and a fourth probability value that it has not ended.
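As a rough illustration only — a single untrained convolution kernel, mean/max pooling standing in for the local/global feature fusion, and no LSTM stage, all simplifications of the CNN-LSTM classifier mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
KERNEL = rng.normal(size=(3, 3)) * 0.1              # one 3x3 convolution kernel
W_out, b_out = rng.normal(size=(2, 2)) * 0.1, np.zeros(2)

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def classify_spectrogram(spec: np.ndarray) -> tuple[float, float]:
    """Spectrogram -> (third probability value: ended, fourth: not ended)."""
    feat_map = np.maximum(0.0, conv2d_valid(spec, KERNEL))      # local features
    feature = np.array([feat_map.mean(), feat_map.max()])       # crude local/global fusion
    logits = W_out @ feature + b_out
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return float(probs[0]), float(probs[1])
```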
206. Judge, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
In the embodiment of the present invention, after the first and second recognition results corresponding to the utterance to be recognized are determined, whether the conversation turn has ended must be judged based on both. On this basis, step 206 specifically includes: determining a first weight coefficient jointly corresponding to the first probability value and the second probability value, and a second weight coefficient jointly corresponding to the third probability value and the fourth probability value; weighting and adding the first probability value and the third probability value based on the first and second weight coefficients to obtain a first total probability value that the conversation turn corresponding to the utterance to be recognized has ended, and weighting and adding the second probability value and the fourth probability value to obtain a second total probability value that it has not ended; if the first total probability value is greater than the second total probability value, determining that the conversation turn corresponding to the utterance to be recognized has ended; and if the first total probability value is smaller than the second total probability value, determining that the conversation turn has not ended.
Specifically, a first weight coefficient is set in advance for the recognition result of the first preset conversation turn recognition model, and a second weight coefficient for the recognition result of the second preset conversation turn recognition model. After the first preset model determines the first probability value (turn ended) and the second probability value (turn not ended), and the second preset model determines the third probability value (turn ended) and the fourth probability value (turn not ended), the first weight coefficient is multiplied by the first probability value to obtain a first product, and the second weight coefficient by the third probability value to obtain a second product; the first and second products are added to obtain the first total probability value that the turn has ended. At the same time, the first weight coefficient is multiplied by the second probability value to obtain a third product, and the second weight coefficient by the fourth probability value to obtain a fourth product; the third and fourth products are added to obtain the second total probability value that the turn has not ended. If the first total probability value is greater than the second total probability value, the conversation turn corresponding to the utterance to be recognized is determined to have ended; if it is smaller, the turn is determined not to have ended; and if the two total probability values are equal, the larger of the first and second probability values is determined, and the recognition result corresponding to that larger value is taken as the result of whether the conversation turn has ended.
For example, suppose inputting the utterance text into the first preset conversation turn recognition model yields a first probability value of 0.8 that the turn has ended and a second probability value of 0.2 that it has not, while inputting the speech signal into the second preset conversation turn recognition model yields a third probability value of 0.2 and a fourth probability value of 0.8. With a first weight coefficient of 0.75 and a second weight coefficient of 0.25, the first total probability value is 0.75 × 0.8 + 0.25 × 0.2 = 0.65, and the second total probability value is 0.75 × 0.2 + 0.25 × 0.8 = 0.35. Since 0.65 is greater than 0.35, the conversation turn corresponding to the utterance to be recognized is determined to have ended.
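The same decision rule as code (a sketch; the weights and probabilities are just the example's numbers, and the tie-breaking branch follows the equal-totals rule described above):

```python
def judge_turn_end(p1: float, p2: float, p3: float, p4: float,
                   w1: float = 0.75, w2: float = 0.25) -> bool:
    """True if the conversation turn is judged to have ended."""
    ended = w1 * p1 + w2 * p3        # first total probability value
    not_ended = w1 * p2 + w2 * p4    # second total probability value
    if ended != not_ended:
        return ended > not_ended
    # Equal totals: fall back to the larger of the first and second probability values.
    return p1 >= p2

print(judge_turn_end(0.8, 0.2, 0.2, 0.8))  # 0.65 vs 0.35 -> True (turn ended)
```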
Further, if the conversation turn corresponding to the utterance to be recognized has ended, the outbound robot replies according to the customer's intention. If the turn has not ended, the robot continues to wait for the preset time; if the customer continues speaking within the preset waiting time, the new speech segment and the previous one are treated as the customer's description of the same content, and the outbound robot gives a corresponding answer to that combined description.
Compared with the current approach of determining whether the user's speaking turn has ended by analyzing only the text content of the user's utterance, the other method for recognizing the end of a conversation turn provided by this embodiment acquires the utterance text and speech signal corresponding to the utterance to be recognized, obtains a first recognition result from the text and a second recognition result from the speech signal, and judges whether the turn has ended based on both results, thereby avoiding the situation where the speaker's tone cannot be judged because only text content is analyzed and improving the accuracy of recognizing the end of a conversation turn.
Further, as a specific implementation of fig. 1, an embodiment of the present invention provides an apparatus for recognizing the end of a conversation turn. As shown in fig. 3, the apparatus includes: an acquiring unit 31, a first recognition unit 32, a second recognition unit 33 and a judging unit 34.
The acquiring unit 31 may be configured to acquire an utterance text and a speech signal corresponding to the utterance to be recognized.
The first recognition unit 32 may be configured to input the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized.
The second recognition unit 33 may be configured to input the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized.
The judging unit 34 may be configured to judge, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
In a specific application scenario, in order to determine the first recognition result corresponding to the utterance to be recognized, as shown in fig. 4, the first recognition unit 32 includes a first determining module 321, a semantic recognition module 322 and a first recognition module 323.
The first determining module 321 may be configured to determine each character contained in the utterance text and the embedding vector corresponding to each character.
The semantic recognition module 322 may be configured to input the embedding vectors into a preset natural language model for semantic information recognition, so as to obtain a semantic information vector corresponding to the utterance to be recognized.
The first recognition module 323 may be configured to input the semantic information vector into the first preset conversation turn recognition model to perform turn-end recognition, so as to obtain the first recognition result corresponding to the utterance to be recognized.
In a specific application scenario, in order to determine the semantic information vector corresponding to the utterance text, the semantic recognition module 322 includes a feature extraction sub-module and a summation sub-module.
The feature extraction sub-module may be configured to input the embedding vectors into different attention subspaces of the self-attention layer for feature extraction, so as to obtain first feature vectors of the utterance text under the different attention subspaces.
The summation sub-module may be configured to multiply the first feature vectors of the utterance text under the different attention subspaces by the weights corresponding to those subspaces and sum the products, so as to obtain a self-attention layer output vector corresponding to the utterance text.
The summation sub-module may be specifically configured to add the self-attention layer output vector to the embedding vectors to obtain a second feature vector corresponding to the utterance text.
The feature extraction sub-module may be specifically configured to input the second feature vector into the feedforward neural network layer for feature extraction, so as to obtain the semantic information vector corresponding to the utterance to be recognized.
In a specific application scenario, in order to determine the first recognition result corresponding to the utterance to be recognized, the first recognition module 323 may be specifically configured to input the semantic information vector into the multilayer perceptron and extract the features output by the last fully connected layer in the multilayer perceptron; and to input those features into the softmax layer of the multilayer perceptron to obtain a first probability value that the conversation turn corresponding to the utterance to be recognized has ended and a second probability value that it has not ended.
In a specific application scenario, in order to determine the second recognition result corresponding to the utterance to be recognized, the second recognition unit 33 includes a framing processing module 331, a windowing processing module 332, a transform module 333, a parallel module 334 and a second recognition module 335.
The framing processing module 331 may be configured to perform overlapping framing on the speech signal to obtain a framed speech signal.
The windowing processing module 332 may be configured to window the framed speech signal to obtain a windowed speech signal.
The transform module 333 may be configured to perform Fourier transform on the windowed speech signal to obtain the spectral vectors corresponding to the windowed speech signal.
The parallel module 334 may be configured to arrange the spectral vectors in parallel along the time axis of a preset coordinate system, so as to obtain the spectrogram corresponding to the speech signal.
The second recognition module 335 may be configured to input the spectrogram into the second preset conversation turn recognition model to perform turn-end recognition, so as to obtain the second recognition result corresponding to the utterance to be recognized.
In a specific application scenario, in order to determine the second recognition result corresponding to the utterance to be recognized based on the spectrogram, the second recognition module 335 includes a determining sub-module and a classification sub-module.
The determining sub-module may be configured to determine the speech feature vector corresponding to the spectrogram.
The classification sub-module may be configured to input the speech feature vector into the classifier for classification, so as to obtain a third probability value that the conversation turn corresponding to the utterance to be recognized has ended and a fourth probability value that it has not ended.
In a specific application scenario, in order to judge whether the conversation turn corresponding to the utterance to be recognized has ended, the judging unit 34 includes a second determining module 341 and an adding module 342.
The second determining module 341 may be configured to determine a first weight coefficient jointly corresponding to the first probability value and the second probability value, and a second weight coefficient jointly corresponding to the third probability value and the fourth probability value.
The adding module 342 may be configured to weight and add the first probability value and the third probability value based on the first weight coefficient and the second weight coefficient to obtain a first total probability value that the conversation turn corresponding to the utterance to be recognized has ended, and to weight and add the second probability value and the fourth probability value to obtain a second total probability value that it has not ended.
The second determining module 341 may be specifically configured to determine that the conversation turn corresponding to the utterance to be recognized has ended if the first total probability value is greater than the second total probability value.
The second determining module 341 may be further configured to determine that the conversation turn corresponding to the utterance to be recognized has not ended if the first total probability value is smaller than the second total probability value.
It should be noted that, for other corresponding descriptions of the functional modules involved in the apparatus for recognizing the end of a conversation turn provided in the embodiment of the present invention, reference may be made to the corresponding description of the method shown in fig. 1, which is not repeated here.
Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps: acquiring an utterance text and a speech signal corresponding to an utterance to be recognized; inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized; inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized; and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
Based on the above embodiments of the method shown in fig. 1 and the apparatus shown in fig. 3, an embodiment of the present invention further provides an entity structure diagram of a computer device. As shown in fig. 5, the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, where the memory 42 and the processor 41 are both arranged on a bus 43, such that when the processor 41 executes the program, the following steps are performed: acquiring an utterance text and a speech signal corresponding to an utterance to be recognized; inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized; inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized; and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
According to the technical solution of the invention, the utterance text and speech signal corresponding to the utterance to be recognized are acquired; the utterance text is input into a first preset conversation turn recognition model for turn-end recognition to obtain a first recognition result; the speech signal is then input into a second preset conversation turn recognition model for turn-end recognition to obtain a second recognition result; and finally whether the conversation turn corresponding to the utterance to be recognized has ended is judged based on the first and second recognition results. Because turn-end recognition is performed separately on the utterance text and the speech signal and the final judgment combines both results, the situation where the speaker's tone cannot be judged because only text content is analyzed is avoided, and the accuracy of recognizing the end of a conversation turn is improved.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be executed in an order different from that described here, or they may each be made into an individual integrated circuit module, or multiple modules or steps among them may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for recognizing the end of a conversation turn, comprising:
acquiring an utterance text and a speech signal corresponding to an utterance to be recognized;
inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized;
inputting the speech signal into a second preset conversation turn recognition model to perform turn-end recognition, so as to obtain a second recognition result corresponding to the utterance to be recognized;
and judging, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the utterance to be recognized has ended.
2. The method of claim 1, wherein the inputting the utterance text into a first preset conversation turn recognition model to perform turn-end recognition, so as to obtain a first recognition result corresponding to the utterance to be recognized, comprises:
determining each character contained in the utterance text, and determining an embedding vector corresponding to each character;
inputting the embedding vectors into a preset natural language model for semantic information recognition to obtain a semantic information vector corresponding to the utterance to be recognized;
and inputting the semantic information vector into the first preset conversation turn recognition model to perform turn-end recognition, so as to obtain the first recognition result corresponding to the utterance to be recognized.
3. The method according to claim 2, wherein the preset natural language model is a preset encoder comprising a self-attention layer and a feedforward neural network layer, and the inputting the embedding vectors into the preset natural language model for semantic information recognition to obtain a semantic information vector corresponding to the utterance to be recognized comprises:
inputting the embedding vectors into different attention subspaces of the self-attention layer for feature extraction to obtain first feature vectors of the utterance text under the different attention subspaces;
multiplying the first feature vectors of the utterance text under the different attention subspaces by the weights corresponding to the different attention subspaces and summing the products to obtain a self-attention layer output vector corresponding to the utterance text;
adding the self-attention layer output vector to the embedding vectors to obtain a second feature vector corresponding to the utterance text;
and inputting the second feature vector into the feedforward neural network layer for feature extraction to obtain the semantic information vector corresponding to the utterance to be recognized.
4. The method according to claim 2, wherein the first preset conversation turn recognition model is a multilayer perceptron, and the inputting the semantic information vector into the first preset conversation turn recognition model to perform turn-end recognition, so as to obtain the first recognition result corresponding to the utterance to be recognized, comprises:
inputting the semantic information vector into the multilayer perceptron, and extracting the features output by the last fully connected layer in the multilayer perceptron;
and inputting the features output by the last fully connected layer into a softmax layer of the multilayer perceptron to obtain a first probability value that the conversation turn corresponding to the utterance to be recognized has ended and a second probability value that the conversation turn has not ended.
5. The method of claim 1, wherein inputting the voice signal into the second preset conversation-turn recognition model for conversation-turn end recognition to obtain the second recognition result corresponding to the speech to be recognized comprises:
performing overlapping framing on the voice signal, to obtain a framed voice signal;
windowing the framed voice signal, to obtain a windowed voice signal;
performing a Fourier transform on the windowed voice signal, to obtain the spectrum vector of each frame;
concatenating the spectrum vectors along the time axis of a preset coordinate system, to obtain the spectrogram corresponding to the voice signal (see the spectrogram sketch after the claims);
and inputting the spectrogram into the second preset conversation-turn recognition model for conversation-turn end recognition, to obtain the second recognition result corresponding to the speech to be recognized.
6. The method of claim 5, wherein the second preset conversation-turn recognition model is a classifier, and inputting the spectrogram into the second preset conversation-turn recognition model for conversation-turn end recognition to obtain the second recognition result corresponding to the speech to be recognized comprises:
determining the voice feature vector corresponding to the spectrogram;
and inputting the voice feature vector into the classifier for classification, to obtain a third probability value that the conversation turn corresponding to the speech to be recognized has ended and a fourth probability value that it has not ended (see the classifier sketch after the claims).
7. The method of any one of claims 4 to 6, wherein determining, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the speech to be recognized has ended comprises:
determining a first weight coefficient shared by the first probability value and the second probability value, and a second weight coefficient shared by the third probability value and the fourth probability value;
weighting the first probability value and the third probability value by the first and second weight coefficients respectively and adding them, to obtain a first total probability value that the conversation turn corresponding to the speech to be recognized has ended, and likewise weighting and adding the second probability value and the fourth probability value, to obtain a second total probability value that it has not ended (see the fusion sketch after the claims);
if the first total probability value is greater than the second total probability value, determining that the conversation turn corresponding to the speech to be recognized has ended;
and if the first total probability value is less than the second total probability value, determining that the conversation turn corresponding to the speech to be recognized has not ended.
8. An apparatus for recognizing the end of a conversation turn, comprising:
an acquiring unit configured to acquire the speech text and the voice signal corresponding to a speech to be recognized;
a first recognition unit configured to input the speech text into a first preset conversation-turn recognition model for conversation-turn end recognition, to obtain a first recognition result corresponding to the speech to be recognized;
a second recognition unit configured to input the voice signal into a second preset conversation-turn recognition model for conversation-turn end recognition, to obtain a second recognition result corresponding to the speech to be recognized;
and a determining unit configured to determine, based on the first recognition result and the second recognition result, whether the conversation turn corresponding to the speech to be recognized has ended.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
10. Computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the method of any one of claims 1 to 7.
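
The sketches below illustrate the techniques named in claims 2 through 7. They are minimal, hypothetical Python sketches, not the patent's implementation: every class name, layer size, and hyperparameter is an assumption made for illustration only.

First, the preset encoder of claims 2 and 3: character embeddings pass through multi-head self-attention (the "different attention subspaces"), the per-head outputs are mixed by learned weights (here, the attention module's built-in output projection), a residual connection adds the embeddings back, and a feedforward layer yields the semantic information vector. The single encoder layer and the mean pooling are likewise assumptions.

```python
import torch
import torch.nn as nn

class PresetEncoder(nn.Module):
    """Sketch of claims 2-3; all names and sizes are illustrative assumptions."""
    def __init__(self, vocab_size=8000, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)       # embedding vector per character
        self.attn = nn.MultiheadAttention(d_model, n_heads,  # heads = "attention subspaces";
                                          batch_first=True)  # a learned projection weights and sums them
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff),   # feedforward neural network layer
                                 nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, char_ids):              # char_ids: (batch, seq_len) character indices
        x = self.embed(char_ids)              # embedding vectors of the speech text
        attn_out, _ = self.attn(x, x, x)      # self-attention layer output vector
        x = attn_out + x                      # residual add: output vector + embeddings ("second feature vector")
        return self.ffn(x).mean(dim=1)        # pooled semantic information vector
```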
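
Claim 4's multilayer perceptron could then consume that vector; a softmax over two logits gives the first probability value (turn ended) and the second (turn not ended). The depth and hidden size are assumed.

```python
class TurnEndHead(nn.Module):
    """Sketch of claim 4: MLP whose softmax yields P(ended), P(not ended)."""
    def __init__(self, d_model=256, d_hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, 2)          # last fully connected layer: two classes

    def forward(self, semantic_vec):
        hidden = torch.relu(self.fc1(semantic_vec))
        logits = self.fc2(hidden)                  # features output by the last fully connected layer
        probs = torch.softmax(logits, dim=-1)      # softmax layer
        return probs[..., 0], probs[..., 1]        # first and second probability values
```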
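
The audio path of claim 5 is standard short-time Fourier analysis. A NumPy sketch, assuming a 16 kHz signal, 25 ms frames with a 10 ms hop (the overlapping framing), a Hamming window, and magnitude spectra; each of these parameter choices is an assumption.

```python
import numpy as np

def spectrogram(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Sketch of claim 5: overlapping framing -> windowing -> FFT -> spectrogram."""
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)                      # windowing (Hamming assumed)
    frames = [signal[i:i + frame_len] * window          # overlapping, windowed frames
              for i in range(0, len(signal) - frame_len + 1, hop_len)]
    spectra = [np.abs(np.fft.rfft(f)) for f in frames]  # spectrum vector of each frame
    return np.stack(spectra, axis=1)                    # spectra concatenated along the time axis
```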
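
For claim 6, any two-class classifier over features derived from the spectrogram fits the wording; a small 1-D convolution pooled into a voice feature vector is one plausible, assumed choice (the 201 frequency bins match the 25 ms frames at 16 kHz above).

```python
class SpectrogramClassifier(nn.Module):
    """Sketch of claim 6: spectrogram -> voice feature vector -> two-class softmax."""
    def __init__(self, n_bins=201):
        super().__init__()
        self.conv = nn.Conv1d(n_bins, 64, kernel_size=3, padding=1)  # assumed feature extractor
        self.fc = nn.Linear(64, 2)

    def forward(self, spec):                          # spec: (batch, n_bins, n_frames), as a torch tensor
        feat = torch.relu(self.conv(spec)).mean(-1)   # pooled voice feature vector
        probs = torch.softmax(self.fc(feat), dim=-1)
        return probs[:, 0], probs[:, 1]               # third and fourth probability values
```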
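
Finally, claim 7 fuses the two models with one weight coefficient per model. The weights below (0.6 text, 0.4 audio) are purely illustrative, and since the claim specifies only the strictly-greater and strictly-less cases, the sketch treats an exact tie as "not ended".

```python
def turn_has_ended(p1, p2, p3, p4, w_text=0.6, w_audio=0.4):
    """Sketch of claim 7: fuse text-model (p1, p2) and audio-model (p3, p4) outputs.

    p1/p3: probability the turn has ended; p2/p4: probability it has not.
    Each model's two probabilities share that model's weight coefficient.
    """
    total_ended = w_text * p1 + w_audio * p3       # first total probability value
    total_not_ended = w_text * p2 + w_audio * p4   # second total probability value
    return total_ended > total_not_ended           # True: the conversation turn has ended
```

For example, with p1 = 0.7, p2 = 0.3 from the text model and p3 = 0.4, p4 = 0.6 from the audio model, the totals are 0.58 and 0.42, so the turn is judged to have ended.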
CN202211212816.1A 2022-09-30 2022-09-30 Method and device for recognizing ending of conversation turns, storage medium and computer equipment Pending CN115641850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211212816.1A CN115641850A (en) 2022-09-30 2022-09-30 Method and device for recognizing ending of conversation turns, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211212816.1A CN115641850A (en) 2022-09-30 2022-09-30 Method and device for recognizing ending of conversation turns, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN115641850A true CN115641850A (en) 2023-01-24

Family

ID=84941766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211212816.1A Pending CN115641850A (en) 2022-09-30 2022-09-30 Method and device for recognizing ending of conversation turns, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN115641850A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496973A (en) * 2024-01-02 2024-02-02 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience
CN117496973B (en) * 2024-01-02 2024-03-19 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience

Similar Documents

Publication Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
US11875775B2 (en) Voice conversion system and training method therefor
Chauhan et al. Speaker recognition using LPC, MFCC, ZCR features with ANN and SVM classifier for large input database
KR100908121B1 (en) Speech feature vector conversion method and apparatus
JP2020515877A (en) Whispering voice conversion method, device, device and readable storage medium
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
CN109313892B (en) Robust speech recognition method and system
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN110136749A (en) The relevant end-to-end speech end-point detecting method of speaker and device
CN110570853A (en) Intention recognition method and device based on voice data
EP0549265A2 (en) Neural network-based speech token recognition system and method
CN1199488A (en) Pattern recognition
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN111883135A (en) Voice transcription method and device and electronic equipment
CN108986798A (en) Processing method, device and the equipment of voice data
Gupta et al. Speech feature extraction and recognition using genetic algorithm
CN116631412A (en) Method for judging voice robot through voiceprint matching
JP5235187B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
CN115641850A (en) Method and device for recognizing ending of conversation turns, storage medium and computer equipment
JP2021124530A (en) Information processor, information processing method and program
CN110853669A (en) Audio identification method, device and equipment
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
CN114171002A (en) Voice recognition method and device, electronic equipment and storage medium
CN109119073A (en) Audio recognition method, system, speaker and storage medium based on multi-source identification
CN111640423B (en) Word boundary estimation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100102 201 / F, block C, 2 lizezhong 2nd Road, Chaoyang District, Beijing

Applicant after: Beijing Shuidi Technology Group Co.,Ltd.

Address before: 100102 201, 2 / F, block C, No.2 lizezhong 2nd Road, Chaoyang District, Beijing

Applicant before: Beijing Health Home Technology Co.,Ltd.
