CN112037799A - Voice interrupt processing method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN112037799A
CN112037799A (application CN202011213393.6A); granted as CN112037799B
Authority
CN
China
Prior art keywords
voice
information
current
text
semantic recognition
Prior art date
Legal status
Granted
Application number
CN202011213393.6A
Other languages
Chinese (zh)
Other versions
CN112037799B (en)
Inventor
王艺霏
邓锐涛
刘彦华
刘云峰
Current Assignee
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority claimed from application CN202011213393.6A
Publication of CN112037799A
Application granted
Publication of CN112037799B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning

Abstract

The application relates to a voice interrupt processing method and device, computer equipment and a storage medium. The method comprises the following steps: in the voice broadcasting process, acquiring voice information of a user; performing text conversion on the voice information to obtain text information; identifying whether a filter word exists in the text information; if the filter word exists, acquiring the current broadcast script corresponding to the voice information, and performing semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result; and if the semantic recognition result is interrupt information, interrupting the voice broadcast. By adopting the method, voice interrupt information can be accurately identified, so as to improve service communication efficiency.

Description

Voice interrupt processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing voice interrupt, a computer device, and a storage medium.
Background
With the development of Internet technology, communication technology is developing ever more rapidly. More and more enterprises provide business services to users through intelligent customer service agents, such as intelligent robots. When an intelligent customer service agent performs voice broadcasting while communicating with a user, it can identify whether the user has produced interrupt information and, if so, interrupt the voice broadcast. In the conventional method, the user's voice information is converted into text, and whether interrupt information exists is judged from the text.
However, the conventional method determines interrupt information through text analysis alone, and the identification is inaccurate, which easily causes the intelligent customer service agent to be interrupted by mistake, resulting in unsmooth service communication and reduced service communication efficiency. Therefore, how to accurately recognize voice interrupt information so as to improve service communication efficiency has become a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice interrupt processing method, apparatus, computer device and storage medium capable of accurately recognizing voice interrupt information to improve service communication efficiency.
A voice interrupt processing method, the method comprising:
in the voice broadcasting process, acquiring voice information of a user;
performing text conversion on the voice information to obtain text information;
identifying whether a filter word exists in the text information;
if the filter word exists, acquiring the current broadcast script corresponding to the voice information, and performing semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result;
and if the semantic recognition result is interrupt information, interrupting the voice broadcast.
In one embodiment, performing semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result includes:
determining a current context category according to the current broadcast script;
and performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result.
In one embodiment, determining the current context category according to the current broadcast script comprises:
identifying whether a key sentence exists in the current broadcast script;
if a key sentence exists in the current broadcast script, determining that the current context category is a key sentence context;
and if no key sentence exists in the current broadcast script, determining that the current context category is a standard context.
In one embodiment, the performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result includes:
if the current context category is a key sentence context, acquiring a key sentence in the current broadcast script;
determining the time sequence relation between the key sentence and a preset filter word in the text information;
and performing semantic recognition on the text information according to the time sequence relation to obtain a semantic recognition result.
In one embodiment, the performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result includes:
if the current context type is a standard context, determining that the text information is invalid information, and taking the invalid information as a semantic recognition result;
the method further comprises the following steps:
and filtering out the voice information according to the semantic recognition result, and continuing to broadcast the current broadcast script without interrupting the voice broadcast.
In one embodiment, after the interrupting the voice broadcast if the semantic recognition result is interrupt information, the method further includes:
acquiring complete voice corresponding to the voice information, and performing text conversion on the complete voice to obtain a text to be recognized;
inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result;
and executing corresponding response operation according to the intention recognition result.
In one embodiment, the method further comprises:
if the preset filter words do not exist in the text information, interrupting voice broadcasting and acquiring complete voice corresponding to the voice information;
performing text conversion on the complete voice to obtain a text to be recognized;
inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result;
and executing corresponding response operation according to the intention recognition result.
In one embodiment, the performing the corresponding response operation according to the intention recognition result includes:
if an intention type exists in the intention recognition result, broadcasting reply information corresponding to the intention type, or jumping from the node corresponding to the current broadcast script to the node corresponding to the intention type;
and if no intention type exists in the intention recognition result, continuing the voice broadcast from the interrupted position of the current broadcast script.
A voice interrupt processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring the voice information of the user in the voice broadcasting process;
the text conversion module is used for performing text conversion on the voice information to obtain text information;
the information identification module is used for identifying whether the text information contains filter words or not;
the semantic recognition module is used for acquiring the current broadcast script corresponding to the voice information if the filter word exists, and performing semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result;
and the voice control module is used for interrupting the voice broadcast if the semantic recognition result is interruption information.
A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, the processor implementing the steps in the various method embodiments described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the respective method embodiment described above.
According to the voice interrupt processing method and device, the computer equipment and the storage medium, the voice information of a user is acquired during voice broadcasting, and text conversion is performed on the voice information to obtain text information. If a filter word exists, the current broadcast script corresponding to the voice information is acquired, and semantic recognition is performed on the text information according to the current broadcast script to obtain a semantic recognition result. If the semantic recognition result is interrupt information, the voice broadcast is interrupted. Whether a filter word exists in the user's speech can thus be quickly identified during voice broadcasting, and the current conversation context is determined through semantic recognition, so that the interrupt strategy for the filter word in the corresponding context can be applied and the user's interrupt intention judged accurately. The response operation can then be executed correctly, the problems of mistaken interruption of the voice broadcast, unsmooth service communication and even disrupted flow of the subsequent conversation are avoided, and the service communication efficiency between the terminal and the user is effectively improved.
Drawings
FIG. 1 is a diagram of an application environment for a method of speech interrupt handling in one embodiment;
FIG. 2 is a flow diagram illustrating a method for handling speech interrupts, according to one embodiment;
FIG. 3 is a flowchart illustrating the steps of performing semantic recognition on text information according to a current broadcast script to obtain a semantic recognition result in one embodiment;
FIG. 4 is a flowchart illustrating the steps of performing semantic recognition on text information according to a current context category to obtain a semantic recognition result according to an embodiment;
FIG. 5 is a flowchart illustrating a method for handling speech interrupts according to another embodiment;
FIG. 6 is a flowchart illustrating a method for handling speech interrupts according to another embodiment;
FIG. 7 is a block diagram of a speech interrupt processing apparatus according to one embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice interrupt processing method provided by the application can be applied to the application environment shown in fig. 1, in which the voice collecting device 102 and the terminal 104 communicate through a network. During voice broadcasting, the terminal 104 acquires the user's voice information through the voice collecting device 102 and, after acquiring it, performs text conversion on the voice information to obtain text information. The terminal 104 then identifies whether a filter word exists in the text information; if so, it acquires the current broadcast script corresponding to the voice information and performs semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result. If the semantic recognition result is interrupt information, the terminal 104 interrupts the voice broadcast. While broadcasting a script, the terminal can thus accurately judge the user's voice interaction content based on semantic recognition, interrupt the voice broadcast accurately, and improve service communication efficiency. The voice collecting device 102 may be a microphone, a terminal or another device with a voice collecting function. The terminal 104 includes, but is not limited to, robots of various service types, such as intelligent customer service agents, telephone robots and chat robots.
In an embodiment, as shown in fig. 2, a method for processing a voice interrupt is provided, which is described by taking an example that the method is applied to a terminal, and includes the following steps:
step 202, in the voice broadcasting process, the voice information of the user is acquired.
The voice broadcast refers to the process in which the terminal broadcasts a script. The script refers to a fixed conversation pattern, stored in the terminal in advance, for a voice interaction scenario. For example, the terminal may be broadcasting service consultation information, service marketing information, or news information. The voice information refers to the user's speech collected by the terminal in real time during the voice broadcast. For example, while the terminal is broadcasting the sentences "Hello, this is Zhangsan Bank. May I ask, are you Li Si?", the user's voice information, such as "Uh-huh, yes, that's me.", may be collected at any point between "Hello, this is Zhangsan Bank." and "May I ask, are you Li Si?".
The terminal can provide the user with voice services such as business consultation, complaint handling, online question answering or voice navigation through voice interaction. The terminal can perform voice broadcasting during the voice interaction and, in this process, collect the user's voice information in real time. The user refers to the target user who interacts with the terminal by voice and to whom the voice service is provided. In one embodiment, the terminal may extract and store the user's voiceprint features in advance during the voice interaction. Voiceprint features are the characteristics of a voice; every speaking voice has its own characteristics, and the voices of different people can be effectively distinguished through voiceprint features. When the terminal collects voice information during the voice broadcast, it can extract the voiceprint features of the voice information and calculate the similarity between the extracted voiceprint features and the pre-stored voiceprint features of the user. If the similarity reaches a threshold value, the terminal takes the voice information as the user's voice information. The threshold value indicates a preset degree of similarity between the voiceprint features of two pieces of voice information. If the similarity reaches the threshold value, the voiceprint features of the collected voice information are sufficiently similar to the pre-stored voiceprint features of the user, and the voice information can be determined to be the user's. Voiceprint recognition can accurately distinguish the target user's voice from other people's voices, which prevents other people's voice information from being taken as the user's and improves the accuracy of voice information collection.
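The voiceprint matching described above can be sketched as a simple similarity check. The cosine similarity measure and the 0.8 threshold below are illustrative assumptions; the embodiment does not specify a feature representation or a threshold value.

```python
import math

def cosine_similarity(a, b):
    # Compare two voiceprint feature vectors (hypothetical fixed-length embeddings).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_target_user(extracted, enrolled, threshold=0.8):
    # Accept the utterance as the target user's speech only when the similarity
    # between the live voiceprint and the pre-stored one reaches the threshold,
    # as the embodiment describes.
    return cosine_similarity(extracted, enrolled) >= threshold
```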
And step 204, performing text conversion on the voice information to obtain text information.
After the terminal collects the voice information, the voice information can be converted into corresponding text information. Meanwhile, the terminal can normally broadcast the speech. Specifically, the terminal may perform Speech Recognition on the Speech information by using an Automatic Speech Recognition technology (ASR for short) to obtain corresponding text information.
In one embodiment, after acquiring the voice information, the terminal may also first pause the script broadcast and record the pause position. For example, while the terminal is broadcasting "Hello, this is Zhangsan Bank. May I ask, are you Li Si?", if the user's voice information is collected after "Hello, this is Zhangsan Bank." has been broadcast and while "May I ask, are you Li Si?" is being broadcast, the pause position is recorded as the point between those two sentences. The terminal can perform text conversion on the voice information after pausing the broadcast, acquire the user's complete speech by keeping the broadcast paused, and continue the broadcast from the paused position after responding to the user's voice information.
In one embodiment, the terminal may further perform deletion processing on the text information. The terminal stores, in advance, the preset characters to be deleted. For example, the preset characters may include letters, punctuation marks and other characters that interfere with, or do not contribute to, the accuracy of semantic recognition.
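As an illustrative sketch of this deletion processing, assuming the preset characters are Latin letters and common punctuation (the embodiment only names "letters, punctuation marks, and the like"):

```python
import re

# Hypothetical set of preset characters to delete; the embodiment leaves the
# exact set open, so this pattern is an assumption for illustration.
PRESET_CHARS = re.compile(r"[A-Za-z!?,.:;！？，。：；]")

def clean_text(text):
    # Delete preset interfering characters from the converted text
    # before semantic recognition.
    return PRESET_CHARS.sub("", text)
```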
Step 206, identifying whether the text information has the filter word.
And step 208, if the filter word exists, acquiring the current broadcast script corresponding to the voice information, and performing semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result.
The terminal stores a filter word list in advance. The filter word list may include filter words obtained by statistics. The filter words are used to determine whether the voice broadcast needs to be interrupted. The filter words may include filler words and backchannel words, for example, filler words such as "uh", "hmm" and "oh", and backchannel words such as "okay" and "got it".
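The filter-word check of step 206 can be sketched as a lookup against the pre-stored list. The word list below is a hypothetical example built from the filler and backchannel words the embodiment mentions.

```python
# Hypothetical filter-word list; a real system would load a
# statistically derived list, as the embodiment describes.
FILTER_WORDS = ["嗯", "啊", "哦", "好的", "知道了"]

def contains_filter_word(text):
    # Step 206: identify whether any preset filter word appears in the text.
    return any(word in text for word in FILTER_WORDS)
```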
Because the same filter word requires different voice broadcast interrupt strategies in different contexts, when the terminal recognizes that a filter word exists in the text information, it can perform semantic recognition on the text information according to the current broadcast script. Semantic recognition refers to recognizing whether the text information is interrupt information. The semantic recognition mode may be key sentence recognition, semantic recognition through a semantic recognition model, or another semantic recognition mode, which is not limited in this embodiment.
When the terminal recognizes that no filter word exists in the text information, the user's voice information is meaningful, so the terminal needs to interrupt the voice broadcast and acquire the user's complete speech; it then performs text conversion and intention recognition on the complete speech to obtain an intention recognition result, and executes a corresponding response operation according to that result. The response operation may include replying to the user, jumping to the corresponding voice broadcast node, continuing the voice broadcast, or other operations.
And step 210, if the semantic recognition result is interrupt information, interrupting the voice broadcast.
The semantic recognition result refers to the result obtained by the terminal performing semantic recognition on the text information according to the current broadcast script, and may be either interrupt information or non-interrupt information. Interrupt information indicates that the text information is meaningful and the terminal needs to interrupt the voice broadcast. Non-interrupt information indicates that the text information is invalid and does not affect the terminal's voice broadcast process. If the semantic recognition result is interrupt information, the terminal immediately stops the voice broadcast, performs intention recognition on the text information, and executes a corresponding response operation according to the intention recognition result. The response operation may include replying to the user, jumping to the corresponding voice broadcast node, continuing the voice broadcast, or other operations.
In this embodiment, the terminal acquires the user's voice information during voice broadcasting and performs text conversion on the voice information to obtain text information. If a filter word exists, the terminal acquires the current broadcast script corresponding to the voice information and performs semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result. If the semantic recognition result is interrupt information, the voice broadcast is interrupted. Whether a filter word exists in the user's speech can thus be quickly identified during voice broadcasting, and the current conversation context is determined through semantic recognition, so that the interrupt strategy for the filter word in the corresponding context can be applied and the user's interrupt intention judged accurately. The terminal can then execute the response operation correctly, the problems of mistaken interruption of the terminal's voice broadcast, unsmooth service communication and even disrupted flow of the subsequent conversation are avoided, and the service communication efficiency between the terminal and the user is effectively improved.
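The flow of steps 202 to 210 can be sketched as follows. The callables passed in (ASR, filter-word check, semantic recognition, broadcast control) are hypothetical stand-ins for components the embodiment leaves abstract.

```python
def handle_user_voice(voice_info, asr, has_filter_word, recognize_semantics,
                      interrupt_broadcast, continue_broadcast):
    # Steps 202-210: convert speech to text, check for filter words,
    # and decide whether to interrupt the ongoing broadcast.
    text = asr(voice_info)                 # step 204: text conversion
    if not has_filter_word(text):          # step 206: no filter word means
        interrupt_broadcast()              # meaningful speech: interrupt
        return "interrupted"
    result = recognize_semantics(text)     # step 208: context-aware recognition
    if result == "interrupt":              # step 210: interrupt information
        interrupt_broadcast()
        return "interrupted"
    continue_broadcast()                   # invalid information: keep broadcasting
    return "continued"
```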
In an embodiment, as shown in fig. 3, the step 208 of performing semantic recognition on the text information according to the current broadcast script to obtain a semantic recognition result includes:
Step 302, determining a current context category according to the current broadcast script.
And 304, performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result.
Semantic recognition here means determining the context category corresponding to the current voice interaction and judging, according to that category, whether the text information is interrupt information. The context categories may include the standard context and the key sentence context. The standard context refers to a context in which the user's voice information does not affect the terminal's voice broadcast process. The key sentence context refers to a context in which the terminal acquires the user's voice information while broadcasting a key sentence, so that the voice information may affect the terminal's voice broadcast process.
If the terminal determines, according to the current broadcast script, that the current context category is the standard context, the user's text information is invalid, that is, the corresponding voice information is invalid speech; the voice information can be filtered out directly, and the terminal continues to broadcast the current script without interrupting the voice broadcast.
If the terminal determines, according to the current broadcast script, that the current context category is the key sentence context, it needs to further identify whether the user's voice information affects the voice broadcast process. The terminal can acquire the interrupt strategy corresponding to the key sentence context and identify, according to the acquired strategy, whether the voice broadcast needs to be interrupted. If it does, the terminal immediately stops the voice broadcast, performs intention recognition on the text information, and makes a corresponding response operation according to the intention recognition result. The response operation may include replying to the user, jumping to the corresponding voice broadcast node, continuing the voice broadcast, or other operations.
In this embodiment, by determining the current context category and performing semantic recognition on the text information according to it, the interrupt strategy corresponding to the same filter word under different context categories can be selected, whether the voice broadcast needs to be interrupted can be judged more accurately, mistaken interruption of the terminal's voice broadcast is effectively prevented, and the service communication efficiency between the terminal and the user is further improved.
In one embodiment, determining the current context category according to the current broadcast script includes: identifying whether a key sentence exists in the current broadcast script; if a key sentence exists, determining the current context category as the key sentence context; and if no key sentence exists, determining the current context category as the standard context.
If a filter word exists in the text information corresponding to the acquired voice information, the terminal acquires the current broadcast script corresponding to the voice information. The current broadcast script refers to the fixed conversation information being broadcast by the terminal at the current moment. The terminal can perform key sentence recognition on the current broadcast script to determine the current context category. Specifically, a key sentence marking strategy is preset in the terminal. The key sentence marking strategy is used to add key sentence marks to sentences in the current broadcast script. Key sentences may include several kinds of information, such as questions and important information that needs to be confirmed by the user. For example, a question may be "May I ask, are you Li Si?", and important information that needs the user's confirmation may be "The user Li Si has arrears of 100 dollars." The terminal can first split the current broadcast script into a number of complete sentences according to the end-of-sentence punctuation marks in the script, that is, the punctuation marks such as "。" and "？" that indicate the end of a sentence. The terminal then identifies whether a key sentence exists among these complete sentences according to the key sentence marking strategy. If a key sentence exists, the terminal determines the current context category as the key sentence context. If no key sentence exists, the terminal determines the current context category as the standard context.
For example, the current broadcast script of the terminal is "Hello, this is Zhangsan Bank. May I ask, are you Li Si?". The terminal can split the current broadcast script into the two sentences "Hello, this is Zhangsan Bank." and "May I ask, are you Li Si?" according to the punctuation marks. The terminal performs key sentence recognition on the two split sentences according to the key sentence marking strategy, identifies "May I ask, are you Li Si?" as a key sentence, and adds a key sentence mark to it. The terminal therefore determines the current context category as the key sentence context.
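The sentence splitting and key sentence marking described above can be sketched as follows. Marking a sentence as key simply when it ends with a question mark is an illustrative assumption; the embodiment also allows other marking rules, e.g. for information requiring user confirmation.

```python
import re

def split_sentences(script):
    # Split the broadcast script on end-of-sentence punctuation marks.
    return [s for s in re.split(r"[。？！.?!]", script) if s.strip()]

def is_key_sentence(sentence, script):
    # Hypothetical marking rule: a sentence is "key" if it was followed by a
    # question mark in the original script. (Assumes each sentence occurs once.)
    idx = script.find(sentence) + len(sentence)
    return idx < len(script) and script[idx] in "？?"

def context_category(script):
    # Key sentence context if any sentence in the current script is key;
    # otherwise the standard context.
    if any(is_key_sentence(s, script) for s in split_sentences(script)):
        return "key"
    return "standard"
```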
In this embodiment, since the boundaries of the filter words are fuzzy, the semantics may be different under different context categories. Therefore, the terminal identifies the key sentence of the current voice broadcasting speech technology to determine the current context category, and the semantic identification accuracy can be improved. In addition, the key sentence identification mode is simple and effective, and the current context category can be quickly determined so as to realize the semantic identification of the text information.
In one embodiment, as shown in fig. 4, in step 304, performing semantic recognition on the text information according to the current context category, and obtaining a semantic recognition result includes:
step 402, if the current context type is the key sentence context, obtaining the key sentence in the current voice broadcast talk.
Step 404, determining a time sequence relation between the key sentence and a preset filter word in the text information.
And 406, performing semantic recognition on the text information according to the time sequence relation to obtain a semantic recognition result.
After determining the current context category, the terminal can perform semantic recognition on the text information according to the current context category. The current context category may include a key sentence context, where the key sentence context refers to a context in which the terminal acquires voice information of the user when broadcasting the key sentence, and the voice information may affect a voice broadcasting process of the terminal.
If the terminal determines from the current voice broadcast script that the current context category is the key sentence context, the terminal needs to further identify whether the user's voice information can affect its voice broadcast process. Specifically, the terminal searches the complete sentences corresponding to the current voice broadcast script for the sentence carrying the key sentence mark. The terminal can then determine the time sequence relation between the key sentence and the preset filter word in the text information by comparing the time at which the text information was acquired with the broadcast time of the key sentence. The time sequence relation refers to the order of those times. If the terminal acquires text information containing a filter word before the key sentence is broadcast, the text information is invalid, and the terminal determines it to be non-interrupt information. If the terminal acquires text information containing a filter word after the key sentence is broadcast, the text information is meaningful, and the terminal determines it to be interrupt information.
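A minimal sketch of this time sequence rule, assuming timestamps expressed as float seconds since the start of the call and an illustrative filter-word list (neither representation is specified by the patent):

```python
# Sketch of the time sequence relation described above: text containing a
# preset filter word counts as interrupt information only when it is acquired
# at or after the moment the key sentence begins broadcasting. Float-second
# timestamps and the filter-word list are illustrative assumptions.

FILTER_WORDS = {"um", "ah", "ok", "okay"}

def recognise_by_timing(text, text_acquired_at, key_sentence_broadcast_at):
    """Return 'interrupt' or 'non_interrupt' for the user's text according
    to its time sequence relation with the key sentence."""
    has_filter_word = bool(set(text.lower().split()) & FILTER_WORDS)
    if not has_filter_word:
        return "interrupt"            # meaningful speech always interrupts
    if text_acquired_at >= key_sentence_broadcast_at:
        return "interrupt"            # acquired after the key sentence
    return "non_interrupt"            # acquired before: invalid, discard
```

For example, "ok" heard two seconds after the key sentence starts is interrupt information, while the same word heard before it is filtered out.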
For example, if the current context category is the key sentence context, the terminal obtains the two sentences split from the current voice broadcast script, namely "Hello, this is Bank Three." and "May I ask, are you Li Si?". The key sentence is "May I ask, are you Li Si?". If the terminal acquires the user's voice information while broadcasting "Hello, this is Bank Three.", the corresponding text information is determined to be non-interrupt information. If the terminal acquires the user's voice information while broadcasting "May I ask, are you Li Si?", the corresponding text information is determined to be interrupt information; at that moment the terminal immediately interrupts the voice broadcast and records the interruption position.
In this embodiment, by determining the time sequence relation between the key sentence and the preset filter word in the text information, the terminal further performs semantic recognition on the text information in the key sentence context, and can comprehensively consider the various situations that arise in the key sentence context of a real application scenario. This further improves semantic recognition accuracy, effectively improves the accuracy of user intention recognition, and thereby improves service communication efficiency.
In another embodiment, as shown in fig. 5, there is provided a voice interrupt processing method including the steps of:
step 502, in the voice broadcasting process, acquiring the voice information of the user.
Step 504, performing text conversion on the voice information to obtain text information.
Step 506, identifying whether a filter word exists in the text information. If yes, go to step 508, otherwise go to step 510.
Step 508, acquiring the current voice broadcast script corresponding to the voice information, and determining the current context category according to the current voice broadcast script. Step 512 and step 514 are performed according to the current context category, respectively.
Step 510, interrupting the voice broadcast.
Step 512, if the current context category is the key sentence context, locating the key sentence in the current voice broadcast script, determining the time sequence relation between the key sentence and the preset filter word in the text information, performing semantic recognition on the text information according to the time sequence relation to obtain a semantic recognition result, and interrupting the voice broadcast if the semantic recognition result is interrupt information.
Step 514, if the current context category is the standard context, determining that the text information is invalid information, taking the invalid information as the semantic recognition result, filtering the text information according to the semantic recognition result, and continuing to broadcast the current voice broadcast script without interrupting the voice broadcast.
If the terminal identifies that a filter word exists in the text information, it acquires the current voice broadcast script corresponding to the voice information and determines the current context category from that script. If the current context category is the standard context, the user's voice information does not affect the terminal's voice broadcast process. The terminal determines the text information to be invalid information and takes the invalid information as the semantic recognition result; the voice information corresponding to the text information is therefore invalid voice. The terminal can directly filter out the voice information without interrupting the voice broadcast, and continues to broadcast the current voice broadcast script. If the terminal identifies that no filter word exists in the text information, the voice broadcast can be interrupted directly.
In this embodiment, if the current context category is the standard context, the text information is determined to be invalid information, the corresponding voice information is filtered out, and the current voice broadcast script continues to be broadcast. Voice information can thus be filtered when the user utters invalid speech, avoiding mistaken interruption of the terminal's voice broadcast, unsmooth service communication, and even disruption of subsequent dialogue flow, thereby improving service communication efficiency.
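The combined decision logic of Figure 5 (steps 506 to 514) might be condensed as follows. The context-category strings and passing the timing comparison in as a boolean are simplifying assumptions made for illustration:

```python
# Condensed sketch of the Figure 5 flow described above. Helper names, the
# filter-word list, and the boolean timing flag are illustrative assumptions.

FILTER_WORDS = {"um", "ah", "ok", "okay"}

def handle_user_speech(text, context_category, heard_after_key_sentence):
    """Decide whether the user's speech interrupts the voice broadcast."""
    has_filter_word = bool(set(text.lower().split()) & FILTER_WORDS)
    if not has_filter_word:
        return "interrupt"                 # meaningful speech: step 510
    if context_category == "standard_context":
        return "continue"                  # invalid info: filter it, step 514
    # key sentence context: the time sequence relation decides (step 512)
    return "interrupt" if heard_after_key_sentence else "continue"
```

A filter-word-only utterance in the standard context is discarded, while the same utterance after the key sentence triggers an interruption.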
In one embodiment, after interrupting the voice broadcast if the semantic recognition result is interruption information, the method further includes: acquiring complete voice corresponding to the voice information, and performing text conversion on the complete voice corresponding to the voice information to obtain a text to be recognized; inputting a text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result; and executing corresponding response operation according to the intention recognition result.
If the semantic recognition result obtained by the terminal is interrupt information, the terminal interrupts the voice broadcast and acquires the user's complete voice, ensuring that a complete sentence is obtained, which facilitates accurate subsequent recognition of the user's intention and execution of the corresponding response operation. The terminal converts the user's complete voice into a corresponding text to be recognized; the conversion method is Automatic Speech Recognition (ASR). The terminal stores an intention recognition model in advance, obtained by training on a large number of voice samples. The intention recognition model may be a convolutional neural network model and may include multiple network layers, for example an input layer, an attention layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. The terminal calls the intention recognition model, inputs the text to be recognized, performs a prediction operation on it through the model, and outputs an intention recognition result. An intention type may or may not be present in the intention recognition result. The terminal then executes the corresponding response operation according to the intention recognition result.
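The post-interruption flow (ASR, then the pre-trained intention recognition model) could be wired up as in this sketch. The `transcribe` stub and the keyword-matching stand-in for the CNN are assumptions: the patent describes the model's layer structure but not its interface.

```python
# Hedged sketch of the post-interruption flow: ASR on the complete utterance,
# then intent prediction. transcribe() and IntentModel are stand-ins; a real
# system would call an ASR service and a trained CNN here.

def transcribe(complete_audio):
    """Stand-in for automatic speech recognition (ASR)."""
    return complete_audio["transcript"]    # assume the audio carries its text

class IntentModel:
    """Stand-in for the pre-trained CNN intention recognition model (input,
    attention, convolutional, pooling, fully connected, output layers)."""
    def __init__(self, keyword_to_intent):
        self.keyword_to_intent = keyword_to_intent

    def predict(self, text):
        lowered = text.lower()
        for keyword, intent in self.keyword_to_intent.items():
            if keyword in lowered:
                return {"intent": intent}
        return {}                          # no intention type present

def recognise_intent(complete_audio, model):
    """Convert complete voice to text, then run the intention model."""
    return model.predict(transcribe(complete_audio))
```

The empty-dict result models the "no intention type present" branch that leads the terminal to resume broadcasting from the interruption position.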
In one embodiment, executing the corresponding response operation according to the intention recognition result includes: if an intention type exists in the intention recognition result, broadcasting reply information corresponding to the intention type or jumping from the node corresponding to the current voice broadcast script to the node corresponding to the intention type; and if no intention type exists in the intention recognition result, continuing the voice broadcast from the interruption position of the current voice broadcast script. Specifically, if an intention type exists in the intention recognition result, the terminal may extract reply information corresponding to the intention type from a database and broadcast it, jump from the node corresponding to the current voice broadcast script to the node corresponding to the intention type, or trigger another voice interaction instruction; this embodiment is not limited in this respect. The user's voice information can thus be responded to in time. If no intention type exists in the intention recognition result, indicating some other operation request of the user, the terminal can resume the voice broadcast from the interruption position of the current voice broadcast script, i.e. the position recorded when the broadcast was interrupted. The terminal does not need to restart the voice broadcast, which improves voice interaction efficiency. If the intention recognition result indicates that the user requests a replay, the terminal may broadcast from the beginning.
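The three response behaviours named in this paragraph can be sketched as a dispatcher; the reply lookup table and node identifiers are hypothetical, not taken from the patent:

```python
# Sketch of the response dispatch described above. The replies table, node
# identifiers, and resume mechanics are illustrative assumptions; the patent
# only names the three behaviours.

def respond(intent_result, replies, interrupt_position):
    """intent_result: dict with an optional 'intent' key (assumed shape)."""
    intent = intent_result.get("intent")
    if intent is not None:
        if intent in replies:
            return ("broadcast_reply", replies[intent])
        return ("jump_to_node", intent)    # jump to the node for this intent
    # no intention type recognised: resume from the recorded interruption point
    return ("resume_broadcast", interrupt_position)
```

Keeping the interruption position as an explicit argument mirrors the patent's point that the broadcast resumes mid-script rather than restarting.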
In this embodiment, after the voice broadcast is interrupted, the terminal acquires the complete voice of the user, converts the complete voice into the text to be recognized, and performs the intention recognition on the text to be recognized through the intention recognition model, so that the accuracy of the intention recognition can be improved. The terminal executes corresponding response operation according to the intention recognition result, and can timely and accurately respond to the voice information of the user, so that the service communication efficiency can be improved.
In another embodiment, as shown in fig. 6, there is provided a voice interrupt processing method including the steps of:
step 602, in the voice broadcasting process, acquiring the voice information of the user.
Step 604, performing text conversion on the voice information to obtain text information.
Step 606, identifying whether a filter word exists in the text information. If so, go to step 608; otherwise, perform steps 610 to 618.
Step 608, acquiring the current voice broadcast script corresponding to the voice information, and determining the current context category according to the current voice broadcast script. Step 622 and step 624 are performed according to the current context category, respectively.
Step 610, interrupting the voice broadcast.
Step 612, acquiring a complete voice corresponding to the voice information, and performing text conversion on the complete voice to obtain a text to be recognized.
And 614, inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result.
Step 616, detecting whether an intention type exists in the intention recognition result. If yes, go to step 618; otherwise, go to step 620.
Step 618, broadcasting the reply information corresponding to the intention type or jumping from the node corresponding to the current voice broadcast script to the node corresponding to the intention type.
Step 620, continuing the voice broadcast from the interruption position of the current voice broadcast script.
Step 622, if the current context category is the key sentence context, locating the key sentence in the current voice broadcast script, determining the time sequence relation between the key sentence and the preset filter word in the text information, performing semantic recognition on the text information according to the time sequence relation to obtain a semantic recognition result, and interrupting the voice broadcast if the semantic recognition result is interrupt information. Steps 612 to 620 are then performed.
Step 624, if the current context category is the standard context, determining that the text information is invalid information, taking the invalid information as the semantic recognition result, filtering the text information according to the semantic recognition result, and continuing to broadcast the current voice broadcast script without interrupting the voice broadcast.
After the terminal collects the voice information, the voice information can be converted into corresponding text information so that the terminal can identify whether a filter word exists in it. The filter word is used to determine whether the voice broadcast needs to be interrupted. Filter words may include modal particles and acknowledgment words, for example modal particles such as "um" and "ah", and acknowledgment words such as "OK" and "got it".
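A minimal filter-word check along these lines; the word list and simple substring matching are illustrative assumptions only:

```python
# Sketch of the filter-word check described above. The word list and the
# substring-matching strategy are assumptions made for illustration.

FILTER_WORDS = ("um", "ah", "ok", "got it")

def contains_filter_word(text_information):
    """True if the converted text contains any preset filter word."""
    lowered = text_information.lower()
    return any(word in lowered for word in FILTER_WORDS)
```

A production system would likely tokenize and match against a configurable lexicon rather than raw substrings, but the decision it feeds is the same: filter word present or absent.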
If the terminal identifies that no filter word exists in the text information, the voice information is meaningful information, and the terminal needs to interrupt the voice broadcast and acquire the user's complete voice, ensuring that a complete sentence is obtained, which facilitates accurate subsequent recognition of the user's intention and execution of the corresponding response operation. The terminal converts the user's complete voice into a corresponding text to be recognized; the conversion method is Automatic Speech Recognition (ASR). The terminal stores an intention recognition model in advance, obtained by training on a large number of voice samples. The intention recognition model may be a convolutional neural network model and may include multiple network layers, for example an input layer, an attention layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. The terminal calls the intention recognition model, inputs the text to be recognized, performs a prediction operation on it through the model, and outputs an intention recognition result. The terminal then detects whether an intention type exists in the intention recognition result. If an intention type exists, the terminal broadcasts reply information corresponding to the intention type or jumps from the node corresponding to the current voice broadcast script to the node corresponding to the intention type.
If no intention type exists in the intention recognition result, indicating some other operation request of the user, the terminal can resume the voice broadcast from the interruption position of the current voice broadcast script, i.e. the position recorded when the broadcast was interrupted. The terminal does not need to restart the voice broadcast, which improves voice interaction efficiency. In this embodiment, whether the terminal identifies that no filter word exists in the text information, or that a filter word exists and the semantic recognition result obtained after semantic recognition of the text is interrupt information, the terminal needs to interrupt the voice broadcast, and the processing steps after the interruption may be the same.
In this embodiment, if the terminal recognizes that no filter word exists in the text information, the voice broadcast can be directly interrupted, the acquired complete voice is converted into a text to be recognized, and the text to be recognized is subjected to intention recognition through the intention recognition model, so that the accuracy of intention recognition can be improved. The terminal executes corresponding response operation according to the intention recognition result, and can timely and accurately respond to the voice information of the user, so that the service communication efficiency can be improved.
It should be understood that although the steps in the flowcharts of figs. 2 to 6 are shown sequentially as indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not subject to a strict order restriction and may be performed in other orders. Moreover, at least some of the steps in figs. 2 to 6 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a voice interrupt processing apparatus including: an obtaining module 702, a text conversion module 704, an information recognition module 706, a semantic recognition module 708, and a voice control module 710, wherein:
the obtaining module 702 is configured to obtain voice information of a user in a voice broadcast process.
And the text conversion module 704 is configured to perform text conversion on the voice information to obtain text information.
And the information identification module 706 is used for identifying whether the filter words exist in the text information.
The semantic recognition module 708 is configured to, if a filter word exists, acquire the current voice broadcast script corresponding to the voice information, and perform semantic recognition on the text information according to the current voice broadcast script to obtain a semantic recognition result.
And the voice control module 710 is configured to interrupt the voice broadcast if the semantic recognition result is the interrupt information.
In one embodiment, the semantic recognition module 708 is further configured to determine a current context category according to the current voice broadcast script, and perform semantic recognition on the text information according to the current context category to obtain a semantic recognition result.
In one embodiment, the semantic recognition module 708 is further configured to identify whether a key sentence exists in the current voice broadcast script; if a key sentence exists, determine that the current context category is the key sentence context; and if no key sentence exists, determine that the current context category is the standard context.
In one embodiment, the semantic recognition module 708 is further configured to, if the current context category is the key sentence context, obtain the key sentence in the current voice broadcast script; determine the time sequence relation between the key sentence and the preset filter word in the text information; and perform semantic recognition on the text information according to the time sequence relation to obtain a semantic recognition result.
In one embodiment, the semantic recognition module 708 is further configured to determine that the text information is invalid information if the current context category is the standard context, and take the invalid information as the semantic recognition result.
The voice control module 710 is further configured to filter the voice information according to the semantic recognition result, so as to continue broadcasting the current voice broadcast script without interrupting the voice broadcast.
In one embodiment, the above apparatus further comprises:
the obtaining module 702 is further configured to obtain a complete voice corresponding to the voice information.
The text conversion module 704 is further configured to perform text conversion on the complete speech to obtain a text to be recognized.
And the intention recognition module is used for inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result.
And the response module is used for executing corresponding response operation according to the intention recognition result.
In one embodiment, the above apparatus further comprises:
the voice control module 710 is further configured to interrupt voice broadcast if the preset filter word does not exist in the text message.
The obtaining module 702 is further configured to obtain a complete voice corresponding to the voice information.
The text conversion module 704 is further configured to perform text conversion on the complete speech to obtain a text to be recognized.
And the intention recognition module is used for inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result.
And the response module is used for executing corresponding response operation according to the intention recognition result.
In one embodiment, the response module is further configured to, if an intention type exists in the intention recognition result, broadcast reply information corresponding to the intention type or jump from the node corresponding to the current voice broadcast script to the node corresponding to the intention type; and if no intention type exists in the intention recognition result, continue the voice broadcast from the interruption position of the current voice broadcast script.
For the specific limitation of the voice interrupt processing apparatus, reference may be made to the above limitation on the voice interrupt processing method, which is not described herein again. The various modules in the above-described speech interrupt processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech interrupt processing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the voice interrupt processing method provided in the above embodiments when the processor executes the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the steps of the voice interrupt processing method provided in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (11)

1. A method of speech interrupt processing, the method comprising:
in the voice broadcasting process, acquiring voice information of a user;
performing text conversion on the voice information to obtain text information;
identifying whether a filter word exists in the text information;
if the filter word exists, acquiring a current voice broadcast script corresponding to the voice information, and performing semantic recognition on the text information according to the current voice broadcast script to obtain a semantic recognition result;
and if the semantic recognition result is interrupt information, interrupting the voice broadcast.
2. The method of claim 1, wherein performing semantic recognition on the text information according to the current voice broadcast script to obtain a semantic recognition result comprises:
determining a current context category according to the current voice broadcast script;
and performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result.
3. The method of claim 2, wherein determining a current context category according to the current voice broadcast script comprises:
identifying whether a key sentence exists in the current voice broadcast script;
if a key sentence exists in the current voice broadcast script, determining that the current context category is a key sentence context;
and if no key sentence exists in the current voice broadcast script, determining that the current context category is the standard context.
4. The method according to any one of claims 2 to 3, wherein the performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result comprises:
if the current context category is a key sentence context, acquiring a key sentence in the current voice broadcast script;
determining the time sequence relation between the key sentence and a preset filter word in the text information;
and performing semantic recognition on the text information according to the time sequence relation to obtain a semantic recognition result.
5. The method according to any one of claims 2 to 3, wherein the performing semantic recognition on the text information according to the current context category to obtain a semantic recognition result comprises:
if the current context type is the standard context, determining that the text information is invalid information, and taking the invalid information as a semantic recognition result;
the method further comprises the following steps:
and filtering the voice information according to the semantic recognition result, and continuing to broadcast the current voice broadcast script without interrupting the voice broadcast.
6. The method according to claim 1, wherein after interrupting the voice broadcast if the semantic recognition result is interruption information, the method further comprises:
acquiring complete voice corresponding to the voice information, and performing text conversion on the complete voice to obtain a text to be recognized;
inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result;
and executing corresponding response operation according to the intention recognition result.
7. The method of claim 1, further comprising:
if the preset filter words do not exist in the text information, interrupting voice broadcasting and acquiring complete voice corresponding to the voice information;
performing text conversion on the complete voice to obtain a text to be recognized;
inputting the text to be recognized into a pre-trained intention recognition model to obtain an intention recognition result;
and executing corresponding response operation according to the intention recognition result.
8. The method according to any one of claims 6 to 7, wherein the performing the corresponding response operation according to the intention recognition result comprises:
if an intention type exists in the intention recognition result, broadcasting reply information corresponding to the intention type or jumping from the node corresponding to the current voice broadcast script to the node corresponding to the intention type;
and if no intention type exists in the intention recognition result, continuing the voice broadcast from the interruption position of the current voice broadcast script.
9. A voice interrupt processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring the voice information of the user in the voice broadcasting process;
the text conversion module is used for performing text conversion on the voice information to obtain text information;
the information identification module is used for identifying whether the text information contains filter words or not;
the semantic recognition module is used for acquiring a current voice broadcast script corresponding to the voice information if the filter word exists, and performing semantic recognition on the text information according to the current voice broadcast script to obtain a semantic recognition result;
and the voice control module is used for interrupting the voice broadcast if the semantic recognition result is interruption information.
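The module split of claim 9 can be mirrored, purely as an illustrative sketch, by a class whose collaborators are injected. The class name, the injected callables, and the keyword-based semantic check below are assumptions for illustration, not the patented apparatus:

```python
# Hypothetical decomposition mirroring the modules of claim 9.
# The injected callables stand in for real ASR and semantic models.

class VoiceInterruptProcessor:
    def __init__(self, asr, filter_words, semantics):
        self.asr = asr                    # acquisition + text conversion
        self.filter_words = filter_words  # information recognition data
        self.semantics = semantics        # semantic recognition module

    def process(self, audio, current_dialog):
        text = self.asr(audio)            # text conversion module
        # Information recognition module: check for filter words.
        if set(text.split()) & self.filter_words:
            # Semantic recognition module: judge the text against
            # the current voice broadcast dialog.
            if self.semantics(text, current_dialog) == "interrupt":
                return "pause"            # voice control module interrupts
        return "continue"

# Toy collaborators: an identity "ASR" and a keyword-based check.
proc = VoiceInterruptProcessor(
    asr=lambda audio: audio,
    filter_words={"um", "uh"},
    semantics=lambda text, dialog: "interrupt" if "stop" in text else "none",
)
print(proc.process("um stop now", "greeting"))  # pause
```

Injecting the collaborators keeps each module independently replaceable, which is the point of the module decomposition in claim 9.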
10. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202011213393.6A 2020-11-04 2020-11-04 Voice interrupt processing method and device, computer equipment and storage medium Active CN112037799B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213393.6A CN112037799B (en) 2020-11-04 2020-11-04 Voice interrupt processing method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN112037799A true CN112037799A (en) 2020-12-04
CN112037799B CN112037799B (en) 2021-04-06

Family

ID=73573619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213393.6A Active CN112037799B (en) 2020-11-04 2020-11-04 Voice interrupt processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112037799B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112637431A (en) * 2020-12-10 2021-04-09 出门问问(苏州)信息科技有限公司 Voice interaction method and device and computer readable storage medium
CN112786041A (en) * 2020-12-23 2021-05-11 平安普惠企业管理有限公司 Voice processing method and related equipment
CN112786041B (en) * 2020-12-23 2023-11-24 光禹莱特数字科技(上海)有限公司 Voice processing method and related equipment
CN113779208A (en) * 2020-12-24 2021-12-10 北京汇钧科技有限公司 Method and device for man-machine conversation
CN113113013A (en) * 2021-04-15 2021-07-13 北京帝派智能科技有限公司 Intelligent voice interaction interruption processing method, device and system
CN113113013B (en) * 2021-04-15 2022-03-18 北京帝派智能科技有限公司 Intelligent voice interaction interruption processing method, device and system
CN113364669A (en) * 2021-06-02 2021-09-07 中国工商银行股份有限公司 Message processing method and device, electronic equipment and medium
CN113345437A (en) * 2021-08-06 2021-09-03 百融云创科技股份有限公司 Voice interruption method and device
CN113345437B (en) * 2021-08-06 2021-10-29 百融云创科技股份有限公司 Voice interruption method and device
CN115168563A (en) * 2022-09-05 2022-10-11 深圳市华付信息技术有限公司 Airport service guiding method, system and device based on intention recognition
CN115168563B (en) * 2022-09-05 2022-12-20 深圳市华付信息技术有限公司 Airport service guiding method, system and device based on intention recognition
CN117496973A (en) * 2024-01-02 2024-02-02 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience
CN117496973B (en) * 2024-01-02 2024-03-19 四川蜀天信息技术有限公司 Method, device, equipment and medium for improving man-machine conversation interaction experience

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080154601A1 (en) * 2004-09-29 2008-06-26 Microsoft Corporation Method and system for providing menu and other services for an information processing system using a telephone or other audio interface
US20170186425A1 (en) * 2015-12-23 2017-06-29 Rovi Guides, Inc. Systems and methods for conversations with devices about media using interruptions and changes of subjects
CN109977218A (en) * 2019-04-22 2019-07-05 浙江华坤道威数据科技有限公司 A kind of automatic answering system and method applied to session operational scenarios
CN110853638A (en) * 2019-10-23 2020-02-28 吴杰 Method and equipment for interrupting voice robot in real time in voice interaction process
CN111326154A (en) * 2020-03-02 2020-06-23 珠海格力电器股份有限公司 Voice interaction method and device, storage medium and electronic equipment
CN111540349A (en) * 2020-03-27 2020-08-14 北京捷通华声科技股份有限公司 Voice interruption method and device
CN111752523A (en) * 2020-05-13 2020-10-09 深圳追一科技有限公司 Human-computer interaction method and device, computer equipment and storage medium
CN111816172A (en) * 2019-04-10 2020-10-23 阿里巴巴集团控股有限公司 Voice response method and device



Also Published As

Publication number Publication date
CN112037799B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
CN110765244B (en) Method, device, computer equipment and storage medium for obtaining answering operation
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
CN110444198B (en) Retrieval method, retrieval device, computer equipment and storage medium
CN112492111B (en) Intelligent voice outbound method, device, computer equipment and storage medium
CN110177182B (en) Sensitive data processing method and device, computer equipment and storage medium
CN111311327A (en) Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN109065052B (en) Voice robot
CN110266900B (en) Method and device for identifying customer intention and customer service system
CN110689881A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111883140A (en) Authentication method, device, equipment and medium based on knowledge graph and voiceprint recognition
CN111752523A (en) Human-computer interaction method and device, computer equipment and storage medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN111209380B (en) Control method and device for conversation robot, computer equipment and storage medium
CN112017663A (en) Voice generalization method and device and computer storage medium
CN113840040B (en) Man-machine cooperation outbound method, device, equipment and storage medium
CN114493902A (en) Multi-mode information anomaly monitoring method and device, computer equipment and storage medium
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN112102807A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113111157B (en) Question-answer processing method, device, computer equipment and storage medium
CN111581338B (en) Agricultural technical service robot man-machine fusion consultation question-answering method and system
CN113506565A (en) Speech recognition method, speech recognition device, computer-readable storage medium and processor
CN112738344A (en) Method and device for identifying user identity, storage medium and electronic equipment
CN113190660A (en) Questionnaire survey method and device
CN113282708B (en) Method and device for replying to robot dialog, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant