CN111540349A - Voice interruption method and device - Google Patents


Info

Publication number
CN111540349A
Authority
CN
China
Prior art keywords
preset
interruption
voice
user
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010232214.7A
Other languages
Chinese (zh)
Other versions
CN111540349B (en)
Inventor
郑鑫哲
李健
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202010232214.7A
Publication of CN111540349A
Application granted
Publication of CN111540349B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An embodiment of the invention provides a voice interruption method and device. The method comprises: when user voice uttered by a user is received while a broadcast voice is being played, acquiring the current playing duration of the broadcast voice; recognizing the user voice to obtain a recognition result; and interrupting the broadcast voice being played, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter. In the embodiment of the invention, whether playback of the broadcast voice needs to be interrupted is decided by rule-based checks on the recognition result, so it can be effectively determined whether the user voice should interrupt the broadcast voice; meanwhile, interaction requirements in different scenarios can be met by adjusting the preset parameters.

Description

Voice interruption method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech interruption method and a speech interruption device.
Background
When human-machine interaction takes place in intelligent outbound-call and intelligent navigation scenarios, the outbound robot needs to imitate a normal conversation between people so that the customer experiences something close to person-to-person communication: it should remain silent while the customer is speaking, answer after the customer has finished asking a question, and stop broadcasting promptly if the customer interrupts during a broadcast.
In current voice interruption interaction flows, the logic that decides when to stop a TTS (Text To Speech) broadcast is difficult to control. For example, deciding whether to interrupt purely from the speech recognition side depends heavily on how the recognition engine handles noise and short sounds, which leads to false interruptions or continual interruptions; deciding by natural language processing adds a large delay to the response time of the whole interaction.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a speech interruption method and a corresponding speech interruption apparatus that overcome or at least partially solve the above-mentioned problems.
In order to solve the above problem, an embodiment of the present invention discloses a speech interruption method, including:
when user voice uttered by a user is received while a broadcast voice is being played, acquiring the current playing duration of the broadcast voice;
recognizing the user voice to obtain a recognition result;
and interrupting the broadcast voice being played, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter.
Optionally, the step of interrupting the broadcast voice being played, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter includes:
generating an interruption identifier according to the recognition result and the preset judgment rule for the preset parameter;
determining an interruption time according to the current playing duration and the preset judgment rule for the preset parameter;
and interrupting the broadcast voice being played, using the interruption time and the interruption identifier.
Optionally, the recognition result comprises a user voice word count; the preset judgment rule for the preset parameter comprises: a rule for judging whether the user voice word count is greater than or equal to a first preset word count threshold; the interruption identifier comprises a first interruption identifier; the step of generating an interruption identifier according to the recognition result and the preset judgment rule for the preset parameter includes:
judging whether the user voice word count is greater than or equal to the first preset word count threshold;
and if so, generating the first interruption identifier.
Optionally, the recognition result further comprises user voice semantics; the preset judgment rule for the preset parameter further comprises: a rule for judging whether the user voice semantics match first preset semantics; the interruption identifier further comprises a second interruption identifier; the method further comprises:
when the user voice word count is less than the first preset word count threshold, matching the user voice semantics against the first preset semantics;
and when the matching succeeds, generating the second interruption identifier.
Optionally, the preset judgment rule for the preset parameter further includes: a rule for judging whether the user voice word count is greater than or equal to a second preset word count threshold and whether the user voice semantics fail to match second preset semantics; the interruption identifier further comprises a third interruption identifier; the method further comprises:
when the user voice semantics do not match the first preset semantics, judging whether the user voice word count is greater than or equal to the second preset word count threshold;
if so, matching the user voice semantics against the second preset semantics;
and when the matching fails, generating the third interruption identifier.
Optionally, the preset parameter further includes an allowable interruption duration; the preset judgment rule for the preset parameter further comprises: a rule for judging whether the current playing duration is greater than or equal to the preset allowable interruption duration; the interruption time comprises a first interruption time; the step of determining the interruption time according to the current playing duration and the preset judgment rule for the preset parameter comprises:
judging whether the current playing duration is greater than or equal to the preset allowable interruption duration;
if so, determining the identifier generation time at which the interruption identifier is generated;
and determining the identifier generation time as the first interruption time.
Optionally, the interruption time further includes a second interruption time; the method further comprises:
when the current playing duration is less than the allowable interruption duration, determining the moment at which the playing duration of the broadcast voice equals the allowable interruption duration as the second interruption time.
The embodiment of the invention also discloses a voice interruption device, which comprises:
a current playing duration obtaining module, configured to obtain the current playing duration of a broadcast voice when user voice uttered by a user is received while the broadcast voice is being played;
a recognition module, configured to recognize the user voice to obtain a recognition result;
and an interruption module, configured to interrupt the broadcast voice being played, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter.
Optionally, the interruption module comprises:
an interruption identifier generation submodule, configured to generate an interruption identifier according to the recognition result and the preset judgment rule for the preset parameter;
an interruption time determining submodule, configured to determine an interruption time according to the current playing duration and the preset judgment rule for the preset parameter;
and an interruption submodule, configured to interrupt the broadcast voice being played, using the interruption time and the interruption identifier.
Optionally, the recognition result comprises a user voice word count; the preset judgment rule for the preset parameter comprises a rule for judging whether the user voice word count is greater than or equal to a first preset word count threshold; the interruption identifier comprises a first interruption identifier; the interruption identifier generation submodule comprises:
a first preset word count threshold judging unit, configured to judge whether the user voice word count is greater than or equal to the first preset word count threshold;
and a first interruption identifier generating unit, configured to generate the first interruption identifier if so.
Optionally, the recognition result further comprises user voice semantics; the preset judgment rule for the preset parameter further comprises: a rule for judging whether the user voice semantics match first preset semantics; the interruption identifier further comprises a second interruption identifier; the interruption identifier generation submodule further includes:
a first preset semantics matching unit, configured to match the user voice semantics against the first preset semantics when the user voice word count is less than the first preset word count threshold;
and a second interruption identifier generation unit, configured to generate the second interruption identifier when the matching succeeds.
Optionally, the preset judgment rule for the preset parameter further includes: a rule for judging whether the user voice word count is greater than or equal to a second preset word count threshold and whether the user voice semantics fail to match second preset semantics; the interruption identifier further comprises a third interruption identifier; the interruption identifier generation submodule further includes:
a second preset word count threshold judging unit, configured to judge, when the user voice semantics do not match the first preset semantics, whether the user voice word count is greater than or equal to the second preset word count threshold;
a second preset semantics matching unit, configured to match the user voice semantics against the second preset semantics if so;
and a third interruption identifier generation unit, configured to generate the third interruption identifier when the matching fails.
Optionally, the preset parameter further includes an allowable interruption duration; the preset judgment rule for the preset parameter further comprises: a rule for judging whether the current playing duration is greater than or equal to the preset allowable interruption duration; the interruption time comprises a first interruption time; the interruption time determining submodule includes:
a judging unit, configured to judge whether the current playing duration is greater than or equal to the preset allowable interruption duration;
an identifier generation time determination unit, configured to determine the identifier generation time at which the interruption identifier is generated;
and a first interruption time determination unit, configured to determine the identifier generation time as the first interruption time.
Optionally, the interruption time further includes a second interruption time; the interruption time determining submodule further includes:
a second interruption time determining unit, configured to determine, when the current playing duration is less than the allowable interruption duration, the moment at which the playing duration of the broadcast voice equals the allowable interruption duration as the second interruption time.
An embodiment of the invention also discloses a device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the voice interruption method described in any of the above.
The embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when being executed by a processor, the computer program realizes the steps of the voice interruption method.
The embodiment of the invention has the following advantages: when user voice uttered by a user is received while a broadcast voice is being played, the current playing duration of the broadcast voice is obtained; the received user voice is recognized to obtain a recognition result; and the broadcast voice being played can then be interrupted, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter. Whether playback of the broadcast voice needs to be interrupted is decided by rule-based checks on the recognition result, so it can be effectively determined whether the user voice should interrupt the broadcast voice; meanwhile, interaction requirements in different scenarios can be met by adjusting the preset parameters.
Drawings
FIG. 1 is a flow chart of the steps of a first embodiment of a speech interruption method of the present invention;
FIG. 2 is a flow chart of the steps of a second embodiment of a speech interruption method of the present invention;
FIG. 3 is a flow chart of a speech interruption method embodiment of the present invention;
FIG. 4 is a block diagram of a speech interruption apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In intelligent outbound-call and navigation solutions, intelligent semantic interaction involves logic for interrupting the TTS (Text To Speech) broadcast when the user speaks, so as to give the called user a more intelligent and natural experience. Current voice interruption logic is mainly implemented in the following two ways:
1) pure speech detection
The interruption logic of this approach depends heavily on the speech recognition engine's detection of human voice: interruption of the TTS (Text To Speech) broadcast is triggered as soon as voice is detected on the user side. This is quite common in current voice interaction products, but the approach suffers from the following unavoidable false interruptions:
During a telephone interaction, the acoustic environment on the customer side is mostly unpredictable. The customer may be on a noisy road or in a crowded place, so the sound received from the telephone is very likely to contain noise and loud background sound, and current speech recognition technology cannot fully filter out ambient sound, which can cause misrecognition and false interruption by the system. In addition, even where noise reduction works well, meaningless short replies such as filler sounds still trigger an interruption and degrade the overall voice interaction experience.
2) Semantic Understanding detection (NLU, Natural Language Understanding)
The interruption logic of this approach relies on natural language processing of the speech recognition result. The implementation needs an additional natural language understanding capability, and interruption is triggered only when semantic understanding confirms that the customer's current intention is to interrupt.
This approach solves the false interruption problem to some extent. However, from the system implementation side, the text result of speech recognition has to be passed once more to a semantic understanding product, which adds considerable delay to the response time of the whole product; moreover, natural language processing rules generally require resources to be reloaded, so the logic controlling whether the customer's speech interrupts the broadcast cannot be freely adjusted for different scenarios.
In view of the foregoing problems, one of the core concepts of the embodiments of the present invention is to provide a voice interruption method that recognizes user voice received while a broadcast voice is being played to obtain a recognition result, and interrupts the broadcast voice being played, using the current playing duration of the broadcast voice and the recognition result, based on a preset judgment rule for a preset parameter.
Referring to fig. 1, a flowchart illustrating steps of a first embodiment of a speech interruption method according to the present invention is shown, which may specifically include the following steps:
Step 101, when user voice uttered by a user is received while a broadcast voice is being played, obtaining the current playing duration of the broadcast voice;
When human-machine interaction takes place in intelligent outbound-call and intelligent navigation scenarios, the outbound robot needs to simulate a normal conversation between people so that the customer experiences something close to person-to-person communication. This includes remaining silent while the user is speaking and answering after the user has finished asking a question. If the user speaks up to interrupt during a broadcast, playback of the broadcast voice is stopped promptly.
In the embodiment of the invention, when the outbound robot receives user voice uttered by the user while playing the broadcast voice, it can first obtain the current playing duration of the broadcast voice; this duration can be used to judge whether to interrupt the broadcast voice immediately.
Step 102, recognizing the user voice to obtain a recognition result;
After the user voice is received, it can be recognized to obtain a recognition result. For example, the user voice may be converted into text by ASR (Automatic Speech Recognition) technology, and the text analyzed to obtain information including the word count, the semantics, and the like.
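The analysis of the ASR text can be sketched as follows. This is only an illustration under stated assumptions: the names `RecognitionResult` and `analyze_transcript` are hypothetical, and counting alphanumeric characters as the word count is an assumption that suits Chinese text, not something the source specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionResult:
    text: str        # raw transcript returned by the ASR engine
    word_count: int  # number of spoken characters, punctuation excluded
    semantics: str   # normalized text used later for semantic matching


def analyze_transcript(text: str) -> RecognitionResult:
    """Derive the fields used by the judgment rules from an ASR transcript."""
    # Assumption: each alphanumeric character counts as one unit, which suits
    # Chinese text; a word-level tokenizer could be swapped in instead.
    units = [ch for ch in text if ch.isalnum()]
    normalized = "".join(units).lower()
    return RecognitionResult(text=text, word_count=len(units), semantics=normalized)
```

A real system would likely obtain the semantics from a dedicated component; here the normalized transcript stands in for it.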
Step 103, interrupting the broadcast voice being played, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter.
In the embodiment of the present invention, the preset parameters are the values against which the obtained recognition result and the current playing duration are compared, and may specifically include semantics, word count thresholds, and the like. The preset parameters and the preset judgment rule for them can be adjusted in real time according to the user's needs, to meet interaction requirements in different scenarios.
After the recognition result is obtained, the broadcast voice being played can be interrupted, using the current playing duration and the recognition result, based on the preset judgment rule for the preset parameter.
In an example, applying the preset judgment rule for the preset parameter may specifically include comparing the recognition result and the current playing duration with the respective preset parameters to obtain comparison results, and then determining whether the comparison results satisfy the preset judgment rule. If so, the broadcast voice being played can be interrupted.
In the embodiment of the invention, when user voice uttered by a user is received while a broadcast voice is being played, the current playing duration of the broadcast voice is obtained; the received user voice is recognized to obtain a recognition result; and the broadcast voice being played can then be interrupted, using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter. Whether playback of the broadcast voice needs to be interrupted is decided by rule-based checks on the recognition result, so it can be effectively determined whether the user voice should interrupt the broadcast voice; meanwhile, interruption conditions suited to different scenarios can be obtained by setting different preset parameters.
Referring to fig. 2, a flowchart illustrating steps of a second embodiment of a speech interruption method according to the present invention is shown, which may specifically include the following steps:
Step 201, when user voice uttered by a user is received while a broadcast voice is being played, obtaining the current playing duration of the broadcast voice;
In the embodiment of the invention, when the outbound robot receives user voice uttered by the user while playing the broadcast voice, it can first obtain the current playing duration of the broadcast voice; this duration can be used to judge whether to interrupt the broadcast voice immediately.
Step 202, recognizing the user voice to obtain a recognition result;
Further, after the user voice is received, it can be recognized to obtain a recognition result. For example, the user voice can be monitored through IVR (Interactive Voice Response), then converted into text by ASR (Automatic Speech Recognition) technology, and the text analyzed to obtain information including the word count, the semantics, and the like.
Step 203, generating an interruption identifier according to the recognition result and the preset judgment rule aiming at the preset parameter;
In the embodiment of the present invention, the preset parameters may include: first preset semantics, second preset semantics, a first preset word count threshold, a second preset word count threshold, and an allowable interruption duration.
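As a rough illustration, the five preset parameters could be grouped into one immutable configuration object that is swapped per scenario. The class name, field names, and all values below are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet


@dataclass(frozen=True)
class InterruptionParams:
    first_preset_semantics: FrozenSet[str]    # phrases that always interrupt
    second_preset_semantics: FrozenSet[str]   # phrases that never interrupt
    first_word_count_threshold: int           # at or above: interrupt directly
    second_word_count_threshold: int          # at or above: second semantics are checked
    allowable_interruption_duration_s: float  # playback time before interruption is allowed


# Illustrative values only; the source does not give concrete numbers.
default_params = InterruptionParams(
    first_preset_semantics=frozenset({"stop", "wait"}),
    second_preset_semantics=frozenset({"ok", "yes"}),
    first_word_count_threshold=8,
    second_word_count_threshold=2,
    allowable_interruption_duration_s=2.0,
)

# Adjusting parameters for another scenario yields a new object and leaves
# the original configuration untouched.
strict_params = replace(
    default_params,
    first_word_count_threshold=12,
    allowable_interruption_duration_s=4.0,
)
```

Keeping the configuration immutable is a design choice of this sketch: a running call keeps the parameters it started with while a new set is prepared for the next one.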
In an example, the preset determination rule for the preset parameter may specifically include a rule that compares the recognition result with the preset parameter to obtain a comparison result, and determines whether to interrupt the broadcast voice according to the comparison result.
In the embodiment of the present invention, the recognition result may include a user voice word count; the preset judgment rule for the preset parameter may include: a rule for judging whether the user voice word count is greater than or equal to a first preset word count threshold; the interruption identifier may include a first interruption identifier; thus, step 203 may comprise the following sub-steps:
S11, judging whether the user voice word count is greater than or equal to the first preset word count threshold;
S12, if so, generating the first interruption identifier.
The first preset word count threshold is a preset threshold on the user voice word count at or above which the broadcast voice being played can be interrupted directly; its value can be set according to the user's own usage.
After the user voice is recognized and the user voice word count is determined, it is judged whether the word count is greater than or equal to the first preset word count threshold. If so, the user voice is long enough to interrupt the broadcast voice directly, and the first interruption identifier is generated.
In the embodiment of the invention, the recognition result may further include user voice semantics; the preset judgment rule for the preset parameter may further include: a rule for judging whether the user voice semantics match first preset semantics; the interruption identifier further comprises a second interruption identifier; thus, step 203 may further comprise the following sub-steps:
S13, when the user voice word count is less than the first preset word count threshold, matching the user voice semantics against the first preset semantics;
S14, when the matching succeeds, generating the second interruption identifier.
The first preset semantics are keywords preset by the user; when one of them is detected in the user voice, an interruption identifier can be generated.
In one example, semantic recognition can be performed on the received user voice to obtain the user voice semantics, which are then matched against the preset semantics, and whether to generate the interruption identifier is decided by the matching result. For example, a whitelist may be preset to store a plurality of preset semantics. After the user voice semantics are obtained through recognition, they are matched against the whitelist, and when the matching succeeds, a second interruption identifier for interrupting the broadcast voice is generated.
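A minimal sketch of whitelist matching follows. The source does not say how matching is performed, so the case-insensitive substring test used here is an assumption, and `matches_preset_semantics` is a hypothetical helper name.

```python
from typing import Iterable


def matches_preset_semantics(user_semantics: str, preset: Iterable[str]) -> bool:
    """Return True if any preset phrase occurs in the user's utterance."""
    normalized = user_semantics.strip().lower()
    # Assumption: substring matching; empty preset entries are ignored so
    # that they cannot match every utterance.
    return any(phrase and phrase.lower() in normalized for phrase in preset)
```

Exact matching or intent classification could replace the substring test without changing the surrounding decision flow.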
In the embodiment of the present invention, the preset judgment rule for the preset parameter may further include: a rule for judging whether the user voice word count is greater than or equal to a second preset word count threshold and whether the user voice semantics fail to match second preset semantics; thus, step 203 may further comprise the following sub-steps:
S15, when the user voice semantics do not match the first preset semantics, judging whether the user voice word count is greater than or equal to the second preset word count threshold;
S16, if so, matching the user voice semantics against the second preset semantics;
S17, when the matching fails, generating the third interruption identifier.
The second preset word count threshold is preset by the user. It is the critical word count that decides whether the user voice is eligible to interrupt the broadcast voice, and it can be set according to the user's habits and the usage scenario.
The second preset semantics are keywords preset by the user that, when detected, do not produce an interruption identifier.
In one example, when the user voice semantics do not match the first preset semantics, it cannot yet be determined whether to generate an interruption identifier, so it may be checked whether the user voice word count is greater than or equal to the second preset word count threshold. If so, the user voice semantics can be matched against the second preset semantics, and when the matching fails, a third interruption identifier can be generated.
For example, the second preset word count threshold may be a blacklist validation threshold and the second preset semantics may be a blacklist. When the user voice semantics are not found in the whitelist, it is judged whether the user voice word count is greater than or equal to the blacklist validation threshold; if so, the user voice semantics are matched against the blacklist. If the matching succeeds, no interruption identifier is generated; if the matching fails, a third interruption identifier for interrupting the broadcast voice is generated.
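Sub-steps S11 to S17 form a decision cascade, which can be sketched as below. The enum, the function name, and the use of exact set membership for matching are assumptions made for illustration; returning no identifier when the word count is below the second threshold is also an assumption, since the source does not state that case explicitly.

```python
from enum import Enum
from typing import Collection, Optional


class InterruptionId(Enum):
    FIRST = 1   # word count reached the first threshold
    SECOND = 2  # semantics matched the whitelist
    THIRD = 3   # long enough and not on the blacklist


def generate_interruption_id(
    word_count: int,
    semantics: str,
    whitelist: Collection[str],
    blacklist: Collection[str],
    first_threshold: int,
    second_threshold: int,
) -> Optional[InterruptionId]:
    # S11/S12: a long utterance interrupts directly.
    if word_count >= first_threshold:
        return InterruptionId.FIRST
    # S13/S14: a shorter utterance interrupts if it matches the whitelist.
    if semantics in whitelist:
        return InterruptionId.SECOND
    # S15-S17: otherwise it must reach the second threshold and miss the blacklist.
    if word_count >= second_threshold and semantics not in blacklist:
        return InterruptionId.THIRD
    # Assumption: anything else does not interrupt.
    return None
```

For instance, with thresholds 8 and 2, a whitelist of {"stop"} and a blacklist of {"uh-huh"}, the utterance "stop" yields the second identifier, while "uh-huh" with a word count of 3 yields none.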
Step 204, determining an interruption moment according to the current playing time and the preset judgment rule aiming at the preset parameter;
In the embodiment of the present invention, interruption may be disallowed for an initial period after playback of the broadcast voice starts. While the broadcast voice is being played, the current playing duration is tracked in real time, and the interruption time is determined according to the current playing duration and the preset judgment rule for the preset parameter.
In the embodiment of the present invention, the preset parameter may further include an allowable interruption duration; the preset judgment rule for the preset parameter may further include: a rule for judging whether the current playing duration is greater than or equal to the preset allowable interruption duration; the interruption time may comprise a first interruption time; thus, step 204 may include the following sub-steps:
S21, judging whether the current playing duration is greater than or equal to the preset allowable interruption duration;
S22, if so, determining the identifier generation time at which the interruption identifier is generated;
S23, determining the identifier generation time as the first interruption time.
When the current playing time length of the broadcast voice is acquired, the current playing time length can be compared with a preset allowable interruption time length. If the current playing time length is greater than or equal to the preset allowable interruption time length and the interruption identifier is generated at the moment, the broadcast voice can be immediately interrupted according to the interruption identifier, namely the identifier generation time for generating the interruption identifier can be determined as a first interruption time so as to interrupt the broadcast voice at the first interruption time.
In the embodiment of the present invention, the interruption time may further include a second interruption time; thus, step 204 may also include the following sub-step:
S24, when the current playing duration is less than the allowable interruption duration, determining the time at which the playing duration of the broadcast voice equals the allowable interruption duration as the second interruption time.
In addition, if the current playing duration is less than the allowable interruption duration, then even if the ASR has produced a recognition result and it is judged that interruption is needed, the interruption identifier is temporarily not returned. Instead, the broadcast voice is interrupted once the current playing duration reaches the allowable interruption duration.
In one example, two or more recognition results may occur while the current playing duration is less than the allowable interruption duration; in this case, only the first recognition result that generates an interruption identifier is responded to, and the interruption identifier is returned for that result.
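The timing rule of sub-steps S21 to S24 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `InterruptScheduler`, its fields, and the use of seconds as the unit are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterruptScheduler:
    """Hypothetical helper; names and units are assumptions for illustration."""

    allowed_interrupt_s: float            # allowable interruption duration
    scheduled_at: Optional[float] = None  # interruption time already chosen, if any

    def on_identifier(self, played_s: float) -> Optional[float]:
        """Handle an interruption identifier generated at playback offset `played_s`.

        Returns the playback offset at which to interrupt, or None when an
        earlier identifier has already scheduled the interruption.
        """
        if self.scheduled_at is not None:
            # Only the first identifier is responded to; later ones are ignored.
            return None
        if played_s >= self.allowed_interrupt_s:
            # S21-S23: the identifier generation time is the first interruption time.
            self.scheduled_at = played_s
        else:
            # S24: defer until playback reaches the allowable interruption duration.
            self.scheduled_at = self.allowed_interrupt_s
        return self.scheduled_at
```

With an allowable interruption duration of 1 s, an identifier generated at 0.3 s is deferred to 1 s, a second identifier at 0.6 s is ignored, and an identifier generated at 2.5 s on a fresh scheduler takes effect immediately.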
Step 205, interrupting the broadcast voice being played by using the interruption time and the interruption identifier.
In the embodiment of the invention, after the interruption time and the interruption identifier are obtained, the interruption identifier can be used to interrupt the broadcast voice being played at the interruption time.
Fig. 3 is a flow chart of an embodiment of a speech interruption method of the present invention. In one example, in order to adapt to the occurrence of various situations in an actual scene, the following parameters may be set to control the specific logic of interrupting the broadcast voice:
1. white list (first preset semantics): when the recognition result is found to match the white list, an interruption event is sent;
2. blacklist (second preset semantics): when the recognition result is found to match the blacklist, no interruption event is sent;
3. allowable interruption duration: interruption can be configured to take effect only after the broadcast voice has been played for a period of time;
4. blacklist validation threshold (second preset word number threshold): the blacklist is checked only when the word count of the recognition result reaches the blacklist validation threshold;
5. first preset word number threshold: when the word count of the recognition result reaches the first preset word number threshold, the black and white lists are no longer checked, and the interruption is performed directly.
The IVR system can pass these parameters to the speech recognition capability platform in the form of a grammar file during each voice interaction; after receiving voice, the speech recognition capability platform judges whether to return the relevant fields to the IVR system so as to interrupt the broadcast of the synthesized voice.
The logic and priority order for judging whether to interrupt are as follows:
1. when the current playing duration is less than the allowable interruption duration, no interruption is performed in any case;
2. when the word count of the recognition result is greater than or equal to the first preset word number threshold, the interruption identifier is sent and the broadcast is interrupted, and the recognition result is returned after recognition finishes;
3. when the word count of the recognition result is less than the first preset word number threshold:
1) if the white list is matched, the broadcast is interrupted, and the recognition result is returned after recognition finishes;
2) if the word count is less than the blacklist validation threshold and the white list is not matched, recognition finishes and no interruption is performed;
3) if the word count is greater than or equal to the blacklist validation threshold and the blacklist is matched, no interruption is performed;
4) if the word count is greater than or equal to the blacklist validation threshold and the blacklist is not matched, the interruption identifier is sent and the broadcast is interrupted, and the recognition result is returned after recognition finishes.
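The priority order above can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the patent's implementation: the names `Decision`, `BargeInParams`, and `decide` are hypothetical, the word count is taken to be the number of characters in the recognized text, list matching is taken to be exact string membership, and the deferred interruption of sub-step S24 is not modelled here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    NO_INTERRUPT = "no_interrupt"
    FIRST = "first_identifier"    # word count reached the first threshold
    SECOND = "second_identifier"  # white list matched
    THIRD = "third_identifier"    # long enough and not in the blacklist


@dataclass(frozen=True)
class BargeInParams:
    """Hypothetical container for the five preset parameters."""

    allowed_interrupt_s: float
    first_word_threshold: int
    blacklist_threshold: int
    whitelist: FrozenSet[str]
    blacklist: FrozenSet[str]


def decide(text: str, played_s: float, p: BargeInParams) -> Decision:
    # Rule 1: inside the protected window nothing is interrupted.
    if played_s < p.allowed_interrupt_s:
        return Decision.NO_INTERRUPT
    word_count = len(text)  # assumption: one character counts as one word
    # Rule 2: long utterances interrupt without consulting either list.
    if word_count >= p.first_word_threshold:
        return Decision.FIRST
    # Rule 3.1: a white list match interrupts.
    if text in p.whitelist:
        return Decision.SECOND
    # Rule 3.2: too short for the blacklist check, so no interruption.
    if word_count < p.blacklist_threshold:
        return Decision.NO_INTERRUPT
    # Rules 3.3 and 3.4: interrupt only if the blacklist does not match.
    if text in p.blacklist:
        return Decision.NO_INTERRUPT
    return Decision.THIRD
```

For instance, with thresholds of 5 and 2, a white list of `{"stop"}`, and a blacklist of `{"okay"}` (illustrative English stand-ins, not entries from the patent), `"stop"` yields the second identifier, `"okay"` yields no interruption, and `"no"` yields the third identifier.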
The following describes, for a specific usage scenario, whether different user voice inputs cause an interruption and why, when the parameters for interrupting the voice interaction are configured as follows:
1. "allowable interruption duration" is set to 1 s;
2. "first preset word number threshold" is set to 5 words (15 bytes under UTF-8 encoding);
3. "white list" is set to "I am; is I; i'm is; you wait for one drop";
4. "blacklist" is set to "hello; you say that; please say";
5. "blacklist validation threshold" is set to 2 words (6 bytes under UTF-8 encoding).
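The byte figures in these settings follow from UTF-8 encoding each common Chinese character as three bytes. The check below is only an illustration; the sample strings and the helper name `utf8_len` are assumptions chosen for the example and are not taken from the patent.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


two_chars = "你好"         # 2 characters
five_chars = "请稍等一下"  # 5 characters

print(len(two_chars), utf8_len(two_chars))    # 2 6
print(len(five_chars), utf8_len(five_chars))  # 5 15
```

A platform that compares byte lengths rather than character counts would therefore use 6 and 15 as the equivalent thresholds for Chinese input.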
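[Figure BDA0002429614300000131: table of judgment results and reasons for different user voice inputs; image not reproduced in the text]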
In the embodiment of the invention, when a user voice uttered by a user is received while a broadcast voice is being played, the current playing duration of the broadcast voice is obtained; the received user voice is recognized to obtain a recognition result; and the broadcast voice being played can then be interrupted by using the current playing duration and the recognition result, based on the preset judgment rule for the preset parameter. Because whether the broadcast needs to be interrupted is judged by applying rule-based checks to the recognition result, it can be determined effectively from the user's voice whether the broadcast voice should be interrupted; meanwhile, setting different preset parameters yields interruption conditions suited to different scenarios.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 4, a block diagram of a structure of an embodiment of a speech interruption apparatus of the present invention is shown, which may specifically include the following modules:
a current play time obtaining module 401, configured to obtain a current play time of a broadcast voice when a user voice sent by a user is received in a process of playing the broadcast voice;
an identification module 402, configured to identify the user voice to obtain an identification result;
the interrupting module 403 is configured to interrupt the broadcast voice being played by using the current playing time and the recognition result based on a preset judgment rule for a preset parameter.
In this embodiment of the present invention, the interrupting module 403 may include:
an interruption identifier generation submodule, configured to generate an interruption identifier according to the recognition result and the preset judgment rule for the preset parameter;
an interruption time determination submodule, configured to determine an interruption time according to the current playing duration and the preset judgment rule for the preset parameter;
and an interruption submodule, configured to interrupt the broadcast voice being played by using the interruption time and the interruption identifier.
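The module structure described above might be composed as in the following sketch. It is an assumption-laden illustration rather than the device itself: the class names, the injected `clock` and `recognizer` callables, and the `rule` callable standing in for the preset judgment rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RecognitionResult:
    text: str
    word_count: int


class InterruptModule:
    """Module 403, with its three submodules modelled as methods."""

    def __init__(self, allowed_interrupt_s: float,
                 rule: Callable[[RecognitionResult], bool]) -> None:
        self.allowed_interrupt_s = allowed_interrupt_s
        self.rule = rule  # stands in for the preset judgment rule

    def generate_identifier(self, result: RecognitionResult) -> bool:
        # Interruption identifier generation submodule.
        return self.rule(result)

    def determine_time(self, played_s: float) -> float:
        # Interruption time determination submodule: never earlier than
        # the allowable interruption duration.
        return max(played_s, self.allowed_interrupt_s)

    def interrupt(self, result: RecognitionResult,
                  played_s: float) -> Optional[float]:
        # Interruption submodule: returns the playback offset at which the
        # broadcast should stop, or None when no identifier is generated.
        if not self.generate_identifier(result):
            return None
        return self.determine_time(played_s)


class VoiceInterruptDevice:
    def __init__(self, clock: Callable[[], float],
                 recognizer: Callable[[bytes], RecognitionResult],
                 interrupter: InterruptModule) -> None:
        self.clock = clock              # module 401: current playing duration
        self.recognizer = recognizer    # module 402: recognition
        self.interrupter = interrupter  # module 403: interruption

    def on_user_voice(self, audio: bytes) -> Optional[float]:
        played_s = self.clock()
        result = self.recognizer(audio)
        return self.interrupter.interrupt(result, played_s)
```

Injecting the clock and recognizer as callables keeps the sketch independent of any particular playback engine or speech recognition platform.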
In the embodiment of the invention, the recognition result comprises the number of words of the user voice; the preset judgment rule aiming at the preset parameter comprises a rule for judging whether the word number of the voice of the user is greater than or equal to a first preset word number threshold value; the interruption identifier comprises a first interruption identifier; the interruption identifier generation submodule may include:
the first preset word number threshold judging unit is used for judging whether the user voice word number is larger than or equal to the first preset word number threshold;
a first break identifier generating unit, configured to generate the first break identifier.
In the embodiment of the invention, the recognition result further comprises user voice semantics; the preset judgment rule for the preset parameter further comprises: judging whether the user voice semantics are matched with the first preset semantics, wherein the interruption identifier also comprises a second interruption identifier; the interruption identifier generation sub-module may further include:
the first preset semantic matching unit is used for matching the user voice semantic in the first preset semantic when the user voice word number is smaller than the first preset word number threshold;
and the second interruption identifier generation unit is used for generating the second interruption identifier when the matching is successful.
In this embodiment of the present invention, the preset judgment rule for the preset parameter further includes: judging whether the user voice word number is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; the interruption identifier further comprises a third interruption identifier; the interruption identifier generation submodule may further include:
a second preset word number threshold judging unit, configured to judge, when the user voice semantics are not matched in the first preset semantics, whether the user voice word number is greater than or equal to the second preset word number threshold;
a second preset semantic matching unit, configured to match, if so, the user voice semantics in the second preset semantics;
and a third interruption identifier generation unit, configured to generate the third interruption identifier when the matching fails.
In the embodiment of the present invention, the preset parameter further includes an allowable interruption duration; the preset judgment rule for the preset parameter further comprises a rule for judging whether the current playing duration is greater than or equal to the preset allowable interruption duration; the interruption time comprises a first interruption time; the interruption time determination submodule may include:
a judging unit, configured to judge whether the current playing duration is greater than or equal to the preset allowable interruption duration;
an identifier generation time determination unit, configured to determine, if so, the identifier generation time at which the interruption identifier is generated;
a first interruption time determination unit, configured to determine the identifier generation time as the first interruption time.
In the embodiment of the present invention, the interruption time further includes a second interruption time; the interruption time determination submodule may further include:
a second interruption time determination unit, configured to determine, when the current playing duration is less than the allowable interruption duration, the time at which the playing duration of the broadcast voice equals the allowable interruption duration as the second interruption time.
For the device embodiment, since it is basically similar to the method embodiment, the description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
An embodiment of the present invention further provides an apparatus, including:
a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements each process of the above voice interruption method embodiment and achieves the same technical effect; to avoid repetition, the details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the embodiment of the speech interruption method, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The speech interruption method and the speech interruption device provided by the present invention are described in detail above, and the principle and the implementation of the present invention are explained in the present document by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method of speech interruption, comprising:
when a user voice uttered by a user is received while a broadcast voice is being played, acquiring the current playing duration of the broadcast voice;
recognizing the user voice to obtain a recognition result;
and interrupting the broadcast voice being played by using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter.
2. The method according to claim 1, wherein the step of interrupting the broadcast voice being played by using the current playing duration and the recognition result, based on a preset judgment rule for a preset parameter, comprises:
generating an interruption identifier according to the recognition result and the preset judgment rule for the preset parameter;
determining an interruption time according to the current playing duration and the preset judgment rule for the preset parameter;
and interrupting the broadcast voice being played by using the interruption time and the interruption identifier.
3. The method of claim 2, wherein the recognition result comprises a user voice word number; the preset judgment rule for the preset parameter comprises a rule for judging whether the user voice word number is greater than or equal to a first preset word number threshold; the interruption identifier comprises a first interruption identifier; the step of generating an interruption identifier according to the recognition result and the preset judgment rule for the preset parameter comprises:
judging whether the user voice word number is greater than or equal to the first preset word number threshold;
and if so, generating the first interruption identifier.
4. The method of claim 3, wherein the recognition result further comprises user speech semantics; the preset judgment rule for the preset parameter further comprises: judging whether the user voice semantics are matched with a first preset semantics; the interruption identifier further comprises a second interruption identifier; the method further comprises the following steps:
when the user voice word number is smaller than the first preset word number threshold value, matching the user voice semantics in the first preset semantics;
and when the matching is successful, generating the second interruption identifier.
5. The method according to claim 4, wherein the preset judgment rule for the preset parameter further comprises: judging whether the user voice word number is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with a second preset semantics; the interruption identifier further comprises a third interruption identifier; the method further comprises the following steps:
when the user voice semantics are not matched in the first preset semantics, judging whether the user voice word number is greater than or equal to a second preset word number threshold value or not;
if so, matching the user voice semantics in the second preset semantics;
and when the matching fails, generating the third interruption identifier.
6. The method according to claim 3, 4 or 5, wherein the preset parameter further comprises an allowable interruption duration; the preset judgment rule for the preset parameter further comprises a rule for judging whether the current playing duration is greater than or equal to the preset allowable interruption duration; the interruption time comprises a first interruption time; the step of determining the interruption time according to the current playing duration and the preset judgment rule for the preset parameter comprises:
judging whether the current playing duration is greater than or equal to the preset allowable interruption duration;
if so, determining the identifier generation time at which the interruption identifier is generated;
and determining the identifier generation time as the first interruption time.
7. The method of claim 6, wherein the interruption time further comprises a second interruption time; the method further comprises the following step:
when the current playing duration is less than the allowable interruption duration, determining the time at which the playing duration of the broadcast voice equals the allowable interruption duration as the second interruption time.
8. A speech interruption device, comprising:
a current playing duration obtaining module, configured to obtain the current playing duration of a broadcast voice when a user voice uttered by a user is received while the broadcast voice is being played;
the recognition module is used for recognizing the user voice to obtain a recognition result;
and the interruption module is used for interrupting the broadcast voice which is being played by adopting the current playing time length and the recognition result based on a preset judgment rule aiming at a preset parameter.
9. An apparatus, comprising: processor, memory and a computer program stored on the memory and capable of running on the processor, which computer program, when executed by the processor, carries out the steps of the method of interrupting a speech according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of interrupting a speech according to any one of claims 1-7.
CN202010232214.7A 2020-03-27 2020-03-27 Voice breaking method and device Active CN111540349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010232214.7A CN111540349B (en) 2020-03-27 2020-03-27 Voice breaking method and device

Publications (2)

Publication Number Publication Date
CN111540349A true CN111540349A (en) 2020-08-14
CN111540349B CN111540349B (en) 2023-10-10

Family

ID=71974815

Country Status (1)

Country Link
CN (1) CN111540349B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant