CN111540349B - Voice breaking method and device - Google Patents
Voice breaking method and device
- Publication number
- CN111540349B (application CN202010232214.7A)
- Authority
- CN
- China
- Prior art keywords
- preset
- voice
- breaking
- user
- semantics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/222—Barge in, i.e. overridable guidance for interrupting prompts
Abstract
The embodiment of the invention provides a method and a device for interrupting voice, comprising the following steps: when user voice from a user is received while a broadcast voice is being played, acquiring the current playing duration of the broadcast voice; recognizing the user voice to obtain a recognition result; and interrupting the broadcast voice being played by using the current playing duration and the recognition result, based on a preset judgment rule for preset parameters. In the embodiment of the invention, whether the broadcast voice needs to be interrupted is decided by applying rule checks to the recognition result, so the decision can be made reliably from the user voice, and the interaction requirements of different scenarios can be met by adjusting the preset parameters.
Description
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice breaking method and a voice breaking device.
Background
In man-machine interaction scenarios such as intelligent outbound calling and intelligent navigation, the outbound robot should feel to the customer like person-to-person communication. It therefore needs to imitate a normal human conversation: staying silent while the customer is speaking, answering after the customer has finished asking a question, and stopping its broadcast promptly when the customer interrupts during playback.
In current voice barge-in interaction flows, the logic that decides when to stop TTS (text-to-speech) playback is hard to control. Deciding from the speech-recognition side alone depends heavily on how the recognition engine handles noise and short utterances, which can cause false interruptions or missed interruptions; deciding through natural language processing instead adds considerable delay to the response time of the whole interaction.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention are directed to providing a method for interrupting speech and a corresponding apparatus for interrupting speech that overcome or at least partially solve the foregoing problems.
In order to solve the above problems, an embodiment of the present invention discloses a method for interrupting speech, including:
when receiving user voice sent by a user in the process of playing the broadcast voice, acquiring the current playing time length of the broadcast voice;
recognizing the user voice to obtain a recognition result;
and based on a preset judgment rule aiming at preset parameters, interrupting the broadcasting voice which is being played by adopting the current playing time length and the identification result.
Optionally, the step of interrupting the broadcasting voice being played by adopting the current playing duration and the recognition result based on a preset judgment rule for a preset parameter includes:
generating a breaking mark according to the identification result and the preset judgment rule aiming at preset parameters;
determining a breaking moment according to the current playing time length and the preset judging rule aiming at preset parameters;
and interrupting the broadcasting voice which is being played by adopting the interruption moment and the interruption mark.
Optionally, the recognition result includes the number of words of the user voice; the preset judgment rule for the preset parameters includes: a rule for judging whether the number of words of the user voice is greater than or equal to a first preset word number threshold; the breaking identifier includes a first breaking identifier; and the step of generating a breaking identifier according to the recognition result and the preset judgment rule for the preset parameters includes:
judging whether the number of the user voice words is larger than or equal to the first preset word number threshold;
if yes, the first breaking identification is generated.
Optionally, the recognition result further includes user speech semantics; the preset judging rule for the preset parameter further comprises: a rule for judging whether the voice semantics of the user are matched with a first preset semantics; the break mark also comprises a second break mark; the method further comprises the steps of:
when the number of the user voice words is smaller than the first preset word number threshold value, matching the user voice semantics in the first preset semantics;
and when the matching is successful, generating the second breaking mark.
Optionally, the preset judgment rule for the preset parameters further includes: a rule for judging whether the number of words of the user voice is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; the breaking identifier further includes a third breaking identifier; the method further comprises the following steps:
when the user voice semantics are not matched in the first preset semantics, judging whether the user voice word number is larger than or equal to the second preset word number threshold;
if yes, matching the user voice semantics in the second preset semantics;
and when the matching fails, generating the third interrupt identifier.
Optionally, the preset parameters further include a permissible interruption time length; the preset judging rule for the preset parameter further comprises: judging whether the current playing time length is greater than or equal to a rule of a preset allowable interrupt time length; the breaking time comprises a first breaking time; the step of determining the interruption time according to the current playing time and the preset judging rule aiming at the preset parameter comprises the following steps:
judging whether the current playing time length is greater than or equal to the preset allowable interrupt time length;
if yes, determining an identifier generation time for generating the interrupt identifier;
and determining the mark generation time as the first breaking time.
Optionally, the breaking moment further comprises a second breaking moment; the method further comprises the following steps:
and when the current playing time length is smaller than the allowed breaking time length, determining the time when the broadcasting time length of the broadcasting voice is equal to the allowed breaking time length as the second breaking time.
The embodiment of the invention also discloses a device for interrupting the voice, which comprises the following steps:
the current playing time length acquisition module is used for acquiring the current playing time length of the broadcast voice when receiving the user voice sent by the user in the process of playing the broadcast voice;
the recognition module is used for recognizing the user voice to obtain a recognition result;
and the breaking module is used for breaking the broadcasting voice which is being played by adopting the current playing time length and the identification result based on a preset judging rule aiming at preset parameters.
Optionally, the breaking module includes:
the breaking mark generation sub-module is used for generating a breaking mark according to the identification result and the preset judgment rule aiming at the preset parameter;
a breaking moment determining sub-module, configured to determine a breaking moment according to the current playing duration and the preset judging rule for the preset parameter;
and the breaking submodule is used for breaking the broadcasting voice which is being played by adopting the breaking moment and the breaking mark.
Optionally, the recognition result includes a number of words of the user's voice; the preset judging rules aiming at the preset parameters comprise rules for judging whether the number of the voice words of the user is larger than or equal to a first preset word number threshold value; the breaking mark comprises a first breaking mark; the interrupt identifier generation sub-module includes:
a first preset word number threshold value judging unit, configured to judge whether the number of words of the user voice is greater than or equal to the first preset word number threshold value;
and the first breaking identifier generating unit is used for generating the first breaking identifier.
Optionally, the recognition result further includes user speech semantics; the preset judging rule for the preset parameter further comprises: judging whether the voice semantics of the user are matched with the rules of the first preset semantics or not, wherein the breaking identification further comprises a second breaking identification; the interrupt identifier generation sub-module further includes:
the first preset semantic matching unit is used for matching the user voice semantics in the first preset semantics when the number of the user voice words is smaller than the first preset word number threshold;
and the second breaking identifier generating unit is used for generating the second breaking identifier when the matching is successful.
Optionally, the preset judgment rule for the preset parameters further includes: a rule for judging whether the number of words of the user voice is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; the breaking identifier further includes a third breaking identifier; the interrupt identifier generation sub-module further includes:
a second preset word number threshold judging sub-module, configured to judge whether the number of words of the user voice is greater than or equal to the second preset word number threshold when the user voice semantics are not matched in the first preset semantics;
the second preset semantic matching unit is used for matching the user voice semantics in the second preset semantics;
and the third breaking identification unit is used for generating the third breaking identification when the matching fails.
Optionally, the preset parameters further include a permissible interruption time length; the preset judging rule for the preset parameter further comprises: judging whether the current playing time length is greater than or equal to a rule of a preset allowable interrupt time length; the breaking time comprises a first breaking time; the breaking moment determining submodule comprises:
the judging unit is used for judging whether the current playing time length is greater than or equal to the preset allowable interrupt time length;
the mark generation time determining unit is used for determining mark generation time for generating the interrupt mark;
and the first breaking moment determining unit is used for determining the identification generation moment as the first breaking moment.
Optionally, the breaking moment further comprises a second breaking moment; the breaking moment determining submodule further comprises:
and the second breaking moment determining unit is used for determining the moment when the broadcasting time of the broadcasting voice is equal to the allowable breaking time when the current playing time is smaller than the allowable breaking time as the second breaking moment.
The embodiment of the invention also discloses a device, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, which when executed by the processor performs the steps of the speech disruption method according to any one of the preceding claims.
The embodiment of the invention also discloses a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the voice breaking method according to any one of the above steps when being executed by a processor.
The embodiment of the invention has the following advantages: when user voice from the user is received while the broadcast voice is being played, the current playing duration of the broadcast voice is acquired; the received user voice is recognized to obtain a recognition result; and the broadcast voice being played can then be interrupted using the current playing duration and the recognition result, based on the preset judgment rule for the preset parameters. In the embodiment of the invention, whether the broadcast voice needs to be interrupted is decided by applying rule checks to the recognition result, so the decision can be made reliably from the user voice, and the interaction requirements of different scenarios can be met by adjusting the preset parameters.
Drawings
FIG. 1 is a flowchart illustrating steps of a first embodiment of a speech disruption method according to the present invention;
FIG. 2 is a flow chart of steps of a second embodiment of a speech breaking method of the present invention;
FIG. 3 is a flow chart of an embodiment of a method of interrupting speech according to the present invention;
fig. 4 is a block diagram of an embodiment of a speech breaking device of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
In the intelligent outbound and navigation solution, the intelligent semantic interaction technology relates to logic that needs to interrupt TTS broadcasting when a user speaks so as to give a more intelligent and humanized experience to the called user. The current logic of voice interaction interruption is mainly realized by the following two modes:
1) Pure speech detection
Here the barge-in logic depends entirely on the speech recognition engine's judgment of the audio: TTS playback is interrupted as soon as voice is detected on the user side. This approach is common in current voice interaction products, but it suffers from unavoidable false interruptions:
During a telephone interaction the customer's acoustic environment is largely unpredictable; on a noisy street or in a crowded place, the audio picked up from the call is very likely to contain background noise, and current speech recognition technology cannot perfectly filter out the surrounding environment, so the system may misrecognize it and interrupt by mistake. Moreover, even with good noise reduction, meaningless utterances from the user (such as filler words) would still trigger an interruption, which again degrades the overall voice interaction experience.
2) Semantic understanding detection (NLU, Natural Language Understanding)
Here the barge-in logic relies on natural-language processing of the speech recognition result, which requires adding a natural language understanding capability; playback is interrupted only after semantic understanding confirms that the current customer intends to interrupt.
This approach reduces false interruptions to some extent, but on the one hand the recognized text has to go through an additional understanding call, which adds a noticeable delay to the overall response time of the product; on the other hand, natural-language processing rules generally require resources to be reloaded, so the logic that controls whether the customer's speech interrupts the session in different scenarios cannot be adjusted freely and is inconvenient to use.
In view of the above problems, one of the core ideas of the embodiments of the present invention is to provide a voice breaking method, which is to obtain a recognition result by recognizing a user voice received during the playing of a broadcast voice, and break the playing broadcast voice by using the current playing time length of the broadcast voice and the recognition result based on a preset judgment rule for a preset parameter.
Referring to fig. 1, which is a flowchart of the steps of a first embodiment of the voice interruption method of the present invention, the method may specifically include the following steps:
step 101, when receiving user voice sent by a user in the process of playing broadcast voice, acquiring the current playing time length of the broadcast voice;
When man-machine interaction is performed in intelligent outbound and intelligent navigation scenarios, the outbound robot needs to simulate a normal human conversation so that the customer perceives something close to person-to-person communication: it stays silent while the user is speaking, answers after the user has finished asking a question, and, when the user speaks up to interrupt during playback, stops the broadcast voice in time.
In the embodiment of the invention, when the outbound robot receives the user voice sent by the user in the process of playing the broadcast voice, the current playing time length of the broadcast voice can be obtained first, and the current time length can be used for judging whether to immediately perform voice interruption on the broadcast voice.
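As an illustration of step 101, the minimal sketch below tracks how long the broadcast voice has been playing at the moment user speech is detected. It is only an assumption-laden example: the class name, the use of Python's `time.monotonic()` and the millisecond unit are illustrative choices rather than anything specified by the embodiment.

```python
import time
from typing import Optional

class BroadcastSession:
    """Tracks playback of one broadcast (TTS) prompt - a minimal sketch;
    a real deployment would hook into the telephony/IVR stack instead."""

    def __init__(self) -> None:
        self._start: Optional[float] = None

    def start_playback(self) -> None:
        # Called when the broadcast voice starts playing.
        self._start = time.monotonic()

    def current_play_duration_ms(self) -> int:
        # Called when user voice is detected during playback (step 101):
        # how long the broadcast voice has already been playing, in ms.
        if self._start is None:
            return 0
        return int((time.monotonic() - self._start) * 1000)
```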
Step 102, recognizing the user voice to obtain a recognition result;
after receiving the user voice, the user voice can be recognized, and a recognition result is obtained. For example, the user's speech may be converted into text information by ASR (Automatic Speech Recognition, automatic speech recognition technology), and the text information may be analyzed to obtain information including word count, semantics, and the like.
And step 103, interrupting the broadcasting voice being played by adopting the current playing time length and the identification result based on a preset judging rule aiming at preset parameters.
In the embodiment of the invention, the preset parameters are the values against which the obtained recognition result and the current playing duration are compared, and may specifically include preset semantics, word-number thresholds and the like. Both the preset parameters and the judgment rules defined on them can be adjusted in real time according to the user's needs, so as to suit the interaction requirements of different scenarios.
After the recognition result is obtained, the broadcasting voice being played can be interrupted by adopting the current playing duration and the recognition result based on a preset judgment rule aiming at a preset parameter.
In an example, the preset judging rule for the preset parameter may specifically include comparing the identification result and the current playing duration with the preset parameter to obtain a comparison result, and further judging whether the comparison result meets the preset judging rule. If yes, the broadcasting voice being played can be interrupted.
According to the embodiment of the invention, when user voice from the user is received while the broadcast voice is being played, the current playing duration of the broadcast voice is acquired; the received user voice is recognized to obtain a recognition result; and the broadcast voice being played can then be interrupted using the current playing duration and the recognition result, based on the preset judgment rule for the preset parameters. Whether the broadcast voice needs to be interrupted is decided by applying rule checks to the recognition result, so the decision can be made reliably from the user voice, and voice interruption behaviour suited to different scenarios can be obtained by setting different preset parameters.
Referring to fig. 2, which is a flowchart of the steps of a second embodiment of the voice interruption method of the present invention, the method may specifically include the following steps:
step 201, when receiving user voice sent by a user in the process of playing broadcast voice, acquiring the current playing time length of the broadcast voice;
in the embodiment of the invention, when the outbound robot receives the user voice sent by the user in the process of playing the broadcast voice, the current playing time length of the broadcast voice can be obtained first, and the current time length can be used for judging whether to immediately perform voice interruption on the broadcast voice.
Step 202, recognizing the user voice to obtain a recognition result;
Further, after the user voice is received it can be recognized to obtain a recognition result. For example, the user voice may be monitored through an IVR (Interactive Voice Response) system, converted into text by ASR (Automatic Speech Recognition), and the text then analyzed to obtain information such as the word count and the semantics.
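To make this step concrete, the sketch below turns an ASR transcript into the two pieces of information that the later judgment rules use: a word count and the text used for matching against the preset semantics. Counting each CJK character as one word and treating the raw transcript as the "semantics" are simplifying assumptions of this sketch, not requirements of the embodiment.

```python
import re
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    word_count: int   # number of words in the user voice
    semantics: str    # text used for matching against the preset semantics

def analyze_transcript(text: str) -> RecognitionResult:
    """Build a recognition result from an ASR transcript (illustrative helper).

    Each CJK character counts as one word; a run of Latin letters or digits
    counts as a single word.
    """
    tokens = re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)
    return RecognitionResult(word_count=len(tokens), semantics=text.strip())
```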
Step 203, generating a breaking identifier according to the identification result and the preset judgment rule for the preset parameter;
in the embodiment of the present invention, the preset parameters may include: the method comprises the steps of first preset semantics, second preset semantics, a first preset word number threshold value, a second preset word number threshold value and a permissible interruption duration.
In an example, the preset judgment rule for the preset parameter may specifically include a rule for comparing the recognition result with the preset parameter to obtain a comparison result, and judging whether to interrupt the broadcast voice according to the comparison result.
In the embodiment of the invention, the recognition result can comprise the number of words of the user voice; the preset judgment rule for the preset parameter may include: a rule for judging whether the number of the voice words of the user is larger than or equal to a first preset word number threshold value; the break identifier may include a first break identifier; thus, step 203 may comprise the sub-steps of:
s11, judging whether the number of the user voice words is larger than or equal to the first preset word number threshold;
and S12, if yes, generating the first breaking identification.
The first preset word number threshold is a preset threshold value of the number of the user voice words capable of directly interrupting the broadcasting voice being played, and the value of the first preset word number threshold value can be set according to the personal use condition of the user.
After the number of words of the user voice is determined by recognizing the user voice, judging whether the number of words of the user voice is larger than or equal to a first preset word number threshold value, and if so, proving that the number of words of the user voice meets the requirement of directly interrupting the broadcast voice. At this point, a first break indicator may be generated.
In the embodiment of the invention, the recognition result can also comprise user voice semantics; the preset judgment rule for the preset parameter may further include: a rule for judging whether the voice semantics of the user are matched with the first preset semantics; the break mark also comprises a second break mark; thus, step 203 may further comprise the sub-steps of:
s13, when the number of the user voice words is smaller than the first preset word number threshold, matching the user voice semantics in the first preset semantics;
and S14, when the matching is successful, generating the second breaking mark.
The first preset semantics are keywords configured in advance by the user; when such a keyword is detected in the user voice, a breaking identifier can be generated.
In one example, the semantic recognition can be performed on the received user voice to obtain the semantic of the user voice, and matching is performed according to the semantic and preset semantic, so that whether to generate the breaking identifier is judged according to the matching result. For example, a white list may be preset for storing a plurality of preset semantics. After the user voice semantics are obtained through recognition, the user voice semantics are matched in the white list, and when the matching is successful, a second breaking identification for breaking the broadcasting voice can be generated.
In the embodiment of the present invention, the preset judgment rule for the preset parameters may further include: a rule for judging whether the number of words of the user voice is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; thus, step 203 may further comprise the sub-steps of:
s15, judging whether the number of the user voice words is larger than or equal to the second preset word number threshold value or not when the user voice semantics are not matched in the first preset semantics;
s16, if yes, matching the user voice semantics in the second preset semantics;
and S17, when the matching fails, generating the third interrupt identifier.
The second preset word number threshold is a word-count threshold configured in advance by the user, above which the user voice is considered long enough to possibly interrupt the broadcast voice; its value can be set according to the user's habits and the usage scenario.
The second preset semantics are keywords configured in advance by the user for which no breaking identifier should be generated when they are detected.
In one example, when the user speech semantics are not matched in the first preset semantics, it cannot be determined whether to generate the break flag, at which time it may be detected whether the number of user speech words is greater than or equal to a second preset word number threshold. If so, the user voice semantics can be matched in the second preset semantics, and when the matching fails, a third interrupt identifier can be generated.
For example, the second word count threshold may be a blacklist validation threshold and the second preset phonetic semantics may be a blacklist. When the voice semantics of the user cannot be detected in the white list, judging whether the word number of the voice of the user is larger than or equal to a blacklist effective threshold value; if yes, matching the voice semantics of the user in the blacklist, and if the matching is successful, not generating a breaking mark; if the matching fails, a third interrupt identifier for interrupting the broadcast voice is generated.
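Sub-steps S11 to S17 can be read as a single rule-checking function. The sketch below is one possible reading of those sub-steps, under the assumptions that the whitelist and blacklist are matched by simple substring search and that returning `None` means the broadcast keeps playing; the return values 1, 2 and 3 stand for the first, second and third breaking identifiers.

```python
from typing import Iterable, Optional

def generate_break_flag(word_count: int,
                        semantics: str,
                        whitelist: Iterable[str],   # first preset semantics
                        blacklist: Iterable[str],   # second preset semantics
                        first_word_threshold: int,  # first preset word number threshold
                        blacklist_threshold: int    # second preset word number threshold
                        ) -> Optional[int]:
    """Return 1, 2 or 3 (a breaking identifier) when the broadcast should be
    interrupted, or None when it should keep playing."""
    # S11/S12: long enough utterance -> interrupt directly (first identifier).
    if word_count >= first_word_threshold:
        return 1
    # S13/S14: shorter utterance that matches the whitelist -> interrupt
    # (second identifier).
    if any(phrase in semantics for phrase in whitelist):
        return 2
    # S15-S17: long enough for the blacklist check; interrupt only when the
    # blacklist is NOT matched (third identifier).
    if word_count >= blacklist_threshold:
        if any(phrase in semantics for phrase in blacklist):
            return None   # blacklist hit: no breaking identifier is generated
        return 3
    return None           # too short and no whitelist hit: keep playing
```

Combined with the allowed-interruption-duration check described in step 204 below, this corresponds to the priority order listed later in the specific example.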
Step 204, determining a breaking moment according to the current playing time length and the preset judging rule aiming at the preset parameter;
in the embodiment of the invention, the device can be configured to not be interrupted in a period of time when broadcasting voice begins to be played. When broadcasting the broadcast voice, counting the current broadcasting time in real time so as to determine the breaking moment according to the current broadcasting time and a preset judging rule aiming at preset parameters.
In the embodiment of the invention, the preset parameters can also comprise a time length for allowing interruption; the preset judgment rule for the preset parameter may further include: judging whether the current playing time length is greater than or equal to a rule of a preset allowable interrupt time length; the break time may include a first break time; thus, step 204 may include the sub-steps of:
s21, judging whether the current playing time length is greater than or equal to the preset allowable interrupt time length;
s22, if yes, determining a mark generation time for generating the interrupt mark;
s23, determining the mark generation time as the first breaking time.
When the current playing time length of the broadcasting voice is obtained, the current playing time length can be compared with the preset allowable interrupt time length. If the current playing time is longer than or equal to the preset allowable breaking time and the breaking mark is generated at the moment, the broadcasting voice can be immediately broken according to the breaking mark, namely the mark generation time for generating the breaking mark can be determined as a first breaking time, and the broadcasting voice is broken at the first breaking time.
In the embodiment of the invention, the breaking time can further comprise a second breaking time; thus, step 204 may further comprise the sub-steps of:
and S24, when the current playing time length is smaller than the allowed breaking time length, determining the time when the broadcasting time length of the broadcasting voice is equal to the allowed breaking time length as the second breaking time.
In addition, if the current playing duration is still shorter than the allowed interruption duration, the breaking identifier is not returned immediately even if the ASR already has a recognition result and the rules indicate that an interruption is required; the broadcast voice is interrupted only once the current playing duration reaches the allowed interruption duration.
In one example, two or more recognition results may arrive while the current playing duration is still shorter than the allowed interruption duration; in that case only the first recognition result that produces a breaking identifier is responded to and returned.
And step 205, interrupting the broadcasting voice which is being played by adopting the interruption moment and the interruption mark.
In the embodiment of the invention, after the breaking moment and the breaking mark are obtained, the breaking mark can be adopted to break the broadcasting voice which is being played at the breaking moment.
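A minimal sketch of the timing logic in sub-steps S21 to S24 and step 205 follows: if the broadcast has already played past the allowed interruption duration, it is stopped at the moment the breaking identifier was generated; otherwise the stop is deferred until the allowed duration is reached. The function name and the `stop_playback` callback are assumptions standing in for whatever TTS/IVR control call a deployment actually exposes.

```python
import threading
from typing import Callable

def interrupt_broadcast(current_play_ms: int,
                        allowed_break_ms: int,
                        break_flag: int,
                        stop_playback: Callable[[int], None]) -> None:
    """Stop the broadcast voice at the breaking moment, carrying the breaking identifier."""
    if current_play_ms >= allowed_break_ms:
        # First breaking moment: the identifier-generation moment; stop immediately.
        stop_playback(break_flag)
    else:
        # Second breaking moment: wait until playback reaches the allowed duration.
        delay_s = (allowed_break_ms - current_play_ms) / 1000.0
        threading.Timer(delay_s, stop_playback, args=(break_flag,)).start()
```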
Fig. 3 is a flow chart of an embodiment of a speech breaking method of the present invention. In one example, to adapt to the occurrence of various situations in an actual scenario, the following parameters may be set to control the specific logic that interrupts the broadcast voice:
1. Whitelist (first preset semantics): when the recognition result matches the whitelist, a breaking event is sent;
2. Blacklist (second preset semantics): when the recognition result matches the blacklist, no breaking event is sent;
3. Allowed interruption duration: interruption is permitted only after the broadcast voice has been playing for this configured period;
4. Blacklist validation threshold (second preset word number threshold): the blacklist is checked only when the word count of the recognition result reaches this threshold;
5. First preset word number threshold: when the word count of the recognition result reaches this threshold, neither list is checked and the broadcast is interrupted directly.
These parameters can be transmitted by the IVR system to the speech recognition capability platform as a grammar file during the voice interaction; after receiving the user voice, the recognition platform decides whether to return the relevant fields to the IVR system so as to interrupt the speech-synthesis broadcast.
The logic principle and priority order for judging whether to interrupt are as follows:
1. If the current playing duration is shorter than the allowed interruption duration, no interruption is performed in any case;
2. When the word count of the recognition result is greater than or equal to the first preset word number threshold, a breaking identifier is sent and the broadcast is interrupted; the recognition result is returned once recognition completes;
3. When the word count of the recognition result is smaller than the first preset word number threshold:
1) if the whitelist is matched, the broadcast is interrupted and the recognition result is returned once recognition completes;
2) if the word count is smaller than the blacklist validation threshold and the whitelist is not matched, recognition completes without any interruption;
3) if the word count is greater than or equal to the blacklist validation threshold and the blacklist is matched, no interruption is performed;
4) if the word count is greater than or equal to the blacklist validation threshold, the blacklist is not matched and recognition has completed, a breaking identifier is sent, the broadcast is interrupted, and the recognition result is returned once recognition completes.
In a concrete usage scenario, the following parameter settings illustrate how different user voice inputs lead to an interruption or not, and why (a worked sketch follows the list below):
The voice barge-in parameters are configured as follows:
the "allowed interruption duration" is set to 1 s;
the "first preset word number threshold" is set to 5 words (15 bytes in UTF-8 encoding);
the "whitelist" is set to phrases along the lines of "I am; it's me; I'm here; wait a moment";
the "blacklist" is set to phrases along the lines of "hello; you say; go ahead";
the "blacklist validation threshold" is set to 2 words (6 bytes in UTF-8 encoding).
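Since the scenario-by-scenario table is not reproduced here, the sketch below simply lists a few hypothetical user utterances together with the outcome the priority rules above would give them under this configuration. Both the utterances and the English keyword phrases are illustrative assumptions; the counts treat each whitespace-separated token as one word.

```python
# Hypothetical utterances and the outcome implied by the priority rules above,
# assuming playback has already passed the 1 s allowed interruption duration.
examples = [
    ("please stop talking for a second", "interrupt: 6 words >= first threshold (5)"),
    ("wait a moment",                    "interrupt: < 5 words but whitelist matched"),
    ("hello",                            "no interruption: 1 word < blacklist threshold (2), no whitelist match"),
    ("go ahead please",                  "no interruption: >= 2 words and blacklist matched"),
    ("turn it off",                      "interrupt: >= 2 words, no whitelist or blacklist match"),
]

for utterance, outcome in examples:
    print(f"{utterance!r:40s} -> {outcome}")
```

If any of these utterances arrived within the first second of playback, the interruption would be deferred until the allowed interruption duration is reached, per rule 1 above.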
According to the embodiment of the invention, when user voice from the user is received while the broadcast voice is being played, the current playing duration of the broadcast voice is acquired; the received user voice is recognized to obtain a recognition result; and the broadcast voice being played can then be interrupted using the current playing duration and the recognition result, based on the preset judgment rule for the preset parameters. Whether the broadcast voice needs to be interrupted is decided by applying rule checks to the recognition result, so the decision can be made reliably from the user voice, and voice interruption behaviour suited to different scenarios can be obtained by setting different preset parameters.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 4, a block diagram of an embodiment of a speech breaking device according to the present invention is shown, and may specifically include the following modules:
the current playing time length obtaining module 401 is configured to obtain a current playing time length of the broadcast voice when receiving a user voice sent by a user in a process of playing the broadcast voice;
the recognition module 402 is configured to recognize the user voice to obtain a recognition result;
and a breaking module 403, configured to break the broadcasting voice being played by adopting the current playing duration and the recognition result based on a preset judging rule for a preset parameter.
In an embodiment of the present invention, the breaking module 403 may include:
the breaking mark generation sub-module is used for generating a breaking mark according to the identification result and the preset judgment rule aiming at the preset parameter;
a breaking moment determining sub-module, configured to determine a breaking moment according to the current playing duration and the preset judging rule for the preset parameter;
and the breaking submodule is used for breaking the broadcasting voice which is being played by adopting the breaking moment and the breaking mark.
In the embodiment of the invention, the recognition result comprises the number of words of the user voice; the preset judging rules aiming at the preset parameters comprise rules for judging whether the number of the voice words of the user is larger than or equal to a first preset word number threshold value; the breaking mark comprises a first breaking mark; the interrupt identifier generation sub-module may include:
a first preset word number threshold value judging unit, configured to judge whether the number of words of the user voice is greater than or equal to the first preset word number threshold value;
and the first breaking identifier generating unit is used for generating the first breaking identifier.
In the embodiment of the invention, the recognition result also comprises user voice semantics; the preset judging rule for the preset parameter further comprises: judging whether the voice semantics of the user are matched with the rules of the first preset semantics or not, wherein the breaking identification further comprises a second breaking identification; the interrupt identifier generating sub-module may further include:
the first preset semantic matching unit is used for matching the user voice semantics in the first preset semantics when the number of the user voice words is smaller than the first preset word number threshold;
and the second breaking identifier generating unit is used for generating the second breaking identifier when the matching is successful.
In the embodiment of the present invention, the preset judgment rule for the preset parameters further includes: a rule for judging whether the number of words of the user voice is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; the breaking identifier further includes a third breaking identifier; the interrupt identifier generating sub-module may further include:
a second preset word number threshold judging sub-module, configured to judge whether the number of words of the user voice is greater than or equal to the second preset word number threshold when the user voice semantics are not matched in the first preset semantics;
the second preset semantic matching unit is used for matching the user voice semantics in the second preset semantics;
and the third breaking identification unit is used for generating the third breaking identification when the matching fails.
In the embodiment of the invention, the preset parameters further comprise allowable breaking time length; the preset judging rule for the preset parameter further comprises: judging whether the current playing time length is greater than or equal to a rule of a preset allowable interrupt time length; the breaking time comprises a first breaking time; the breaking moment determining submodule may include:
the judging unit is used for judging whether the current playing time length is greater than or equal to the preset allowable interrupt time length;
the mark generation time determining unit is used for determining mark generation time for generating the interrupt mark;
and the first breaking moment determining unit is used for determining the identification generation moment as the first breaking moment.
In the embodiment of the invention, the breaking time also comprises a second breaking time; the breaking moment determining submodule may further include:
and the second breaking moment determining unit is used for determining the moment when the broadcasting time of the broadcasting voice is equal to the allowable breaking time when the current playing time is smaller than the allowable breaking time as the second breaking moment.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The embodiment of the invention also provides a device, which comprises:
the system comprises a processor, a memory and a computer program which is stored in the memory and can run on the processor, wherein the computer program realizes the processes of the voice breaking method embodiment when being executed by the processor, can achieve the same technical effects, and is not repeated here.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, realizes the processes of the above-mentioned voice breaking method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the description is omitted here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
The foregoing has outlined a speech breaking method and a speech breaking device according to the present invention, and specific examples have been applied to illustrate the principles and embodiments of the present invention, the above examples being only for aiding in the understanding of the method and core idea of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.
Claims (8)
1. A method of interrupting speech, comprising:
when receiving user voice sent by a user in the process of playing the broadcast voice, acquiring the current playing time length of the broadcast voice;
recognizing the user voice to obtain a recognition result;
based on a preset judgment rule aiming at preset parameters, interrupting the broadcasting voice which is being played by adopting the current playing time length and the identification result, and the method comprises the following steps: generating a breaking mark according to the identification result and the preset judgment rule aiming at preset parameters; determining a breaking moment according to the current playing time length and the preset judging rule aiming at preset parameters; interrupting the broadcasting voice which is being played by adopting the interruption moment and the interruption mark;
the preset judgment rule for the preset parameters comprises: a rule for judging whether the number of words of the user voice is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; the breaking identifiers comprise a third breaking identifier; the method further comprises the following steps:
when the user voice semantics are not matched in the first preset semantics, judging whether the user voice word number is larger than or equal to the second preset word number threshold; if yes, matching the user voice semantics in the second preset semantics; and when the matching fails, generating the third interrupt identifier.
2. The method of claim 1, wherein the recognition result comprises a number of words of a user's voice; the preset judging rule for the preset parameters comprises the following steps: a rule for judging whether the number of the user voice words is larger than or equal to a first preset word number threshold value; the breaking mark comprises a first breaking mark; the step of generating a breaking mark according to the identification result and a preset judgment rule of the preset parameter comprises the following steps:
judging whether the number of the user voice words is larger than or equal to the first preset word number threshold;
if yes, the first breaking identification is generated.
3. The method of claim 2, wherein the recognition result further comprises user speech semantics; the preset judging rule for the preset parameter further comprises: a rule for judging whether the voice semantics of the user are matched with a first preset semantics; the break mark also comprises a second break mark; the method further comprises the steps of:
when the number of the user voice words is smaller than the first preset word number threshold value, matching the user voice semantics in the first preset semantics;
and when the matching is successful, generating the second breaking mark.
4. A method according to claim 1 or 2 or 3, wherein the preset parameters further comprise a permissible interruption time period; the preset judging rule for the preset parameter further comprises: judging whether the current playing time length is greater than or equal to a rule of a preset allowable interrupt time length; the breaking time comprises a first breaking time; the step of determining the interruption time according to the current playing time and the preset judging rule aiming at the preset parameter comprises the following steps:
judging whether the current playing time length is greater than or equal to the preset allowable interrupt time length;
if yes, determining an identifier generation time for generating the interrupt identifier;
and determining the mark generation time as the first breaking time.
5. The method of claim 4, wherein the break-out time further comprises a second break-out time; the method further comprises the following steps:
and when the current playing time length is smaller than the allowed breaking time length, determining the time when the broadcasting time length of the broadcasting voice is equal to the allowed breaking time length as the second breaking time.
6. A speech breaking device, comprising:
the current playing time length acquisition module is used for acquiring the current playing time length of the broadcast voice when receiving the user voice sent by the user in the process of playing the broadcast voice;
the recognition module is used for recognizing the user voice to obtain a recognition result;
the breaking module is configured to break the broadcasting voice being played by adopting the current playing duration and the recognition result based on a preset judging rule for a preset parameter, and includes: the breaking mark generation sub-module is used for generating a breaking mark according to the identification result and the preset judgment rule aiming at the preset parameter; a breaking moment determining sub-module, configured to determine a breaking moment according to the current playing duration and the preset judging rule for the preset parameter; the breaking submodule is used for breaking the broadcasting voice which is being played by adopting the breaking moment and the breaking mark;
the preset judgment rule for the preset parameters comprises: a rule for judging whether the number of words of the user voice is greater than or equal to a second preset word number threshold and whether the user voice semantics are not matched with the second preset semantics; the breaking identifiers comprise a third breaking identifier; the interrupt identifier generation sub-module further includes:
the second preset word number threshold judging sub-module is used for judging whether the user voice word number is larger than or equal to the second preset word number threshold when the user voice semantics are not matched in the first preset semantics; the second preset semantic matching unit is used for matching the user voice semantics in the second preset semantics; and the third breaking identification unit is used for generating the third breaking identification when the matching fails.
7. An apparatus, comprising: a processor, a memory and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements the steps of the speech disruption method according to any one of claims 1 to 5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the speech breaking method according to any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010232214.7A CN111540349B (en) | 2020-03-27 | 2020-03-27 | Voice breaking method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010232214.7A CN111540349B (en) | 2020-03-27 | 2020-03-27 | Voice breaking method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111540349A CN111540349A (en) | 2020-08-14 |
CN111540349B true CN111540349B (en) | 2023-10-10 |
Family
ID=71974815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010232214.7A Active CN111540349B (en) | 2020-03-27 | 2020-03-27 | Voice breaking method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111540349B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112185393A (en) * | 2020-09-30 | 2021-01-05 | 深圳供电局有限公司 | Voice recognition processing method for power supply intelligent client |
CN112185392A (en) * | 2020-09-30 | 2021-01-05 | 深圳供电局有限公司 | Voice recognition processing system for power supply intelligent client |
CN112037799B (en) * | 2020-11-04 | 2021-04-06 | 深圳追一科技有限公司 | Voice interrupt processing method and device, computer equipment and storage medium |
CN112714058B (en) * | 2020-12-21 | 2023-05-12 | 浙江百应科技有限公司 | Method, system and electronic device for immediately interrupting AI voice |
CN112669842A (en) * | 2020-12-22 | 2021-04-16 | 平安普惠企业管理有限公司 | Man-machine conversation control method, device, computer equipment and storage medium |
CN113779208A (en) * | 2020-12-24 | 2021-12-10 | 北京汇钧科技有限公司 | Method and device for man-machine conversation |
CN112700775B (en) * | 2020-12-29 | 2024-07-26 | 维沃移动通信有限公司 | Voice receiving period updating method and device and electronic equipment |
CN113113013B (en) * | 2021-04-15 | 2022-03-18 | 北京帝派智能科技有限公司 | Intelligent voice interaction interruption processing method, device and system |
CN113160817B (en) * | 2021-04-22 | 2024-06-28 | 平安科技(深圳)有限公司 | Voice interaction method and system based on intention recognition |
CN113488024B (en) * | 2021-05-31 | 2023-06-23 | 杭州摸象大数据科技有限公司 | Telephone interrupt recognition method and system based on semantic recognition |
CN113656550A (en) * | 2021-08-19 | 2021-11-16 | 中国银行股份有限公司 | Intelligent outbound method and device, storage medium and electronic equipment |
CN113656551A (en) * | 2021-08-19 | 2021-11-16 | 中国银行股份有限公司 | Intelligent outbound interruption method and device, storage medium and electronic equipment |
CN113656552A (en) * | 2021-08-19 | 2021-11-16 | 中国银行股份有限公司 | Intelligent outbound interruption recovery method and device, storage medium and electronic equipment |
CN115273911A (en) * | 2022-07-18 | 2022-11-01 | 上海湃舵智能科技有限公司 | Voice interruption judgment method, system and terminal |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10311862B2 (en) * | 2015-12-23 | 2019-06-04 | Rovi Guides, Inc. | Systems and methods for conversations with devices about media using interruptions and changes of subjects |
US20180261223A1 (en) * | 2017-03-13 | 2018-09-13 | Amazon Technologies, Inc. | Dialog management and item fulfillment using voice assistant system |
CN108831455A (en) * | 2018-05-25 | 2018-11-16 | 四川斐讯全智信息技术有限公司 | A kind of method and system of intelligent sound box streaming interaction |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102374864A (en) * | 2010-08-13 | 2012-03-14 | 国基电子(上海)有限公司 | Voice navigation equipment and voice navigation method |
CN105704554A (en) * | 2016-01-22 | 2016-06-22 | 广州视睿电子科技有限公司 | Audio playing method and device |
CN107342085A (en) * | 2017-07-24 | 2017-11-10 | 深圳云知声信息技术有限公司 | Method of speech processing and device |
CN107369439A (en) * | 2017-07-31 | 2017-11-21 | 北京捷通华声科技股份有限公司 | A kind of voice awakening method and device |
CN110427460A (en) * | 2019-08-06 | 2019-11-08 | 北京百度网讯科技有限公司 | Method and device for interactive information |
CN110853638A (en) * | 2019-10-23 | 2020-02-28 | 吴杰 | Method and equipment for interrupting voice robot in real time in voice interaction process |
CN110867197A (en) * | 2019-10-23 | 2020-03-06 | 吴杰 | Method and equipment for interrupting voice robot in real time in voice interaction process |
Non-Patent Citations (2)
Title |
---|
Su-Hyun Jin, et al. Interrupted speech perception: The effects of hearing sensitivity and frequency resolution. The Journal of the Acoustical Society of America, 2010, Vol. 128, No. 2. *
Li Hengting, et al. Design and implementation of the audio output simulation module of the SkyEye simulator. Journal of Xiamen University (Natural Science), 2010, Vol. 49, No. 2. *
Also Published As
Publication number | Publication date |
---|---|
CN111540349A (en) | 2020-08-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111540349B (en) | Voice breaking method and device | |
CN108962233B (en) | Voice conversation processing method and system for voice conversation platform | |
CN110661927B (en) | Voice interaction method and device, computer equipment and storage medium | |
US7392188B2 (en) | System and method enabling acoustic barge-in | |
US10074371B1 (en) | Voice control of remote device by disabling wakeword detection | |
CN1220176C (en) | Method for training or adapting to phonetic recognizer | |
US20240071382A1 (en) | Temporary account association with voice-enabled devices | |
CN110557451B (en) | Dialogue interaction processing method and device, electronic equipment and storage medium | |
US9704478B1 (en) | Audio output masking for improved automatic speech recognition | |
US10714085B2 (en) | Temporary account association with voice-enabled devices | |
JP2020525903A (en) | Managing Privilege by Speaking for Voice Assistant System | |
WO2003038804A2 (en) | Non-target barge-in detection | |
WO2015094907A1 (en) | Attribute-based audio channel arbitration | |
US11763819B1 (en) | Audio encryption | |
CN102282610A (en) | Voice conversation device, conversation control method, and conversation control program | |
JP2014191029A (en) | Voice recognition system and method for controlling voice recognition system | |
CN112581938B (en) | Speech breakpoint detection method, device and equipment based on artificial intelligence | |
CN113779208A (en) | Method and device for man-machine conversation | |
WO2021082133A1 (en) | Method for switching between man-machine dialogue modes | |
CN114385800A (en) | Voice conversation method and device | |
CN114328867A (en) | Intelligent interruption method and device in man-machine conversation | |
CN112700767B (en) | Man-machine conversation interruption method and device | |
US10923122B1 (en) | Pausing automatic speech recognition | |
CN110660393B (en) | Voice interaction method, device, equipment and storage medium | |
CN118366458A (en) | Full duplex dialogue system and method, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |