CN113936655A - Voice broadcast processing method and device, computer equipment and storage medium - Google Patents

Voice broadcast processing method and device, computer equipment and storage medium

Info

Publication number
CN113936655A
CN113936655A CN202111115957.7A
Authority
CN
China
Prior art keywords
intention
voice
processing logic
standard
logic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111115957.7A
Other languages
Chinese (zh)
Inventor
毛振苏
徐勇攀
李乾
王诗达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pudong Development Bank Co Ltd filed Critical Shanghai Pudong Development Bank Co Ltd
Priority: CN202111115957.7A
Publication: CN113936655A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 Announcement of recognition results

Abstract

The application relates to a voice broadcast processing method and apparatus, a computer device, and a storage medium. The method comprises the following steps: collecting voice information from a user terminal during a call between a voice robot and the user terminal; performing intention recognition on the voice information to obtain an intention recognition result; obtaining processing logic corresponding to the intention recognition result; and executing the processing logic. By adopting this method, erroneous interruption of the voice broadcast can be avoided.

Description

Voice broadcast processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing a voice broadcast, a computer device, and a storage medium.
Background
With the development of computer technology, intelligent customer service has emerged. One scenario in intelligent customer service is interruption of a voice broadcast, for which there are currently several kinds of solutions:
The first is a half-duplex voice interaction scheme. In this scheme, the user's turn comes only at the final stage of the whole process, i.e., the user can interact only after the voice broadcast has finished. If the user needs to interrupt the broadcast, stop the current flow, and enter the user turn early, a key must be pressed manually, and the system performs interruption processing immediately after receiving the user's key press.
The second is a scheme that detects voice-triggered interruption at the user side. It can receive voice input from the user side while broadcasting, and thus has broadcast interruption capability. Its principle is generally to judge, from the characteristics of the user-side audio signal, whether the signal is a speech signal, using parameters such as energy, zero-crossing rate, entropy, and pitch, together with their derived parameters; this is endpoint detection, also known as Voice Activity Detection (VAD). When speech is detected in the signal stream, the system triggers an interruption: as soon as a voice signal from the user side is detected, the broadcast is interrupted.
The third is a scheme that triggers interruption based on a word-count threshold of the user-side voice stream. During the interaction between the user and the voice robot, the voice robot detects the incoming voice stream sent by the user while outputting voice, counts the number of words in the stream, and executes the interruption operation if the word count exceeds a preset threshold.
However, the half-duplex voice interaction scheme does not match users' expectations of intelligent voice customer service: interrupting the voice broadcast by key press alone cannot truly reflect the user's intention, and users expect to be able to express the will to interrupt by voice, as with a human agent. The scheme that detects user-side voice-triggered interruption cannot distinguish complex voice scenarios. The word-count-threshold scheme does recognize the voice output by the user, but only counts the number of recognized characters; when meaningless utterances or other noise at the user side push the character count over the threshold, the current conversation flow is still interrupted, so the probability of interruption by false trigger remains high.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a voice broadcast processing method, apparatus, computer device, and storage medium capable of improving the accuracy of interruption processing.
A voice broadcast processing method, the method comprising:
acquiring voice information of a user terminal in the process of communicating between a voice robot and the user terminal;
performing intention recognition on the voice information to obtain an intention recognition result;
acquiring processing logic corresponding to the intention recognition result;
the processing logic is executed.
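For illustration only, the four steps above can be sketched as a minimal processing loop. The ASR and intent components below are simple stand-ins (functions and dicts), not the patent's actual modules; all names are hypothetical.

```python
# Minimal sketch of the claimed four-step method; component names are assumptions.

def recognize_speech(audio):          # stand-in for the ASR module
    return audio                      # assume audio already arrives as recognized text

def classify_intent(text, intents):   # stand-in for intention recognition
    for name, phrases in intents.items():
        if text in phrases:
            return name
    return "unknown"

def handle_voice_info(audio, intents, logic_table):
    text = recognize_speech(audio)            # step 1: acquire voice information
    intent = classify_intent(text, intents)   # step 2: intention recognition
    logic = logic_table.get(intent, "continue_broadcast")  # step 3: look up logic
    return logic                              # step 4: caller executes the logic

intents = {"number_error": ["you dialed the wrong number"],
           "water_word": ["okay", "good"]}
logic_table = {"number_error": "interrupt_and_reply",
               "water_word": "continue_broadcast"}

print(handle_voice_info("okay", intents, logic_table))
# continue_broadcast
```

Unmatched input defaults to continuing the broadcast, mirroring the document's goal of avoiding erroneous interruption.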
In one embodiment, the performing intent recognition on the voice information to obtain an intent recognition result includes:
extracting semantic features corresponding to the voice information;
matching the semantic features against standard features to obtain a preset number of standard features whose similarity satisfies a requirement;
and counting the intention classifications corresponding to the preset number of standard features whose similarity satisfies the requirement, as the intention recognition result corresponding to the voice information.
In one embodiment, before the matching the semantic features and the standard features to obtain a preset number of standard features with similarity satisfying the requirement, the method further includes:
receiving an intention configuration instruction, wherein the intention configuration instruction carries a standard text and an intention name corresponding to the standard text;
and performing intention configuration according to the standard text by using an intention name corresponding to the standard text to obtain a standard intention, and generating a standard feature according to the standard text.
In one embodiment, before the receiving the intention configuration instruction, the method further includes:
receiving an intention type selection instruction, and displaying a corresponding intention configuration interface according to the intention type selection instruction;
an intent configuration instruction is received via the intent configuration interface.
In one embodiment, before the obtaining the processing logic corresponding to the intention recognition result, the method further includes:
receiving a session logic configuration instruction;
and configuring and obtaining session logic according to the session logic configuration instruction, wherein the session logic comprises normal processing logic and reference processing logic corresponding to the standard intention.
In one embodiment, the reference processing logic comprises interruption processing logic and an utterance corresponding to the interruption processing logic; the executing the processing logic comprises:
interrupting the current voice broadcast of the voice robot, and then broadcasting the utterance corresponding to the interruption processing logic.
In one embodiment, the reference processing logic comprises non-interrupt processing logic; the executing the processing logic comprises:
and continuing the current voice broadcast of the voice robot.
A voice broadcast processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring voice information of the user terminal in the process of communicating the voice robot with the user terminal;
the recognition module is used for carrying out intention recognition on the voice information to obtain an intention recognition result;
a logic obtaining module for obtaining processing logic corresponding to the intention recognition result;
an execution module to execute the processing logic.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the voice broadcast processing method and apparatus, computer device, and storage medium, voice information from the user terminal is collected during the call between the voice robot and the user terminal, and intention recognition is performed on the voice information to obtain an intention recognition result, so that the processing logic corresponding to the intention recognition result can be queried and the voice broadcast can be processed according to that logic, thereby avoiding erroneous interruption.
Drawings
Fig. 1 is an application environment diagram of a voice broadcast processing method in an embodiment;
fig. 2 is a schematic flow chart of a voice broadcast processing method in one embodiment;
FIG. 3 is a diagram of a complete conversation flow framework in one embodiment;
FIG. 4 is an interface diagram of a semantic level uninterrupted intent configuration in one embodiment;
FIG. 5 is an interface diagram to break an intended configuration in one embodiment;
FIG. 6 is an interface diagram of a configuration of session logic in one embodiment;
FIG. 7 is a flow diagram of interrupt logic for the flow of speech information processing in one embodiment;
FIG. 8 is a schematic diagram of voice message processing flow in one embodiment;
fig. 9 is a block diagram showing the structure of a voice broadcast processing apparatus according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The voice broadcast processing method provided by the application can be applied to the application environment shown in fig. 1. The user terminal 102 communicates with the call center 104 through a network. The call center 104 collects voice information of the user terminal 102 during a call between the voice robot and the user terminal 102, performs intention recognition on the voice information to obtain an intention recognition result, obtains the processing logic corresponding to the intention recognition result, and executes the processing logic, thereby avoiding erroneous interruption. The user terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the call center 104 may be implemented by an independent server or a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a method for processing a voice broadcast is provided, which is described by taking the method as an example for being applied to the call center 104 in fig. 1, and includes the following steps:
s202: and in the process of the communication between the voice robot and the user terminal, acquiring the voice information of the user terminal.
Specifically, the call center can adopt a full-duplex interactive mode, so that the voice robot can receive voice information uploaded by the user terminal in real time while broadcasting voice.
The voice information may be tone words uttered by the user, or replies the user gives to the voice played by the voice robot.
A call between the voice robot and the user terminal is a process in which the voice robot plays voice to the connected user terminal according to a preset speech script. During this process, the user may have doubts or may not want the broadcast to continue, and therefore speaks; the user terminal collects what the user says, i.e., the voice information, and sends it to the call center.
S204: and performing intention recognition on the voice information to obtain an intention recognition result.
Specifically, intention recognition refers to performing text recognition and intention determination on the voice information received by the call center. The call center may transmit the received voice information to a voice information processing module as an audio stream, for example, transmitting the audio stream through MRCP (Media Resource Control Protocol) to an ASR (Automatic Speech Recognition) module to recognize the voice information as a voice text, and then matching the voice text against preset intentions to obtain the intention recognition result.
S206: processing logic corresponding to the intent recognition result is obtained.
Specifically, the processing logic is part of a pre-configured complete conversation flow framework. The call center can be configured in advance according to the conversation to obtain this framework, which includes the normal play flow of the voice robot as well as the intentions added at each conversation stage and the processing corresponding to each intention. Taking the complete conversation flow framework shown in fig. 3 as an example, after the opening remarks there is a wait node (Waiting 2), followed by multiple processing branches, such as "number error". In this embodiment, after the voice robot broadcasts the opening remarks, the call center receives the voice information collected at the user terminal side and recognizes the intention as "number error"; it then directly queries the processing logic corresponding to that intention, i.e., the broadcast can be interrupted. The call center interrupts the current voice broadcast of the voice robot according to the "number error" processing logic and obtains the utterance corresponding to "number error", thereby broadcasting the new utterance.
S208: the processing logic is executed.
Specifically, executing the processing logic means executing the processing logic corresponding to the intention recognition result: when the intention recognition result is interruptible, the corresponding interruption utterance is obtained directly and the new utterance is broadcast; when the intention recognition result is not interruptible, the voice robot continues the current voice broadcast.
It should be noted that, in this embodiment, after obtaining the intention recognition result, when the call center determines that the result is an interruption intention, it generates an interruption identifier. The interruption identifier triggers the call center, on the one hand, to interrupt the voice currently broadcast by the voice robot and, on the other hand, to obtain the reply utterance corresponding to the interruption intention, that is, to obtain the processing logic corresponding to the intention recognition result and execute it. If the intention is a non-interruption intention, the call center directly obtains the processing logic corresponding to the intention recognition result and executes it.
According to the voice broadcast processing method, voice information from the user terminal is collected during the call between the voice robot and the user terminal, and intention recognition is performed on the voice information to obtain an intention recognition result, so that the processing logic corresponding to the intention recognition result can be queried and the voice broadcast can be processed according to that logic, thereby avoiding erroneous interruption.
In one embodiment, performing intent recognition on the voice information to obtain an intent recognition result includes: extracting semantic features corresponding to the voice information; matching the semantic features with the standard features to obtain a preset number of standard features with similarity meeting requirements; and counting intention classifications corresponding to the standard features with the similarity meeting the requirement and with the preset number as intention recognition results corresponding to the voice information.
In order to correctly understand the meaning of the voice input from the user terminal and accurately hit the different intention configurations, the conversation flow framework adopts the KNN (K-nearest-neighbor) algorithm to solve the semantic understanding and classification problem during the conversation.
Specifically, the call center extracts the semantic features corresponding to the voice information through a voice module, so that the voice information and the standard features corresponding to the standard intentions lie in the same feature space. It then selects the preset number of standard features whose similarity satisfies the requirement, counts the intention classifications corresponding to those standard features, and takes the result as the intention recognition result corresponding to the voice information.
In particular, if, among the k standard features most similar to a semantic feature in feature space (i.e., its nearest neighbors), most belong to a certain class, then the semantic feature also belongs to that class, where k is typically an integer no greater than 20. The algorithm proceeds as follows:
First, suppose the standard feature set is defined as $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ is an $n$-dimensional standard feature vector and $y_i \in Y = \{c_1, c_2, \ldots, c_K\}$ is the instance category, with $i = 1, 2, \ldots, N$; the semantic feature to be classified is $x$.
Find the $k$ standard features closest to $x$ in the standard feature set $T$ according to the Euclidean distance, and denote the set of these $k$ standard features as $N_k(x)$, where the Euclidean distance is given by formula (1):

$$d(x, x_i) = \sqrt{\sum_{j=1}^{n} \left(x^{(j)} - x_i^{(j)}\right)^2} \quad (1)$$
The category $y$ to which the instance $x$ belongs is determined according to the majority voting principle:

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, K \quad (2)$$
In formula (2), $I$ is the indicator function:

$$I(y_i = c_j) = \begin{cases} 1, & y_i = c_j \\ 0, & \text{otherwise} \end{cases} \quad (3)$$
in the above embodiment, the recognition method of the intention recognition result is given.
In one embodiment, before matching the semantic features with the standard features to obtain a preset number of standard features with similarity satisfying the requirement, the method further includes: receiving an intention configuration instruction, wherein the intention configuration instruction carries a standard text and an intention name corresponding to the standard text; and performing intention configuration according to the standard text by using an intention name corresponding to the standard text to obtain a standard intention, and generating a standard feature according to the standard text.
In one embodiment, before receiving the intention configuration instruction, the method further includes: receiving an intention type selection instruction, and displaying a corresponding intention configuration interface according to the intention type selection instruction; an intent configuration instruction is received via an intent configuration interface.
Specifically, this embodiment mainly introduces the method of intention configuration. During a voice broadcast, when the user terminal has voice stream input, it must be determined in advance which user expressions should cause an interruption; the corresponding intentions are configured, the content is recognized semantically and matched against the corresponding intention, and the related operation is triggered.
In this embodiment, intention configuration mainly includes semantic-level non-interruption intention configuration and interruption intention configuration.
Referring to fig. 4, fig. 4 is an interface diagram of semantic-level non-interruption intention configuration in one embodiment. In this embodiment, while the intelligent voice customer service is broadcasting the current utterance, if a voice stream input is detected at the user terminal and the result after ASR and semantic recognition is an expression such as "I know", "yes", or "go on", the user has spoken actively but the expression is a tone word or has no concrete meaning, and according to everyday dialogue logic no interruption should be made.
Accordingly, as shown in fig. 4, semantic-level non-interruption intention configurations such as "water word" and "mood word" are established in the system's intention list. A "water word" is a word without concrete meaning, such as "good" or "yes"; a "mood word" is an interjection or filler sound, such as "um" or "uh". Thus, according to the configured intention information, when the system triggers such an intention, the broadcast is not interrupted and instead jumps back to the current utterance broadcast until the next user-side voice stream input.
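The semantic-level non-interruption check can be illustrated with a short sketch. The configured phrase lists below are illustrative stand-ins for the "water word" and "mood word" intention configurations, not the patent's actual data.

```python
# Sketch of the semantic-level non-interruption check; phrase lists are illustrative.
NON_INTERRUPT_INTENTS = {
    "water_word": {"good", "yes", "i know"},   # filler words with no concrete meaning
    "mood_word": {"um", "uh", "oh"},           # interjections / tone words
}

def should_interrupt(recognized_text):
    """Return (interrupt?, matched non-interruption intent or None)."""
    text = recognized_text.strip().lower()
    for intent, phrases in NON_INTERRUPT_INTENTS.items():
        if text in phrases:
            return False, intent   # jump back to the current broadcast
    return True, None              # candidate for interruption handling

print(should_interrupt("um"))
# (False, 'mood_word')
```

Only expressions that miss every configured non-interruption intention proceed to interruption-intention matching, which mirrors the jump-back behavior described above.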
Referring to FIG. 5, FIG. 5 is an interface diagram of interruption intention configuration in one embodiment. In this embodiment, when the system recognizes the voice stream of the user terminal and, after semantic recognition, the user expresses that they have already participated in the related service, or expresses other information that can trigger an interruption intention, the interruption operation is executed, the current broadcast flow ends, and the next speech node is entered, i.e., the speech node corresponding to interruption processing.
The above embodiments give the two types of intention configuration.
In one embodiment, before the obtaining of the processing logic corresponding to the intention recognition result, the method further includes: receiving a session logic configuration instruction; and configuring and obtaining the session logic according to the session logic configuration instruction, wherein the session logic comprises normal processing logic and reference processing logic corresponding to the standard intention.
Specifically, referring to fig. 6, fig. 6 is an interface diagram of the configuration of session logic in one embodiment, in which the call center configures the entire session logic in advance; in other embodiments, the entire session logic may be configured by another server that interacts with the call center. Before a call is initiated, the entire session logic needs to be configured, and when the semantic recognition module 902 recognizes a non-interruption intention or an interruption intention during the call, the other server returns the special interruption event and the reply of the corresponding flow to the call center.
With reference to fig. 6, the user may pre-configure the session logic: for example, for the interruptible node of script 1.2, configure the corresponding interruption reply utterance, such as the reply text in fig. 6 ("cash-in-balance day-day gain version, let me introduce the next one to you"); subsequent voice information collection then continues on the basis of that reply utterance, intention recognition is performed again, and the above process repeats until the whole session ends.
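As an illustration, the pre-configured session logic can be modeled as a table mapping each node to its normal processing logic and its reference (interruption) processing logic with reply utterances. All node names, intents, and utterances below are hypothetical.

```python
# Illustrative sketch of pre-configured session logic; names and utterances
# are assumptions, not the patent's actual configuration.
session_logic = {
    "opening": {
        "normal": "next_node:product_intro",
        "reference": {
            "number_error": {"action": "interrupt",
                             "utterance": "Sorry, let me verify the number."},
            "water_word":   {"action": "no_interrupt"},
        },
    },
}

def configure_node(logic, node, intent, action, utterance=None):
    """Add or update the reference processing logic for one intent at a node."""
    entry = {"action": action}
    if utterance is not None:
        entry["utterance"] = utterance
    logic.setdefault(node, {"normal": None, "reference": {}})
    logic[node]["reference"][intent] = entry
    return logic

configure_node(session_logic, "opening", "not_interested",
               "interrupt", "Understood, may I introduce another option?")
print(session_logic["opening"]["reference"]["not_interested"]["action"])
# interrupt
```

During a call, hitting an intent at a node looks up this table to decide between interrupting with the configured reply and continuing the current broadcast.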
In one embodiment, the reference processing logic comprises interruption processing logic and an utterance corresponding to the interruption processing logic; executing the processing logic comprises: interrupting the current voice broadcast of the voice robot and then broadcasting the utterance corresponding to the interruption processing logic.
In one embodiment, the reference processing logic comprises non-interruption processing logic; executing the processing logic comprises: continuing the current voice broadcast of the voice robot.
With reference to fig. 3, the whole speech-script execution sequence proceeds from top to bottom and from left to right, starting from the opening-remarks node. After the corresponding opening remarks are broadcast, the flow waits for user-side voice stream input; if the semantic recognition result confirms that the user's voice stream is an answer, the next node's utterance is entered, i.e., the related services are briefly introduced, and after the current utterance is broadcast, which next node to enter is determined according to the semantic recognition result of the user's speech.
If a user-side voice stream input is detected while the voice robot is broadcasting the opening remarks, and the result after ASR and semantic recognition is a non-interruption intention such as "okay" or "good", the conversation flow framework performs a node jump-back operation, i.e., jumps back to the original voice broadcast link and continues broadcasting the opening remarks until they finish, while continuing to wait for user-side voice stream input.
Specifically, in order to make the present application more fully understood by those skilled in the art, please refer to fig. 7 and 8, wherein fig. 7 is a flow chart of the interrupt logic of the speech information processing flow in one embodiment, and fig. 8 is a schematic diagram of the speech information processing flow in one embodiment.
In this embodiment, the call center needs to adopt a full-duplex interaction mode, receiving the user-side audio stream in real time while the voice robot broadcasts TTS (Text-To-Speech, synthesizing voice from text); MRCP transmits the audio stream to ASR (Automatic Speech Recognition) to be recognized as a text result.
While transmitting the ASR recognition result to the conversation flow framework, the call center attaches an identifier recording whether the client's speech interrupted the robot's broadcast. The conversation flow framework makes a judgment after receiving the identifier: if the intention expressed by the client matches the semantic interruption logic, the corresponding robot utterance is returned; if the intention expressed by the client does not match expectations, a null value is returned to the call center, and when the call center receives a null value it continues broadcasting the current TTS by default, without interruption.
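The identifier-and-null-reply handling described above can be sketched as follows; the dialog-flow framework is modeled as returning a reply utterance for interruption intents and `None` (the "null value") otherwise. All names are assumptions for illustration.

```python
# Sketch of the call center's handling of the dialog-flow framework's reply;
# intent names and reply texts are illustrative assumptions.

def dialog_flow_decide(intent, interrupt_replies):
    """Return the reply utterance if the intent matches the configured
    semantic interruption logic, else None (the null value)."""
    return interrupt_replies.get(intent)

def call_center_step(intent, interrupt_replies):
    reply = dialog_flow_decide(intent, interrupt_replies)
    if reply is None:
        return "continue current TTS"        # null value: keep broadcasting
    return f"stop TTS, broadcast: {reply}"   # interruption: play the reply

replies = {"number_error": "Sorry, let me confirm your number again."}
print(call_center_step("water_word", replies))
# continue current TTS
```

The default branch on a null value reproduces the behavior above: without a matched interruption intention, the current TTS broadcast continues uninterrupted.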
With reference to fig. 8, while the voice robot broadcasts TTS, the call center receives the audio stream from the user terminal, performs ASR voice recognition on it, and feeds the recognition result back. The call center first determines, through the semantic recognition part, whether an interruption service intention is hit; if so, the dialog flow framework returns the special interruption event and the corresponding reply, and whether to stop playing the previous TTS or to broadcast the reply is determined according to the type of interruption intention.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a voice broadcast processing apparatus including: an acquisition module 901, an identification module 902, a logic acquisition module 903 and an execution module 904, wherein:
the acquisition module 901 is used for acquiring voice information of a user terminal in the process of a call between the voice robot and the user terminal;
the recognition module 902 is configured to perform intent recognition on the voice information to obtain an intent recognition result;
a logic obtaining module 903, configured to obtain processing logic corresponding to the intention recognition result;
an execution module 904 to execute the processing logic.
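The four-module pipeline above can be sketched in a few lines. The class and parameter names here are illustrative, not from the patent; the recognizer and logic table are injected so the sketch stays self-contained:

```python
# Minimal sketch of the four-module pipeline: collect audio, recognise the
# intention (module 902), look up the processing logic for that intention
# (module 903), and execute it (module 904).
from typing import Callable, Dict

class VoiceBroadcastProcessor:
    def __init__(self, recognize: Callable[[bytes], str],
                 logic_table: Dict[str, Callable[[], str]]):
        self.recognize = recognize        # stands in for recognition module 902
        self.logic_table = logic_table    # stands in for logic module 903

    def process(self, audio: bytes) -> str:
        intention = self.recognize(audio)
        # Unknown intentions fall back to continuing the broadcast.
        logic = self.logic_table.get(intention, lambda: "continue_broadcast")
        return logic()

proc = VoiceBroadcastProcessor(
    recognize=lambda audio: "refuse",                 # stub recogniser
    logic_table={"refuse": lambda: "stop_and_answer"},
)
```

The real apparatus would of course feed live audio into a genuine recogniser; the stub merely shows how the modules chain together.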
In one embodiment, the identifying module 902 includes:
the extraction unit is used for extracting semantic features corresponding to the voice information;
the matching unit is used for matching the semantic features with the standard features to obtain a preset number of standard features with similarity meeting the requirement;
and the output unit is used for counting the intention classifications corresponding to the preset number of standard features whose similarity meets the requirement, and taking the result as the intention recognition result corresponding to the voice information.
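The three units above amount to a top-k similarity vote, which can be sketched as follows. The cosine measure, the threshold value, and the function names are assumptions made for the example, not details stated in the patent:

```python
# Hypothetical sketch of the extraction/matching/output units: compare the
# utterance's semantic feature with every standard feature, keep the preset
# number (top-k) whose similarity meets the requirement, and take the majority
# intention classification among them as the recognition result.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def recognize_intent(feature, standard, k=3, threshold=0.5):
    """standard: list of (feature_vector, intention_name) pairs."""
    scored = sorted(((cosine(feature, f), intent) for f, intent in standard),
                    reverse=True)
    # Keep the preset number of standard features meeting the similarity bar.
    top = [intent for sim, intent in scored[:k] if sim >= threshold]
    if not top:
        return None
    # Count intention classifications and return the most common one.
    return Counter(top).most_common(1)[0][0]
```

With standard features grouped per intention, the majority vote makes the result robust to a single noisy match.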
In one embodiment, the identifying module 902 further includes:
the first receiving unit is used for receiving an intention configuration instruction, and the intention configuration instruction carries a standard text and an intention name corresponding to the standard text;
and the configuration unit is used for performing intention configuration according to the standard text and the intention name corresponding to the standard text to obtain a standard intention, and generating a standard feature from the standard text.
In one embodiment, the identifying module 902 further includes:
the second receiving unit is used for receiving the intention type selection instruction and displaying a corresponding intention configuration interface according to the intention type selection instruction;
and the third receiving unit is used for receiving the intention configuration instruction through the intention configuration interface.
In one embodiment, the voice broadcast processing apparatus further includes:
a receiving module, configured to receive a session logic configuration instruction;
and the configuration module is used for configuring and obtaining the session logic according to the session logic configuration instruction, and the session logic comprises normal processing logic and reference processing logic corresponding to the standard intention.
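The session logic produced by the configuration module can be pictured as a small structure pairing the normal flow with per-intention reference logic. The key names and the shape of the instruction are hypothetical, chosen only to make the sketch concrete:

```python
# Illustrative sketch of configuring session logic from a configuration
# instruction: the result pairs the normal processing logic with reference
# processing logic keyed by standard intention.
def configure_session_logic(instruction: dict) -> dict:
    return {
        "normal": instruction["normal_logic"],
        "reference": {  # standard intention -> reference processing logic
            intent: logic
            for intent, logic in instruction["reference_logic"].items()
        },
    }
```

At call time, a hit on a standard intention would select an entry from `reference`, and everything else would fall through to `normal`.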
In one embodiment, the reference processing logic includes interruption processing logic and a speech script corresponding to the interruption processing logic; the execution module is used for interrupting the current voice broadcast of the voice robot and broadcasting the speech script corresponding to the interruption processing logic.
In one embodiment, the reference processing logic includes non-interruption processing logic; the execution module is used for continuing the current voice broadcast of the voice robot.
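The two execution branches above can be sketched as a simple dispatch. The dictionary shape and the returned action strings are assumptions for illustration; a real execution module would drive the TTS engine instead of returning strings:

```python
# Hypothetical sketch of the execution module's two branches: interruption
# processing logic stops the current broadcast and plays its speech script,
# while non-interruption logic lets the current broadcast continue.
def execute_processing_logic(logic: dict) -> str:
    if logic.get("type") == "interrupt":
        # Stop the robot's current TTS and broadcast the configured script.
        return f"stop_tts; play: {logic['script']}"
    # Non-interruption logic: keep the current voice broadcast going.
    return "continue_current_tts"
```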
For specific limitations of the voice broadcast processing apparatus, reference may be made to the limitations on the voice broadcast processing method above, which are not repeated here. All modules in the voice broadcast processing apparatus can be implemented in whole or in part by software, hardware, or a combination of the two. The modules can be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store session processing logic. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a voice broadcast processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: collecting voice information of a user terminal in the process of communicating between the voice robot and the user terminal; performing intention recognition on the voice information to obtain an intention recognition result; acquiring processing logic corresponding to the intention recognition result; the processing logic is executed.
In one embodiment, the intention recognition of the voice information to obtain the intention recognition result, implemented when the processor executes the computer program, includes: extracting semantic features corresponding to the voice information; matching the semantic features with the standard features to obtain a preset number of standard features with similarity meeting the requirement; and counting the intention classifications corresponding to the preset number of standard features whose similarity meets the requirement, as the intention recognition result corresponding to the voice information.
In one embodiment, before the matching of the semantic features and the standard features to obtain a preset number of standard features with similarity satisfying the requirement when the processor executes the computer program, the method further includes: receiving an intention configuration instruction, wherein the intention configuration instruction carries a standard text and an intention name corresponding to the standard text; and performing intention configuration according to the standard text by using an intention name corresponding to the standard text to obtain a standard intention, and generating a standard feature according to the standard text.
In one embodiment, when the processor executes the computer program, before receiving the intention configuration instruction, the method further includes: receiving an intention type selection instruction, displaying a corresponding intention configuration interface according to the intention type selection instruction, and receiving the intention configuration instruction through the intention configuration interface.
In one embodiment, when the processor executes the computer program, before acquiring the processing logic corresponding to the intention recognition result, the method further includes: receiving a session logic configuration instruction; and configuring the session logic according to the session logic configuration instruction, the session logic including normal processing logic and reference processing logic corresponding to the standard intention.
In one embodiment, the reference processing logic involved when the processor executes the computer program includes interruption processing logic and a speech script corresponding to the interruption processing logic; the execution of the processing logic, implemented when the processor executes the computer program, includes: interrupting the current voice broadcast of the voice robot and broadcasting the speech script corresponding to the interruption processing logic.
In one embodiment, the reference processing logic involved when the processor executes the computer program includes non-interruption processing logic; the execution of the processing logic, implemented when the processor executes the computer program, includes: continuing the current voice broadcast of the voice robot.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: collecting voice information of a user terminal in the process of communicating between the voice robot and the user terminal; performing intention recognition on the voice information to obtain an intention recognition result; acquiring processing logic corresponding to the intention recognition result; the processing logic is executed.
In one embodiment, the intention recognition of the voice information to obtain the intention recognition result, implemented when the computer program is executed by the processor, includes: extracting semantic features corresponding to the voice information; matching the semantic features with the standard features to obtain a preset number of standard features with similarity meeting the requirement; and counting the intention classifications corresponding to the preset number of standard features whose similarity meets the requirement, as the intention recognition result corresponding to the voice information.
In one embodiment, before the matching of the semantic features with the standard features to obtain a preset number of standard features with similarity satisfying the requirement, when the computer program is executed by the processor, the method further includes: receiving an intention configuration instruction, wherein the intention configuration instruction carries a standard text and an intention name corresponding to the standard text; and performing intention configuration according to the standard text by using an intention name corresponding to the standard text to obtain a standard intention, and generating a standard feature according to the standard text.
In one embodiment, when the computer program is executed by the processor, before receiving the intention configuration instruction, the method further includes: receiving an intention type selection instruction, displaying a corresponding intention configuration interface according to the intention type selection instruction, and receiving the intention configuration instruction through the intention configuration interface.
In one embodiment, when the computer program is executed by the processor, before acquiring the processing logic corresponding to the intention recognition result, the method further includes: receiving a session logic configuration instruction; and configuring the session logic according to the session logic configuration instruction, the session logic including normal processing logic and reference processing logic corresponding to the standard intention.
In one embodiment, the reference processing logic involved when the computer program is executed by the processor includes interruption processing logic and a speech script corresponding to the interruption processing logic; the execution of the processing logic, implemented when the computer program is executed by the processor, includes: interrupting the current voice broadcast of the voice robot and broadcasting the speech script corresponding to the interruption processing logic.
In one embodiment, the reference processing logic involved when the computer program is executed by the processor includes non-interruption processing logic; the execution of the processing logic, implemented when the computer program is executed by the processor, includes: continuing the current voice broadcast of the voice robot.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A voice broadcast processing method, the method comprising:
acquiring voice information of a user terminal in the process of communicating between a voice robot and the user terminal;
performing intention recognition on the voice information to obtain an intention recognition result;
acquiring processing logic corresponding to the intention recognition result;
the processing logic is executed.
2. The method of claim 1, wherein the performing intent recognition on the voice information to obtain an intent recognition result comprises:
extracting semantic features corresponding to the voice information;
matching the semantic features with standard features to obtain a preset number of standard features with similarity meeting requirements;
and counting the intention classifications corresponding to the preset number of standard features whose similarity meets the requirement, as the intention recognition result corresponding to the voice information.
3. The method according to claim 2, wherein before the matching the semantic features with the standard features to obtain a preset number of standard features with similarity satisfying the requirement, the method further comprises:
receiving an intention configuration instruction, wherein the intention configuration instruction carries a standard text and an intention name corresponding to the standard text;
and performing intention configuration according to the standard text and the intention name corresponding to the standard text to obtain a standard intention, and generating a standard feature from the standard text.
4. The method of claim 3, wherein prior to receiving the intent configuration instruction, further comprising:
receiving an intention type selection instruction, and displaying a corresponding intention configuration interface according to the intention type selection instruction;
an intent configuration instruction is received via the intent configuration interface.
5. The method of claim 3 or 4, wherein the obtaining processing logic corresponding to the intent recognition result is preceded by:
receiving a session logic configuration instruction;
and configuring and obtaining session logic according to the session logic configuration instruction, wherein the session logic comprises normal processing logic and reference processing logic corresponding to the standard intention.
6. The method of claim 5, wherein the reference processing logic comprises interruption processing logic and a speech script corresponding to the interruption processing logic; the executing the processing logic comprises:
interrupting the current voice broadcast of the voice robot, and broadcasting the speech script corresponding to the interruption processing logic.
7. The method of claim 5, wherein the reference processing logic comprises non-interruption processing logic; the executing the processing logic comprises:
and continuing the current voice broadcast of the voice robot.
8. A voice broadcast processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring voice information of the user terminal in the process of communicating the voice robot with the user terminal;
the recognition module is used for carrying out intention recognition on the voice information to obtain an intention recognition result;
a logic obtaining module for obtaining processing logic corresponding to the intention recognition result;
an execution module to execute the processing logic.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202111115957.7A 2021-09-23 2021-09-23 Voice broadcast processing method and device, computer equipment and storage medium Pending CN113936655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111115957.7A CN113936655A (en) 2021-09-23 2021-09-23 Voice broadcast processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111115957.7A CN113936655A (en) 2021-09-23 2021-09-23 Voice broadcast processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113936655A true CN113936655A (en) 2022-01-14

Family

ID=79276406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111115957.7A Pending CN113936655A (en) 2021-09-23 2021-09-23 Voice broadcast processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113936655A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134466A (en) * 2022-06-07 2022-09-30 马上消费金融股份有限公司 Intention recognition method and device and electronic equipment


Similar Documents

Publication Publication Date Title
EP3611895B1 (en) Method and device for user registration, and electronic device
CN107798032B (en) Method and device for processing response message in self-service voice conversation
US10629186B1 (en) Domain and intent name feature identification and processing
CN110557451B (en) Dialogue interaction processing method and device, electronic equipment and storage medium
US6438520B1 (en) Apparatus, method and system for cross-speaker speech recognition for telecommunication applications
WO2020238209A1 (en) Audio processing method, system and related device
CN111627432B (en) Active outbound intelligent voice robot multilingual interaction method and device
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
CN111429899A (en) Speech response processing method, device, equipment and medium based on artificial intelligence
CN108447471A (en) Audio recognition method and speech recognition equipment
CN109086276B (en) Data translation method, device, terminal and storage medium
CN108899036A (en) A kind of processing method and processing device of voice data
US8868419B2 (en) Generalizing text content summary from speech content
WO2020024620A1 (en) Voice information processing method and device, apparatus, and storage medium
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN110992955A (en) Voice operation method, device, equipment and storage medium of intelligent equipment
CN113779208A (en) Method and device for man-machine conversation
WO2015188454A1 (en) Method and device for quickly accessing ivr menu
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN112530417B (en) Voice signal processing method and device, electronic equipment and storage medium
CN116417003A (en) Voice interaction system, method, electronic device and storage medium
CN113936655A (en) Voice broadcast processing method and device, computer equipment and storage medium
WO2021098318A1 (en) Response method, terminal, and storage medium
CN110838284A (en) Method and device for processing voice recognition result and computer equipment
CN106371905B (en) Application program operation method and device and server

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination