WO2021135548A1 - Voice intention recognition method, device, computer equipment and storage medium - Google Patents

Voice intention recognition method, device, computer equipment and storage medium

Info

Publication number
WO2021135548A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
reply
user
data
current
Prior art date
Application number
PCT/CN2020/123205
Other languages
English (en)
French (fr)
Inventor
叶怡周
马骏
王少军
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021135548A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q30/0281 - Customer communication at a business location, e.g. providing product or service information, consulting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of artificial intelligence speech and semantics, and in particular to a method, device, computer equipment, and storage medium for speech intention recognition.
  • Dialogue management is the core component that controls the interaction between the intelligent customer service robot and the customer.
  • The inventor found that dialogue management mainly relies on an NLU model (natural language understanding model) to understand and judge the intention of the user's speech. However, because the accuracy of converting the customer's voice into text through ASR technology (automatic speech recognition technology) is not high, the NLU model cannot accurately identify the user's intention in a short period of time, which makes dialogue with the intelligent customer service robot time-consuming and inefficient.
  • The embodiments of this application provide a voice intention recognition method, device, computer equipment, and storage medium, aiming to solve the prior-art problem that, in intelligent customer service robot systems, the accuracy of converting customer voice into text through automatic speech recognition technology is not high and the natural language understanding model cannot accurately identify the user's intention in a short period of time, which makes dialogue with the intelligent customer service robot time-consuming and inefficient.
  • an embodiment of the present application provides a voice intention recognition method, which includes:
  • if the current reply voice text data includes affirmative reply keywords or negative reply keywords, calling the locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed speech model; and
  • if it is detected that the user's first transaction processing voice data is received, obtaining the corresponding target NLP model through the target NLP model code, and recognizing the first transaction processing voice data through the target NLP model to obtain the corresponding first recognition result.
  • an embodiment of the present application provides a voice intention recognition device, which includes:
  • the first voice recognition unit is configured to perform voice recognition on the user's initial voice data if the user's initial voice data sent by the user terminal is received, to obtain the initial voice text data corresponding to the user's initial voice data;
  • the current reply voice obtaining unit is used to obtain the current reply text corresponding to the initial voice text data by calling a pre-trained natural language understanding model, to generate the current reply voice corresponding to the current reply text by calling the pre-trained natural language generation model, and to send the current reply voice to the user terminal;
  • the second voice recognition unit is configured to, if receiving user reply voice data corresponding to the current reply voice sent by the user terminal, perform voice recognition on the user reply voice data to obtain the corresponding current reply voice text data;
  • the keyword judgment unit is used to judge whether the current reply voice text data includes positive reply keywords, negative reply keywords, or transfer-to-manual-service keywords;
  • the target word slot acquiring unit is configured to call a locally stored target word slot corresponding to the current reply text if the current reply voice text data includes a positive reply keyword or a negative reply keyword, wherein the target word slot includes the target word slot name, the target NLP model code, and the target word slot fixed speech model; and
  • the item voice recognition unit is configured to, if it is detected that the user's first item processing voice data is received, obtain the corresponding target NLP model through the target NLP model code, recognize the first item processing voice data through the target NLP model, and obtain the corresponding first recognition result.
  • an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and runnable on the processor, and the processor implements the following steps when executing the computer program:
  • if the current reply voice text data includes affirmative reply keywords or negative reply keywords, calling the locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed speech model; and
  • if it is detected that the user's first transaction processing voice data is received, obtaining the corresponding target NLP model through the target NLP model code, and recognizing the first transaction processing voice data through the target NLP model to obtain the corresponding first recognition result.
  • the embodiments of the present application also provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • if the current reply voice text data includes affirmative reply keywords or negative reply keywords, calling the locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed speech model; and
  • if it is detected that the user's first transaction processing voice data is received, obtaining the corresponding target NLP model through the target NLP model code, and recognizing the first transaction processing voice data through the target NLP model to obtain the corresponding first recognition result.
  • The embodiments of the present application provide a voice intention recognition method, device, computer equipment, and storage medium, including: if the user's initial voice data sent by the user terminal is received, performing voice recognition on the user's initial voice data to obtain the corresponding initial voice text data; obtaining the current reply text corresponding to the initial voice text data by calling the natural language understanding model, generating the current reply voice corresponding to the current reply text by calling the natural language generation model, and sending the current reply voice to the user terminal; if user reply voice data corresponding to the current reply voice is received from the user terminal, performing voice recognition on the user reply voice data to obtain the corresponding current reply voice text data; if the current reply voice text data includes positive reply keywords or negative reply keywords, calling the locally stored target word slot corresponding to the current reply text; and if it is detected that the user's first item processing voice data is received, obtaining the corresponding target NLP model through the target NLP model code, and recognizing the first item processing voice data through the target NLP model to obtain the corresponding first recognition result.
  • FIG. 1 is a schematic diagram of an application scenario of a voice intention recognition method provided by an embodiment of the application
  • FIG. 2 is a schematic flowchart of a voice intention recognition method provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of a sub-flow of a voice intention recognition method provided by an embodiment of this application.
  • FIG. 4 is a schematic block diagram of a voice intention recognition apparatus provided by an embodiment of the application.
  • FIG. 5 is a schematic block diagram of subunits of a voice intention recognition apparatus provided by an embodiment of this application.
  • Fig. 6 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic diagram of an application scenario of a voice intention recognition method provided by an embodiment of this application
  • FIG. 2 is a schematic flowchart of a voice intention recognition method provided by an embodiment of this application. The voice intention recognition method is applied to a server and is executed by application software installed in the server. The method includes steps S110 to S160.
  • The first is the user terminal, an intelligent terminal used by the user (such as a smartphone).
  • The user communicates through the intelligent dialogue system provided jointly by the user terminal and the server to handle specific matters; that is, the user terminal sends the collected user voice to the server.
  • The second is the server.
  • The server handles various matters based on the user voice received from the user terminal, combined with the voice recognition function of the local intelligent dialogue system.
  • the server judges whether the user's initial voice data sent by the user terminal is received.
  • the corresponding scenario is that after the user terminal establishes a connection with the server at this time, the user can communicate with the intelligent dialogue system deployed on the server side.
  • The first voice sent by the intelligent dialogue system on the server side to the user terminal usually includes a welcome message and an inquiry about the type of business to be handled, such as the voice "Welcome to call XXX company; what kind of business do you need to handle?"
  • When the user terminal receives the first voice sent by the server, the user responds according to it; the user terminal then collects the voice the user utters in answer to the first voice to obtain the corresponding user initial voice data.
  • The server recognizes the received user initial voice data to obtain the initial voice text data.
  • step S110 includes:
  • The N-ary model is the N-gram model.
  • The N-gram model is a language model (Language Model, LM).
  • A language model is a probability-based discriminative model: its input is a sentence (a sequence of words) and its output is the probability of that sentence, that is, the joint probability of its words.
  • The N-gram model can also be used for speech-to-text recognition.
  • When the server receives the user's initial voice data sent by the user terminal, it can perform voice recognition on that data by calling the N-ary model to obtain the corresponding initial voice text data.
  • Using the N-ary model for speech recognition improves the accuracy of converting customer speech into text.
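As an illustrative sketch of the joint-probability idea above (a toy example, not the application's actual model or training data), a bigram (N=2) model scores a sentence as the product of conditional word probabilities:

```python
from collections import Counter

# Toy corpus; a real system would be trained on large transcribed-speech data.
corpus = [["i", "want", "to", "check", "my", "card"],
          ["i", "want", "to", "open", "an", "account"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def sentence_prob(words):
    """P(w1..wn) ~ P(w1) * product of P(wi | wi-1) under the bigram assumption."""
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

A speech recognizer can use such scores to prefer the more probable of several acoustically similar candidate transcriptions.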
  • the natural language understanding model is the NLU model (the full name of NLU is Natural Language Understanding).
  • the natural language processing model ie, the NLP model
  • the natural language processing model generally includes a natural language understanding model and a natural language generation model (ie, the NLG model, the full name of NLG is Natural Language Generation).
  • NLU is responsible for understanding content
  • NLG is responsible for generating content.
  • A commonly used natural language understanding model is the Transformer model (a codec model based entirely on the attention mechanism, i.e. a translation model), which uses the encoder-decoder architecture.
  • The specific processing of the Transformer model is as follows: the input sequence first undergoes word embedding (i.e. the input sequence is converted into word vectors), positional encoding is then added, and the result is input to the encoder.
  • The output-side sequence undergoes the same processing as the input sequence and is then input to the decoder; finally, the final output sequence corresponding to the input sequence is obtained.
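The "word embedding plus positional encoding" step above can be sketched numerically; this is the standard sinusoidal encoding from the original Transformer design, shown as a generic illustration rather than code from the application:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding added to the word embeddings."""
    pos = np.arange(seq_len)[:, None]            # token positions
    i = np.arange(d_model)[None, :]              # embedding dimensions
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])         # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])         # odd dimensions: cosine
    return pe

# Encoder input = word embedding of the sequence + positional encoding:
# x = embed(tokens) + positional_encoding(len(tokens), d_model)
```

Because the encoding depends only on position, the encoder can distinguish word order even though attention itself is order-agnostic.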
  • After the current reply text is obtained, the intelligent dialogue system needs to convert it into voice data and send it to the user terminal as the current reply voice. For example, when the user tells the intelligent dialogue system "the automatic deduction failed when my bank card repaid the credit card", the intelligent dialogue system asks the user "have you activated the automatic repayment function?".
  • the natural language understanding model and the natural language generation model in step S120 are both stored in a blockchain network.
  • The corresponding summary information is obtained based on the natural language understanding model and the natural language generation model.
  • The summary information is obtained by hashing the natural language understanding model and the natural language generation model, for example using the sha256 algorithm.
  • Uploading summary information to the blockchain can ensure its security and fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain to verify whether the natural language understanding model and the natural language generation model have been tampered with.
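The digest-and-verify step can be sketched with Python's standard hashlib; the model bytes below are placeholders, and only the digest (not the model itself) would be compared against the value stored on the chain:

```python
import hashlib

def model_digest(model_bytes: bytes) -> str:
    """Summary information: the sha256 digest of the serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()

# On upload, the digest is recorded on the chain; on download, the user
# equipment recomputes the digest and compares it with the stored one.
stored_digest = model_digest(b"serialized-nlu-model")        # placeholder bytes
tampered = model_digest(b"tampered-model") == stored_digest  # False if modified
```

Any single-byte change to the model bytes yields a different digest, which is what makes tampering detectable.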
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.
  • the natural language understanding model and the natural language generation model called in the server are both stored in the blockchain network to ensure that the model cannot be tampered with.
  • the natural language understanding model and the natural language generation model uploaded by the server as a blockchain node device can both be invoked by other blockchain node devices in the same blockchain network as the server.
  • After the user terminal receives the current reply voice (for example, "have you activated the automatic repayment function?"), the user responds according to it; that is, the user terminal collects the user reply voice data corresponding to the current reply voice and sends it to the server.
  • The server can still perform voice recognition on the user's reply voice data through the N-ary model to obtain the corresponding current reply voice text data.
  • S140: Determine whether the current reply voice text data includes positive reply keywords, negative reply keywords, or transfer-to-manual-service keywords.
  • In this step, the server determines whether the current reply voice text data includes a positive reply keyword (specifically, "yes"), a negative reply keyword (specifically, "no"), or a transfer-to-manual-service keyword. Once the current reply voice text data includes one of these three types of keywords, the corresponding processing flow is executed; if it includes none of them, a corresponding processing flow is likewise executed.
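A minimal keyword-routing sketch of this judgment follows; the keyword lists are invented placeholders, since a deployed system would load scenario-specific lists:

```python
# Hypothetical keyword lists for the three reply types described above.
POSITIVE = {"yes", "sure", "ok"}
NEGATIVE = {"no", "not needed"}
MANUAL = {"human", "agent", "manual service"}

def classify_reply(text: str) -> str:
    """Route the recognized reply text to one of four processing flows."""
    t = text.lower()
    if any(k in t for k in MANUAL):
        return "manual"        # transfer to an idle agent terminal
    if any(k in t for k in POSITIVE):
        return "positive"      # call the target word slot
    if any(k in t for k in NEGATIVE):
        return "negative"      # call the target word slot
    return "none"              # fall back to item-flow matching
```

Each return value corresponds to one of the branches the description walks through below (word-slot handling, manual transfer, or item-flow matching).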
  • If the current reply voice text data includes affirmative reply keywords or negative reply keywords, the locally stored target word slot corresponding to the current reply text is called; the target word slot includes the target word slot name, the target NLP model code, and the target word slot fixed speech model.
  • When it is determined that the current reply voice text data includes a positive or negative reply keyword, it means that the user has made a positive or negative reply to the current reply voice and has entered the normal process of handling the current matter.
  • At this time, in order to improve the recognition efficiency of the user's subsequent dialogue, the locally stored target word slot corresponding to the current reply text can be called.
  • the target word slot includes the target NLP model code corresponding to the NLP model used by the intelligent dialogue system in the next conversation with the user, and the target word slot fixed speech model used.
  • The target word slot fixed speech model sets the script the intelligent dialogue system will use in the next conversation with the user, for example: "You have not activated our bank's automatic repayment function. If you want to activate it, please say 'yes'; if you do not need to activate it, please say 'return'." Since the target NLP model is called at this point to recognize and convert the user's subsequent speech, and it is a model trained for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent. Moreover, since the fixed speech model is set in the target word slot, users can be guided by it to complete matters more quickly, improving the processing efficiency of each user's required matters.
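The three fields of the target word slot can be pictured as a small record; the field names, registry key, and model code below are illustrative assumptions, not the application's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class WordSlot:
    """Target word slot as described above (field names are illustrative)."""
    slot_name: str        # target word slot name
    nlp_model_code: str   # key used to load the scenario-specific NLP model
    fixed_speech: str     # fixed script the dialogue system speaks next

# Hypothetical local registry mapping a reply text to its stored slot.
SLOT_REGISTRY = {
    "auto_repayment_check": WordSlot(
        slot_name="auto_repayment",
        nlp_model_code="NLP-CREDIT-001",
        fixed_speech=('You have not activated automatic repayment. '
                      'Say "yes" to activate it, or "return" to go back.'),
    ),
}

def load_target_slot(reply_key: str) -> WordSlot:
    """Call the locally stored target word slot for the current reply text."""
    return SLOT_REGISTRY[reply_key]
```

The `nlp_model_code` field is what the next step uses to fetch the scenario-specific model, while `fixed_speech` supplies the guiding script.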
  • If it is detected that the user's first item handling voice data is received, the corresponding target NLP model is obtained through the target NLP model code, and the first item handling voice data is recognized through the target NLP model to obtain the corresponding first recognition result.
  • Since the corresponding target NLP model code is set in the target word slot, the server locally obtains the corresponding target NLP model according to that code, recognizes the first item processing voice data through the target NLP model, and obtains the corresponding first recognition result. Because the target NLP model obtained in this targeted way is trained for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intentions.
  • the method further includes:
  • If the current reply voice text data includes the transfer-to-manual-service keyword, a connection request for an agent terminal whose current status is idle is sent to the user terminal.
  • If the user terminal chooses to accept the connection with the agent terminal, the manual service provided by the agent terminal can assist the user in completing the subsequent process.
  • With manual service intervening in the matter flow, users can be assisted in completing matters more quickly.
  • step S160 the method further includes:
  • If the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, or transfer-to-manual-service keywords, the initial voice text data obtained from the user's initial voice data is used to further determine the type of matter the user needs to handle.
  • For example, the above reply includes the two keywords "fixed credit card limit" and "query".
  • The item flow data corresponding to these two keywords is called locally, and the corresponding flow questions are sent to the user terminal one by one in the order of the flow in the item flow data, guiding the user to complete the matter through the self-service process.
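The keyword-to-flow lookup and ordered questioning described above can be sketched as follows; the flow data and question wording are invented to mirror the "fixed credit card limit" + "query" example:

```python
# Hypothetical item-flow data: each matched keyword pair maps to an ordered
# list of flow questions sent to the user terminal one by one.
ITEM_FLOWS = {
    ("fixed credit card limit", "query"): [
        "Please say the last four digits of your credit card.",
        "Do you want to query the current fixed limit or the available limit?",
    ],
}

def next_flow_question(keywords, step):
    """Return the flow question for this step, or None when the flow is done."""
    flow = ITEM_FLOWS.get(tuple(keywords))
    if flow is None or step >= len(flow):
        return None
    return flow[step]
```

Stepping `step` forward after each user answer walks the user through the self-service flow in order.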
  • After step S160 or step S170, or after the step of calling the locally stored item flow data corresponding to the current reply voice text data when the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, or transfer-to-manual-service keywords, the method further includes:
  • the initial voice text data is sent to the silent agent terminal whose current state is idle;
  • At this time, the generation of an unrecognized instruction can be triggered.
  • When the server detects the generated unrecognized instruction, it switches to the silent agent terminal to assist the user in handling matters.
  • The difference between the silent agent terminal and the aforementioned agent terminal is that the silent agent does not communicate with the user by voice; instead, the server converts each utterance of the user into text and displays it on the silent agent terminal's display interface, and the silent agent composes a silent reply text based on the dialogue text and sends it to the server.
  • When the server receives the silent reply text from the silent agent terminal, it converts the silent reply text into the corresponding manual-assistance voice data and sends it to the user terminal; that is, the user is guided to complete the matter through silent manual assistance.
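One turn of the silent-assistance relay described above might look like this sketch; the `asr`, `display_to_agent`, `agent_compose`, and `tts` callables are placeholders for the real ASR, agent-display, agent-input, and text-to-speech components:

```python
# Minimal sketch of the silent-agent relay: the agent never speaks, only types.
def relay_turn(user_voice, asr, display_to_agent, agent_compose, tts):
    """One silent-assistance turn, returning the manual-assistance voice data."""
    user_text = asr(user_voice)            # server converts speech to text
    display_to_agent(user_text)            # shown on the silent agent's screen
    reply_text = agent_compose(user_text)  # agent types the silent reply text
    return tts(reply_text)                 # server converts it to voice data
```

Keeping each component as a separate callable mirrors the division of labor in the description: the server does all speech conversion, and the silent agent only reads and types.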
  • This method realizes the recognition of user intentions in a variety of different ways, improves the accuracy of user's voice-based intention recognition, and reduces the time-consuming process of dialogue with intelligent customer service robots.
  • the embodiment of the present application also provides a voice intention recognition device, and the voice intention recognition device is used to execute any embodiment of the aforementioned voice intention recognition method.
  • FIG. 4 is a schematic block diagram of a voice intention recognition apparatus provided by an embodiment of the present application.
  • the voice intention recognition device 100 can be configured in a server.
  • the voice intention recognition device 100 includes: a first voice recognition unit 110, a current reply voice acquisition unit 120, a second voice recognition unit 130, a keyword judgment unit 140, a target word slot acquisition unit 150, and an item voice Recognition unit 160.
  • the first voice recognition unit 110 is configured to, if the user's initial voice data sent by the user terminal is received, perform voice recognition on the user's initial voice data to obtain the initial voice text data corresponding to the user's initial voice data.
  • the server judges whether the user's initial voice data sent by the user terminal is received.
  • the corresponding scenario is that after the user terminal establishes a connection with the server at this time, the user can communicate with the intelligent dialogue system deployed on the server side.
  • The first voice sent by the intelligent dialogue system on the server side to the user terminal usually includes a welcome message and an inquiry about the type of business to be handled, such as the voice "Welcome to call XXX company; what kind of business do you need to handle?"
  • When the user terminal receives the first voice sent by the server, the user responds according to it; the user terminal then collects the voice the user utters in answer to the first voice to obtain the corresponding user initial voice data.
  • The server recognizes the received user initial voice data to obtain the initial voice text data.
  • the first voice recognition unit 110 is also used to:
  • The N-ary model is the N-gram model.
  • The N-gram model is a language model (Language Model, LM).
  • A language model is a probability-based discriminative model: its input is a sentence (a sequence of words) and its output is the probability of that sentence, that is, the joint probability of its words.
  • The N-gram model can also be used for speech-to-text recognition.
  • When the server receives the user's initial voice data sent by the user terminal, it can perform voice recognition on that data by calling the N-ary model to obtain the corresponding initial voice text data.
  • Using the N-ary model for speech recognition improves the accuracy of converting customer speech into text.
  • the current reply voice obtaining unit 120 is configured to obtain the current reply text corresponding to the initial voice text data by calling a pre-trained natural language understanding model, to generate the current reply voice corresponding to the current reply text by calling the pre-trained natural language generation model, and to send the current reply voice to the user terminal.
  • the natural language understanding model is the NLU model (the full name of NLU is Natural Language Understanding).
  • the natural language processing model ie, the NLP model
  • the natural language processing model generally includes a natural language understanding model and a natural language generation model (ie, the NLG model, the full name of NLG is Natural Language Generation).
  • NLU is responsible for understanding content
  • NLG is responsible for generating content.
  • A commonly used natural language understanding model is the Transformer model (a codec model based entirely on the attention mechanism, i.e. a translation model), which uses the encoder-decoder architecture.
  • The specific processing of the Transformer model is as follows: the input sequence first undergoes word embedding (i.e. the input sequence is converted into word vectors), positional encoding is then added, and the result is input to the encoder.
  • The output-side sequence undergoes the same processing as the input sequence and is then input to the decoder; finally, the final output sequence corresponding to the input sequence is obtained.
  • After the current reply text is obtained, the intelligent dialogue system needs to convert it into voice data and send it to the user terminal as the current reply voice. For example, when the user tells the intelligent dialogue system "the automatic deduction failed when my bank card repaid the credit card", the intelligent dialogue system asks the user "have you activated the automatic repayment function?".
  • the natural language understanding model and the natural language generation model in the current reply voice acquisition unit 120 are both stored in a blockchain network.
  • The corresponding summary information is obtained based on the natural language understanding model and the natural language generation model.
  • The summary information is obtained by hashing the natural language understanding model and the natural language generation model, for example using the sha256 algorithm.
  • Uploading summary information to the blockchain can ensure its security and fairness and transparency to users.
  • the user equipment can download the summary information from the blockchain to verify whether the natural language understanding model and the natural language generation model have been tampered with.
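The digest-and-verify scheme described above can be sketched with Python's standard hashlib (the byte strings below stand in for serialized model files and are purely illustrative):

```python
import hashlib

def model_digest(model_bytes: bytes) -> str:
    """SHA-256 digest of a serialized model; this hex string is the
    summary information that would be uploaded to the blockchain."""
    return hashlib.sha256(model_bytes).hexdigest()

def verify_model(model_bytes: bytes, published_digest: str) -> bool:
    """A user device re-hashes its downloaded copy and compares it against
    the digest fetched from the blockchain to detect tampering."""
    return model_digest(model_bytes) == published_digest
```

Only the fixed-length digest needs to live on chain; the models themselves can be distributed through any channel, since any modification to them changes the digest.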
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.
  • The natural language understanding model and the natural language generation model invoked by the server are both stored in the blockchain network to ensure that the models cannot be tampered with.
  • the natural language understanding model and the natural language generation model uploaded by the server as a blockchain node device can both be invoked by other blockchain node devices in the same blockchain network as the server.
  • the second voice recognition unit 130 is configured to, if receiving user reply voice data corresponding to the current reply voice sent by the user terminal, perform voice recognition on the user reply voice data to obtain the corresponding current reply voice text data.
  • After the user terminal receives the current reply voice (for example, "have you enabled the automatic repayment function?"), the user responds according to it; that is, after the user terminal collects the user reply voice data corresponding to the current reply voice, it sends the user reply voice data to the server. The server can then perform voice recognition on the user reply voice data through the N-gram model to obtain the corresponding current reply voice text data.
  • the keyword judgment unit 140 is configured to judge whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump manual service keyword.
  • The server can determine whether the current reply voice text data includes a positive reply keyword (specifically, "yes"), a negative reply keyword (specifically, "no"), or a jump-to-manual-service keyword. Once the current reply voice text data includes one of these three types of keywords, the corresponding processing flow is executed; if the current reply voice text data does not include any of the three types of keywords, a corresponding processing flow is likewise executed.
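This three-way routing (plus the fallback flow) can be sketched as a small classifier; the keyword vocabularies below are illustrative assumptions, not the system's configured lists:

```python
POSITIVE = {"yes", "ok", "sure"}
NEGATIVE = {"no", "nope"}
MANUAL = {"agent", "human", "manual service"}

def route_reply(reply_text: str) -> str:
    """Map the recognized reply text to one of four processing flows."""
    words = reply_text.lower()
    if any(k in words for k in MANUAL):
        return "manual_service"
    if any(k in words for k in POSITIVE):
        return "positive_flow"
    if any(k in words for k in NEGATIVE):
        return "negative_flow"
    return "self_service_flow"  # none of the three keyword types matched
```

Matching the manual-service keywords first reflects the priority implied here: an explicit request for a human overrides a yes/no reading of the same utterance.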
  • The target word slot acquiring unit 150 is configured to call a locally stored target word slot corresponding to the current reply text if the current reply voice text data includes a positive reply keyword or a negative reply keyword; the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model.
  • When it is determined that the current reply voice text data includes a positive reply keyword or a negative reply keyword, it means that the user has made a positive or negative reply to the current reply voice and has entered the normal process of handling the current matter.
  • At this time, in order to improve the recognition efficiency of the user's subsequent dialogue, the locally stored target word slot corresponding to the current reply text can be called.
  • The target word slot includes the target NLP model code corresponding to the NLP model that the intelligent dialogue system will use in the next conversation with the user, as well as the target word slot fixed-script model to be used.
  • The target word slot fixed-script model sets the script that the intelligent dialogue system will use in the next conversation with the user, for example: "You have not yet activated our bank's automatic repayment function. If you want to activate it, please say 'yes'; if you do not need to activate it, please say 'return'." Since the target NLP model is called at this point to recognize and convert the user's subsequent speech into text, and the target NLP model is trained specifically for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent. Moreover, since the fixed-script model is set in the target word slot, users can be guided to complete their matters more quickly according to it, which improves the data processing efficiency for the matters each user needs to handle.
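The word-slot record described above can be sketched as a small data type with a local lookup (all field names, the model code, and the store key are invented for illustration; the application only specifies that a word slot carries a name, a target NLP model code, and a fixed script):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TargetWordSlot:
    """A locally stored word slot bound to a current reply text."""
    slot_name: str                # target word slot name
    nlp_model_code: str           # code used to look up the target NLP model
    fixed_script: List[str] = field(default_factory=list)  # fixed-script prompts

# Illustrative local store keyed by the current reply text.
SLOT_STORE = {
    "auto_repayment_reply": TargetWordSlot(
        slot_name="auto_repayment",
        nlp_model_code="NLP-CREDIT-001",
        fixed_script=["If you want to activate it, please say 'yes'; "
                      "otherwise please say 'return'."],
    )
}

def call_target_word_slot(reply_key: str) -> Optional[TargetWordSlot]:
    """Fetch the word slot corresponding to the current reply text, if any."""
    return SLOT_STORE.get(reply_key)
```

The model code is then used to retrieve the scenario-specific NLP model, while the fixed script supplies the prompts read to the user in the next turn.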
  • The item voice recognition unit 160 is configured to, if it detects that the user's first item-handling voice data has been received, obtain the corresponding target NLP model from the target NLP model code, and recognize the first item-handling voice data through the target NLP model to obtain the corresponding first recognition result.
  • Since the corresponding target NLP model code is set in the target word slot, the server locally retrieves the corresponding target NLP model according to the target NLP model code, recognizes the first item-handling voice data through that model, and obtains the corresponding first recognition result.
  • Because the target NLP model obtained in this directed way is trained specifically for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent.
  • the voice intention recognition apparatus 100 further includes:
  • the manual service jump unit 170 is configured to, if the current reply voice text data includes a jump manual service keyword, obtain the connection request of the agent terminal whose current status is idle and send it to the user terminal.
  • When it is determined that the current reply voice text data includes a jump-to-manual-service keyword, a connection request for an agent terminal whose current status is idle is obtained and sent to the user terminal.
  • After the user terminal chooses to accept the connection with the agent terminal, the manual service provided by the agent terminal can assist the user in completing the subsequent process.
  • By having manual service intervene in the matter-handling process, users can be assisted in completing their matters more quickly.
  • the voice intention recognition device 100 further includes:
  • The self-service prompting unit is used to call locally stored item flow data corresponding to the current reply voice text data if the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, and jump-to-manual-service keywords.
  • When the current reply voice text data does not include any of these keywords, the initial voice text data obtained from the user terminal's reply of the user initial voice data can be used to further determine the type of matter the user needs to handle.
  • For example, when the user's reply is "I want to check my credit card fixed limit", the reply includes the two keywords "credit card fixed limit" and "query".
  • The item flow data corresponding to these two keywords is then called locally, and the corresponding flow questions are sent to the user terminal one by one according to the order of the flow in the item flow data, to guide the user to complete the matter through the self-service process.
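The ordered flow dispatch can be sketched as follows (the flow records and question texts are invented for illustration; in the system each question would be converted to voice and sent to the user terminal in turn):

```python
# Illustrative item flow data, keyed by the matched keywords.
ITEM_FLOWS = {
    ("credit card fixed limit", "query"): [
        "Please say or enter the last four digits of your card.",
        "Please confirm your identity with your registered phone number.",
        "Your fixed credit limit will now be read out.",
    ],
}

def next_flow_questions(keywords):
    """Return the flow questions for the matched keywords, in flow order."""
    return list(ITEM_FLOWS.get(tuple(keywords), []))
```

An empty result would correspond to the unrecognized case, which is where the silent-agent fallback described next takes over.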
  • the voice intention recognition device 100 further includes:
  • The silent agent terminal communication unit is configured to send the initial voice text data to a silent agent terminal whose current state is idle if an unrecognized instruction corresponding to the item flow data is detected;
  • The silent reply text conversion unit is used to receive silent reply text from the silent agent terminal, convert the silent reply text into corresponding manual-assistance voice data, and send it to the user terminal.
  • If the guidance of the item flow data still fails to lead the user to successfully handle the matter, the generation of an unrecognized instruction can be triggered.
  • When the server detects the generated unrecognized instruction, it means the system can switch to the silent agent terminal to assist the user in handling the matter.
  • The difference between the silent agent terminal and the aforementioned agent terminal is that the silent agent does not communicate with the user through his or her own voice; instead, the server converts each of the user's utterances into text and displays it on the silent agent terminal's interface, and the silent agent composes a script based on the text of the dialogue, converts it into silent reply text, and sends it to the server.
  • When the server receives the silent reply text from the silent agent terminal, it converts the silent reply text into the corresponding manual-assistance voice data and sends it to the user terminal; that is, the user is guided to complete the matter through a form of silent manual assistance.
  • The device recognizes the user's intention in a variety of different ways, improves the accuracy of intention recognition based on the user's voice, and reduces the time consumed in handling matters through dialogue with the intelligent customer service robot.
  • the above-mentioned voice intention recognition apparatus can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 6.
  • FIG. 6 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, it can cause the processor 502 to execute the voice intention recognition method.
  • the processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503.
  • When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the voice intention recognition method.
  • the network interface 505 is used for network communication, such as providing data information transmission.
  • the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory to implement the voice intention recognition method disclosed in the embodiment of the present application.
  • the embodiment of the computer device shown in FIG. 6 does not constitute a limitation on the specific configuration of the computer device.
  • The computer device may include more or fewer components than those shown in the figure, or some components may be combined, or the components may be arranged differently.
  • the computer device may only include a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 6, and will not be repeated here.
  • The processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • In another embodiment of the present application, a computer-readable storage medium is provided.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores a computer program, where the computer program is executed by a processor to implement the voice intention recognition method disclosed in the embodiments of the present application.
  • the disclosed equipment, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other division methods, or units with the same function may be combined into one unit. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method described in each embodiment of the present application.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Mathematical Physics (AREA)
  • Accounting & Taxation (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice intention recognition method and apparatus, a computer device, and a storage medium, relating to the speech-semantics field of artificial intelligence. If user initial voice data is received, it is recognized to obtain initial voice text data; an NLU model is called to obtain the corresponding current reply text, and an NLG model is called to generate the current reply voice from the current reply text; if user reply voice data is received, it is recognized to obtain current reply voice text data; if the current reply voice text data includes a positive reply keyword or a negative reply keyword, the corresponding target word slot is called; the target NLP model is obtained from the target NLP model code, and the first item-handling voice data is recognized through it to obtain a first recognition result. The method recognizes user intention in a variety of different ways, improves the accuracy of intention recognition based on the user's voice, and reduces the time consumed in handling matters through dialogue with an intelligent customer service robot.

Description

Voice intention recognition method and apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on June 5, 2020, with application number 202010507190.1 and titled "Voice intention recognition method and apparatus, computer device, and storage medium", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the technical field of speech semantics in artificial intelligence, and in particular to a voice intention recognition method and apparatus, a computer device, and a storage medium.
Background
At present, in intelligent customer service robot systems, dialogue management is the core component that controls the interaction between the intelligent customer service robot and the customer. The inventors found that dialogue management mainly relies on an NLU model (natural language understanding model) to understand the user's speech and determine intent. However, because the accuracy of speech-to-text conversion of customer speech through ASR technology (automatic speech recognition) is not high, the NLU model cannot accurately recognize user intent in a short time, so that handling matters through dialogue with the intelligent customer service robot takes a long time and processing efficiency is low.
Summary
Embodiments of this application provide a voice intention recognition method and apparatus, a computer device, and a storage medium, aiming to solve the problem in the prior art that, in intelligent customer service robot systems, customer speech is converted to text through automatic speech recognition technology, and because the conversion accuracy is not high, the natural language understanding model cannot accurately recognize user intent in a short time, so that handling matters through dialogue with the intelligent customer service robot is time-consuming and processing efficiency is low.
In a first aspect, an embodiment of this application provides a voice intention recognition method, including:
if user initial voice data sent by a user terminal is received, performing voice recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
calling a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, calling a pre-trained natural language generation model to generate a current reply voice from the current reply text, and sending the current reply voice to the user terminal;
if user reply voice data corresponding to the current reply voice sent by the user terminal is received, performing voice recognition on the user reply voice data to obtain corresponding current reply voice text data;
judging whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump-to-manual-service keyword;
if the current reply voice text data includes a positive reply keyword or a negative reply keyword, calling a locally stored target word slot corresponding to the current reply text, where the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
if it is detected that the user's first item-handling voice data is received, obtaining a corresponding target NLP model from the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
In a second aspect, an embodiment of this application provides a voice intention recognition apparatus, including:
a first voice recognition unit, configured to, if user initial voice data sent by a user terminal is received, perform voice recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
a current reply voice acquisition unit, configured to call a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, call a pre-trained natural language generation model to generate a current reply voice from the current reply text, and send the current reply voice to the user terminal;
a second voice recognition unit, configured to, if user reply voice data corresponding to the current reply voice sent by the user terminal is received, perform voice recognition on the user reply voice data to obtain corresponding current reply voice text data;
a keyword judgment unit, configured to judge whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump-to-manual-service keyword;
a target word slot acquisition unit, configured to, if the current reply voice text data includes a positive reply keyword or a negative reply keyword, call a locally stored target word slot corresponding to the current reply text, where the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
an item voice recognition unit, configured to, if it is detected that the user's first item-handling voice data is received, obtain a corresponding target NLP model from the target NLP model code, and recognize the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
In a third aspect, an embodiment of this application further provides a computer device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the processor implements the following steps when executing the computer program:
if user initial voice data sent by a user terminal is received, performing voice recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
calling a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, calling a pre-trained natural language generation model to generate a current reply voice from the current reply text, and sending the current reply voice to the user terminal;
if user reply voice data corresponding to the current reply voice sent by the user terminal is received, performing voice recognition on the user reply voice data to obtain corresponding current reply voice text data;
judging whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump-to-manual-service keyword;
if the current reply voice text data includes a positive reply keyword or a negative reply keyword, calling a locally stored target word slot corresponding to the current reply text, where the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
if it is detected that the user's first item-handling voice data is received, obtaining a corresponding target NLP model from the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to perform the following operations:
if user initial voice data sent by a user terminal is received, performing voice recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
calling a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, calling a pre-trained natural language generation model to generate a current reply voice from the current reply text, and sending the current reply voice to the user terminal;
if user reply voice data corresponding to the current reply voice sent by the user terminal is received, performing voice recognition on the user reply voice data to obtain corresponding current reply voice text data;
judging whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump-to-manual-service keyword;
if the current reply voice text data includes a positive reply keyword or a negative reply keyword, calling a locally stored target word slot corresponding to the current reply text, where the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
if it is detected that the user's first item-handling voice data is received, obtaining a corresponding target NLP model from the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
Embodiments of this application provide a voice intention recognition method and apparatus, a computer device, and a storage medium, including: if user initial voice data sent by a user terminal is received, performing voice recognition on the user initial voice data to obtain corresponding initial voice text data; calling a natural language understanding model to obtain a current reply text corresponding to the initial voice text data, calling a natural language generation model to generate a current reply voice from the current reply text, and sending the current reply voice to the user terminal; if user reply voice data corresponding to the current reply voice sent by the user terminal is received, performing voice recognition on the user reply voice data to obtain corresponding current reply voice text data; if the current reply voice text data includes a positive reply keyword or a negative reply keyword, calling a locally stored target word slot corresponding to the current reply text; and if it is detected that the user's first item-handling voice data is received, obtaining a corresponding target NLP model from the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result. The method recognizes user intention in a variety of different ways, improves the accuracy of intention recognition based on the user's voice, and reduces the time consumed in handling matters through dialogue with an intelligent customer service robot.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the voice intention recognition method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the voice intention recognition method provided by an embodiment of this application;
FIG. 3 is a schematic sub-flowchart of the voice intention recognition method provided by an embodiment of this application;
FIG. 4 is a schematic block diagram of the voice intention recognition apparatus provided by an embodiment of this application;
FIG. 5 is a schematic block diagram of subunits of the voice intention recognition apparatus provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of the computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the voice intention recognition method provided by an embodiment of this application, and FIG. 2 is a schematic flowchart of the voice intention recognition method provided by an embodiment of this application. The voice intention recognition method is applied in a server and is executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S110 to S160.
S110: if user initial voice data sent by a user terminal is received, perform voice recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data.
In this embodiment, to understand the technical solution more clearly, the terminals involved in the specific implementation scenario are introduced in detail. This application describes the technical solution from the perspective of the server.
The first is the user terminal. The user terminal is an intelligent terminal used by the user (such as a smartphone). The user can use the user terminal to communicate by voice with the intelligent dialogue system provided by the server, so as to handle specific matters; that is, the user terminal can send the collected user voice to the server.
The second is the server. The server handles various matters based on the received user voice sent by the user terminal, combined with the voice recognition function of the local intelligent dialogue system.
The server judges whether user initial voice data sent by the user terminal is received. The corresponding scenario is that, after a connection between the user terminal and the server is established, the user can communicate with the intelligent dialogue system deployed on the server side. Generally, the first voice segment that the intelligent dialogue system on the server side sends to the user terminal includes a welcome message and a question about the type of business to be handled, for example, voice of the type "Welcome to call XXX company; which business would you like to handle?".
After the user terminal receives the first voice segment sent by the server, the user answers according to it. The user terminal then collects the voice uttered by the user in answer to the first voice segment, obtaining the corresponding user initial voice data. The server receives the user initial voice data and recognizes it to obtain initial voice text data.
In one embodiment, step S110 includes:
performing voice recognition on the user initial voice data by calling a pre-stored N-gram model to obtain the corresponding initial voice text data.
In this embodiment, the N-ary model is the N-gram model. The N-gram model is a language model (Language Model, LM). A language model is a probability-based discriminative model whose input is a sentence (an ordered sequence of words) and whose output is the probability of that sentence, that is, the joint probability of those words. The N-gram model can also be used for speech-to-text recognition.
When the server receives the user initial voice data sent by the user terminal, it can perform voice recognition on the user initial voice data by calling the N-gram model to obtain the corresponding initial voice text data. Performing voice recognition through the N-gram model improves the accuracy of speech-to-text conversion of the customer's voice.
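As a rough illustration of the language-model side of this step, a maximum-likelihood bigram (N = 2) model scores a word sequence by the product of conditional word probabilities; the toy training corpus below is invented purely for illustration:

```python
from collections import Counter

# Toy training corpus: two tokenized sentences.
corpus = [["i", "want", "to", "check", "my", "limit"],
          ["i", "want", "to", "repay", "my", "card"]]

# Count bigrams and their left-context unigrams over the corpus.
bigrams = Counter()
contexts = Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        contexts[w1] += 1

def sentence_prob(words):
    """Joint probability of a word sequence under the ML bigram model
    (zero if any bigram was never observed in training)."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        if contexts[w1] == 0:
            return 0.0
        p *= bigrams[(w1, w2)] / contexts[w1]
    return p
```

An ASR decoder can use such scores to prefer the candidate transcription whose word sequence is most probable, which is how the language model raises speech-to-text accuracy.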
S120: call a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, call a pre-trained natural language generation model to generate a current reply voice from the current reply text, and send the current reply voice to the user terminal.
In this embodiment, the natural language understanding model is the NLU model (NLU stands for Natural Language Understanding). A natural language processing model (NLP model) generally includes a natural language understanding model and a natural language generation model (NLG model; NLG stands for Natural Language Generation). The NLU model is responsible for understanding content, and the NLG model is responsible for generating content. When the user says to the intelligent dialogue system "the automatic deduction failed when my bank card was repaying my credit card", the NLU model is first used to judge the user's intent and understand what the user wants, and the NLG model is then used to say "have you enabled the automatic repayment function?".
A commonly used natural language understanding model is the Transformer model (an encoder-decoder model based entirely on the attention mechanism, originally a translation model), which uses the encoder-decoder architecture. The specific processing of the Transformer model is as follows: the input sequence first undergoes word embedding (that is, the input sequence is converted into word vectors), is then added to the positional encoding, and is then fed into the encoder; the output sequence of the encoder is processed in the same way as the input sequence and fed into the decoder, which finally produces the final output sequence corresponding to the input sequence.
Then, since the final output sequence is text data, the intelligent dialogue system needs to convert it into voice data before sending it to the user terminal; at this time, the current reply voice can be sent to the user terminal. For example, still referring to the above example, when the user says to the intelligent dialogue system "the automatic deduction failed when my bank card was repaying my credit card", the intelligent dialogue system says to the user "have you enabled the automatic repayment function?".
In one embodiment, the natural language understanding model and the natural language generation model in step S120 are both stored in a blockchain network.
In this embodiment, corresponding summary information is obtained based on the natural language understanding model and the natural language generation model; specifically, the summary information is obtained by hashing the natural language understanding model and the natural language generation model, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the summary information from the blockchain to verify whether the natural language understanding model and the natural language generation model have been tampered with.
The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include the underlying blockchain platform, the platform product service layer, the application service layer, and so on.
At this time, the natural language understanding model and the natural language generation model invoked by the server are both stored in the blockchain network to ensure that the models cannot be tampered with. Moreover, the natural language understanding model and the natural language generation model uploaded by the server as a blockchain node device can both be invoked by other blockchain node devices in the same blockchain network as the server.
S130: if user reply voice data corresponding to the current reply voice sent by the user terminal is received, perform voice recognition on the user reply voice data to obtain corresponding current reply voice text data.
In this embodiment, after the user terminal receives the current reply voice (for example, "have you enabled the automatic repayment function?"), the user replies according to it; that is, after collecting the user reply voice data corresponding to the current reply voice, the user terminal sends the user reply voice data to the server. The server can then still perform voice recognition on the user reply voice data through the N-gram model to obtain the corresponding current reply voice text data.
S140: judge whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump-to-manual-service keyword.
In this embodiment, the server can determine whether the current reply voice text data includes a positive reply keyword (specifically, "yes"), a negative reply keyword (specifically, "no"), or a jump-to-manual-service keyword. Once the current reply voice text data includes one of these three types of keywords, the corresponding processing flow is executed; if the current reply voice text data does not include any of the three types of keywords, a corresponding processing flow is likewise executed.
S150: if the current reply voice text data includes a positive reply keyword or a negative reply keyword, call a locally stored target word slot corresponding to the current reply text, where the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model.
In this embodiment, when it is determined that the current reply voice text data includes a positive reply keyword or a negative reply keyword, it means that the user has made a positive or negative reply to the current reply voice and has entered the normal process of handling the current matter. At this time, in order to improve the recognition efficiency of the user's subsequent dialogue, the locally stored target word slot corresponding to the current reply text can be called.
The target word slot includes the target NLP model code corresponding to the NLP model that the intelligent dialogue system will use in the next conversation with the user, as well as the target word slot fixed-script model to be used. The fixed-script model sets the script that the intelligent dialogue system will use in the next conversation with the user, for example: "You have not yet activated our bank's automatic repayment function. If you want to activate it, please say 'yes'; if you do not need to activate it, please say 'return'." Since the target NLP model is called at this point to recognize and convert the user's subsequent speech into text, and the target NLP model is trained specifically for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent. Moreover, since the fixed-script model is set in the target word slot, users can be guided to complete their matters more quickly according to it, which improves the data processing efficiency for the matters each user needs to handle.
S160: if it is detected that the user's first item-handling voice data is received, obtain a corresponding target NLP model from the target NLP model code, and recognize the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
In this embodiment, since the corresponding target NLP model code is set in the target word slot, the server locally retrieves the corresponding target NLP model according to the target NLP model code, recognizes the first item-handling voice data through the target NLP model, and obtains the corresponding first recognition result. Because the target NLP model obtained in this directed way is trained specifically for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent.
In one embodiment, as shown in FIG. 3, the following is further included after step S160:
S170: if the current reply voice text data includes a jump-to-manual-service keyword, obtain a connection request for an agent terminal whose current status is idle and send it to the user terminal.
In this embodiment, when it is determined that the current reply voice text data includes a jump-to-manual-service keyword, it means that the user does not quite understand the current reply voice and may need to be transferred to manual service. At this time, a connection request for an agent terminal whose current status is idle is obtained and sent to the user terminal. After the user terminal chooses to accept the connection with the agent terminal, the manual service provided by the agent terminal can assist the user in completing the subsequent process. By having manual service intervene in the matter-handling process, users can be assisted in completing their matters more quickly.
In one embodiment, the following is further included after step S160:
if the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, and jump-to-manual-service keywords, calling locally stored item flow data corresponding to the current reply voice text data.
In this embodiment, when it is determined that the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, and jump-to-manual-service keywords, it means that the initial voice text data obtained from the user terminal's reply of the user initial voice data can be used to further determine the type of matter the user needs to handle.
For example, when the user's answer to the question "have you enabled the automatic repayment function?" is none of "yes", "no", or a jump to manual service, but is instead "I want to check my credit card fixed limit", the reply includes the two keywords "credit card fixed limit" and "query". In this case, the item flow data corresponding to these two keywords is called locally, and the corresponding flow questions are sent to the user terminal one by one according to the order of the flow in the item flow data, so as to guide the user to complete the matter through the self-service process.
In one embodiment, step S160, or step S170, or the step of, if the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, and jump-to-manual-service keywords, calling locally stored item flow data corresponding to the current reply voice text data, further includes:
if an unrecognized instruction corresponding to the item flow data is detected, sending the initial voice text data to a silent agent terminal whose current state is idle;
receiving silent reply text from the silent agent terminal, converting the silent reply text into corresponding manual-assistance voice data, and sending it to the user terminal.
In this embodiment, if the guidance of the item flow data still fails to lead the user to successfully handle the matter, the generation of an unrecognized instruction can be triggered. If the server then detects the generated unrecognized instruction, it means the system can switch to the silent agent terminal to assist the user in handling the matter. At this point the user no longer converses with the intelligent dialogue system and is switched to the silent agent service.
The difference between the silent agent terminal and the aforementioned agent terminal is that the silent agent does not communicate with the user through his or her own voice; instead, the server converts each of the user's utterances into text and displays it on the silent agent terminal's interface, and the silent agent composes a corresponding script based on the text of the dialogue, converts it into silent reply text, and sends it to the server.
When the server receives the silent reply text from the silent agent terminal, it converts the silent reply text into the corresponding manual-assistance voice data and sends it to the user terminal; that is, the user is guided to complete the matter through a form of silent manual assistance.
The method recognizes the user's intention in a variety of different ways, improves the accuracy of intention recognition based on the user's voice, and reduces the time consumed in handling matters through dialogue with the intelligent customer service robot.
An embodiment of this application further provides a voice intention recognition apparatus, which is configured to execute any embodiment of the aforementioned voice intention recognition method. Specifically, please refer to FIG. 4, which is a schematic block diagram of the voice intention recognition apparatus provided by an embodiment of this application. The voice intention recognition apparatus 100 can be configured in a server.
As shown in FIG. 4, the voice intention recognition apparatus 100 includes: a first voice recognition unit 110, a current reply voice acquisition unit 120, a second voice recognition unit 130, a keyword judgment unit 140, a target word slot acquisition unit 150, and an item voice recognition unit 160.
The first voice recognition unit 110 is configured to, if user initial voice data sent by a user terminal is received, perform voice recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data.
In this embodiment, the server judges whether user initial voice data sent by the user terminal is received. The corresponding scenario is that, after a connection between the user terminal and the server is established, the user can communicate with the intelligent dialogue system deployed on the server side. Generally, the first voice segment that the intelligent dialogue system on the server side sends to the user terminal includes a welcome message and a question about the type of business to be handled, for example, voice of the type "Welcome to call XXX company; which business would you like to handle?".
After the user terminal receives the first voice segment sent by the server, the user answers according to it. The user terminal then collects the voice uttered by the user in answer to the first voice segment, obtaining the corresponding user initial voice data. The server receives the user initial voice data and recognizes it to obtain initial voice text data.
In one embodiment, the first voice recognition unit 110 is further configured to:
perform voice recognition on the user initial voice data by calling a pre-stored N-gram model to obtain the corresponding initial voice text data.
In this embodiment, the N-ary model is the N-gram model. The N-gram model is a language model (Language Model, LM). A language model is a probability-based discriminative model whose input is a sentence (an ordered sequence of words) and whose output is the probability of that sentence, that is, the joint probability of those words. The N-gram model can also be used for speech-to-text recognition.
When the server receives the user initial voice data sent by the user terminal, it can perform voice recognition on the user initial voice data by calling the N-gram model to obtain the corresponding initial voice text data. Performing voice recognition through the N-gram model improves the accuracy of speech-to-text conversion of the customer's voice.
The current reply voice acquisition unit 120 is configured to call a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, call a pre-trained natural language generation model to generate a current reply voice from the current reply text, and send the current reply voice to the user terminal.
In this embodiment, the natural language understanding model is the NLU model (NLU stands for Natural Language Understanding). A natural language processing model (NLP model) generally includes a natural language understanding model and a natural language generation model (NLG model; NLG stands for Natural Language Generation). The NLU model is responsible for understanding content, and the NLG model is responsible for generating content. When the user says to the intelligent dialogue system "the automatic deduction failed when my bank card was repaying my credit card", the NLU model is first used to judge the user's intent and understand what the user wants, and the NLG model is then used to say "have you enabled the automatic repayment function?".
A commonly used natural language understanding model is the Transformer model (an encoder-decoder model based entirely on the attention mechanism, originally a translation model), which uses the encoder-decoder architecture. The specific processing of the Transformer model is as follows: the input sequence first undergoes word embedding (that is, the input sequence is converted into word vectors), is then added to the positional encoding, and is then fed into the encoder; the output sequence of the encoder is processed in the same way as the input sequence and fed into the decoder, which finally produces the final output sequence corresponding to the input sequence.
Then, since the final output sequence is text data, the intelligent dialogue system needs to convert it into voice data before sending it to the user terminal; at this time, the current reply voice can be sent to the user terminal. For example, still referring to the above example, when the user says to the intelligent dialogue system "the automatic deduction failed when my bank card was repaying my credit card", the intelligent dialogue system says to the user "have you enabled the automatic repayment function?".
In one embodiment, the natural language understanding model and the natural language generation model in the current reply voice acquisition unit 120 are both stored in a blockchain network.
In this embodiment, corresponding summary information is obtained based on the natural language understanding model and the natural language generation model; specifically, the summary information is obtained by hashing the natural language understanding model and the natural language generation model, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the summary information from the blockchain to verify whether the natural language understanding model and the natural language generation model have been tampered with.
The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include the underlying blockchain platform, the platform product service layer, the application service layer, and so on.
At this time, the natural language understanding model and the natural language generation model invoked by the server are both stored in the blockchain network to ensure that the models cannot be tampered with. Moreover, the natural language understanding model and the natural language generation model uploaded by the server as a blockchain node device can both be invoked by other blockchain node devices in the same blockchain network as the server.
The second voice recognition unit 130 is configured to, if user reply voice data corresponding to the current reply voice sent by the user terminal is received, perform voice recognition on the user reply voice data to obtain corresponding current reply voice text data.
In this embodiment, after the user terminal receives the current reply voice (for example, "have you enabled the automatic repayment function?"), the user replies according to it; that is, after collecting the user reply voice data corresponding to the current reply voice, the user terminal sends the user reply voice data to the server. The server can then still perform voice recognition on the user reply voice data through the N-gram model to obtain the corresponding current reply voice text data.
The keyword judgment unit 140 is configured to judge whether the current reply voice text data includes a positive reply keyword, a negative reply keyword, or a jump-to-manual-service keyword.
In this embodiment, the server can determine whether the current reply voice text data includes a positive reply keyword (specifically, "yes"), a negative reply keyword (specifically, "no"), or a jump-to-manual-service keyword. Once the current reply voice text data includes one of these three types of keywords, the corresponding processing flow is executed; if the current reply voice text data does not include any of the three types of keywords, a corresponding processing flow is likewise executed.
The target word slot acquisition unit 150 is configured to, if the current reply voice text data includes a positive reply keyword or a negative reply keyword, call a locally stored target word slot corresponding to the current reply text, where the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model.
In this embodiment, when it is determined that the current reply voice text data includes a positive reply keyword or a negative reply keyword, it means that the user has made a positive or negative reply to the current reply voice and has entered the normal process of handling the current matter. At this time, in order to improve the recognition efficiency of the user's subsequent dialogue, the locally stored target word slot corresponding to the current reply text can be called.
The target word slot includes the target NLP model code corresponding to the NLP model that the intelligent dialogue system will use in the next conversation with the user, as well as the target word slot fixed-script model to be used. The fixed-script model sets the script that the intelligent dialogue system will use in the next conversation with the user, for example: "You have not yet activated our bank's automatic repayment function. If you want to activate it, please say 'yes'; if you do not need to activate it, please say 'return'." Since the target NLP model is called at this point to recognize and convert the user's subsequent speech into text, and the target NLP model is trained specifically for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent. Moreover, since the fixed-script model is set in the target word slot, users can be guided to complete their matters more quickly according to it, which improves the data processing efficiency for the matters each user needs to handle.
The item voice recognition unit 160 is configured to, if it is detected that the user's first item-handling voice data is received, obtain a corresponding target NLP model from the target NLP model code, and recognize the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
In this embodiment, since the corresponding target NLP model code is set in the target word slot, the server locally retrieves the corresponding target NLP model according to the target NLP model code, recognizes the first item-handling voice data through the target NLP model, and obtains the corresponding first recognition result. Because the target NLP model obtained in this directed way is trained specifically for this kind of dialogue scenario, it achieves a higher recognition rate and a more accurate understanding of user intent.
In one embodiment, as shown in FIG. 5, the voice intention recognition apparatus 100 further includes:
a manual service jump unit 170, configured to, if the current reply voice text data includes a jump-to-manual-service keyword, obtain a connection request for an agent terminal whose current status is idle and send it to the user terminal.
In this embodiment, when it is determined that the current reply voice text data includes a jump-to-manual-service keyword, it means that the user does not quite understand the current reply voice and may need to be transferred to manual service. At this time, a connection request for an agent terminal whose current status is idle is obtained and sent to the user terminal. After the user terminal chooses to accept the connection with the agent terminal, the manual service provided by the agent terminal can assist the user in completing the subsequent process. By having manual service intervene in the matter-handling process, users can be assisted in completing their matters more quickly.
In one embodiment, the voice intention recognition apparatus 100 further includes:
a self-service prompting unit, configured to, if the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, and jump-to-manual-service keywords, call locally stored item flow data corresponding to the current reply voice text data.
In this embodiment, when it is determined that the current reply voice text data does not include any of the positive reply keywords, negative reply keywords, and jump-to-manual-service keywords, it means that the initial voice text data obtained from the user terminal's reply of the user initial voice data can be used to further determine the type of matter the user needs to handle.
For example, when the user's answer to the question "have you enabled the automatic repayment function?" is none of "yes", "no", or a jump to manual service, but is instead "I want to check my credit card fixed limit", the reply includes the two keywords "credit card fixed limit" and "query". In this case, the item flow data corresponding to these two keywords is called locally, and the corresponding flow questions are sent to the user terminal one by one according to the order of the flow in the item flow data, so as to guide the user to complete the matter through the self-service process.
In one embodiment, the voice intention recognition apparatus 100 further includes:
a silent agent terminal communication unit, configured to, if an unrecognized instruction corresponding to the item flow data is detected, send the initial voice text data to a silent agent terminal whose current state is idle;
a silent reply text conversion unit, configured to receive silent reply text from the silent agent terminal, convert the silent reply text into corresponding manual-assistance voice data, and send it to the user terminal.
In this embodiment, if the guidance of the item flow data still fails to lead the user to successfully handle the matter, the generation of an unrecognized instruction can be triggered. If the server then detects the generated unrecognized instruction, it means the system can switch to the silent agent terminal to assist the user in handling the matter. At this point the user no longer converses with the intelligent dialogue system and is switched to the silent agent service.
The difference between the silent agent terminal and the aforementioned agent terminal is that the silent agent does not communicate with the user through his or her own voice; instead, the server converts each of the user's utterances into text and displays it on the silent agent terminal's interface, and the silent agent composes a corresponding script based on the text of the dialogue, converts it into silent reply text, and sends it to the server.
When the server receives the silent reply text from the silent agent terminal, it converts the silent reply text into the corresponding manual-assistance voice data and sends it to the user terminal; that is, the user is guided to complete the matter through a form of silent manual assistance.
The apparatus recognizes the user's intention in a variety of different ways, improves the accuracy of intention recognition based on the user's voice, and reduces the time consumed in handling matters through dialogue with the intelligent customer service robot.
The above voice intention recognition apparatus can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 6.
Please refer to FIG. 6, which is a schematic block diagram of the computer device provided by an embodiment of this application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 6, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to execute the voice intention recognition method.
The processor 502 is used to provide computing and control capabilities to support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the voice intention recognition method.
The network interface 505 is used for network communication, such as providing transmission of data information. A person skilled in the art can understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the computer device 500 to which the solution of this application is applied; the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the voice intention recognition method disclosed in the embodiments of this application.
A person skilled in the art can understand that the embodiment of the computer device shown in FIG. 6 does not constitute a limitation on the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 6 and are not repeated here.
It should be understood that, in the embodiments of this application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In another embodiment of this application, a computer-readable storage medium is provided. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores a computer program, where the computer program, when executed by a processor, implements the voice intention recognition method disclosed in the embodiments of this application.
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的设备、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为逻辑功能划分,实际实现时可以有另外的划分方式,也可以将具有相同功能的单元集合成一个单元,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。
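As illustrative background for the Transformer processing recited in the claims (word embedding, addition with positional encoding, then encoder encoding and decoder decoding), the sinusoidal positional encoding of the standard Transformer can be computed as follows. This is a sketch of the commonly used sin/cos formulation under that assumption, not the patent's own implementation:

```python
import math

def positional_encoding(max_len: int, d_model: int):
    """Standard sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=8)
# Before encoding, each word embedding x[pos] is summed element-wise
# with pe[pos] — the "addition with positional encoding" step.
print(pe[0][:4])  # position 0: sin terms are 0.0, cos terms are 1.0
```

Because the encoding depends only on the position index, the same table can be precomputed once and added to every input sequence.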

Claims (20)

  1. A voice intent recognition method, comprising:
    if user initial voice data sent by a user terminal is received, performing speech recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
    invoking a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, invoking a pre-trained natural language generation model to generate a current reply voice corresponding to the current reply text, and sending the current reply voice to the user terminal;
    if user reply voice data corresponding to the current reply voice and sent by the user terminal is received, performing speech recognition on the user reply voice data to obtain corresponding current reply voice text data;
    determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword;
    if the current reply voice text data includes an affirmative reply keyword or a negative reply keyword, invoking a locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
    if first item-handling voice data of the user is detected as received, obtaining a corresponding target NLP model according to the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
  2. The voice intent recognition method according to claim 1, wherein after the determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword, the method further comprises:
    if the current reply voice text data includes a transfer-to-human-service keyword, acquiring an agent terminal whose current status is idle and sending a connection request to the user terminal.
  3. The voice intent recognition method according to claim 1, wherein after the determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword, the method further comprises:
    if the current reply voice text data includes none of the affirmative reply keyword, the negative reply keyword, and the transfer-to-human-service keyword, invoking locally stored item flow data corresponding to the current reply voice text data.
  4. The voice intent recognition method according to claim 3, further comprising:
    if an unrecognized instruction corresponding to the item flow data is detected, sending the initial voice text data to a silent agent terminal whose current status is idle;
    receiving a silent reply text from the silent agent terminal, converting the silent reply text into corresponding human-assisted voice data, and sending it to the user terminal.
  5. The voice intent recognition method according to claim 1, wherein the performing speech recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data comprises:
    invoking a pre-stored N-gram model to perform speech recognition on the user initial voice data to obtain the corresponding initial voice text data.
  6. The voice intent recognition method according to claim 1, wherein the natural language understanding model and the natural language generation model are both stored in a blockchain network.
  7. The voice intent recognition method according to claim 1, wherein the natural language understanding model is a Transformer model;
    the invoking a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data comprises:
    subjecting the initial voice text data in sequence to word embedding, addition with positional encoding, encoder encoding, and decoder decoding, to obtain the current reply text corresponding to the initial voice text data.
  8. The voice intent recognition method according to claim 1, wherein the performing speech recognition on the user reply voice data to obtain corresponding current reply voice text data comprises:
    performing speech recognition on the user reply voice data through an N-gram model to obtain the corresponding current reply voice text data.
  9. A voice intent recognition apparatus, comprising:
    a first speech recognition unit, configured to, if user initial voice data sent by a user terminal is received, perform speech recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
    a current reply voice acquisition unit, configured to invoke a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, invoke a pre-trained natural language generation model to generate a current reply voice corresponding to the current reply text, and send the current reply voice to the user terminal;
    a second speech recognition unit, configured to, if user reply voice data corresponding to the current reply voice and sent by the user terminal is received, perform speech recognition on the user reply voice data to obtain corresponding current reply voice text data;
    a keyword judgment unit, configured to determine whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword;
    a target word slot acquisition unit, configured to, if the current reply voice text data includes an affirmative reply keyword or a negative reply keyword, invoke a locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
    an item voice recognition unit, configured to, if first item-handling voice data of the user is detected as received, obtain a corresponding target NLP model according to the target NLP model code, and recognize the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    if user initial voice data sent by a user terminal is received, performing speech recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
    invoking a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, invoking a pre-trained natural language generation model to generate a current reply voice corresponding to the current reply text, and sending the current reply voice to the user terminal;
    if user reply voice data corresponding to the current reply voice and sent by the user terminal is received, performing speech recognition on the user reply voice data to obtain corresponding current reply voice text data;
    determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword;
    if the current reply voice text data includes an affirmative reply keyword or a negative reply keyword, invoking a locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
    if first item-handling voice data of the user is detected as received, obtaining a corresponding target NLP model according to the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
  11. The computer device according to claim 10, wherein after the determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword, the following is further performed:
    if the current reply voice text data includes a transfer-to-human-service keyword, acquiring an agent terminal whose current status is idle and sending a connection request to the user terminal.
  12. The computer device according to claim 10, wherein after the determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword, the following is further performed:
    if the current reply voice text data includes none of the affirmative reply keyword, the negative reply keyword, and the transfer-to-human-service keyword, invoking locally stored item flow data corresponding to the current reply voice text data.
  13. The computer device according to claim 12, wherein the following is further performed:
    if an unrecognized instruction corresponding to the item flow data is detected, sending the initial voice text data to a silent agent terminal whose current status is idle;
    receiving a silent reply text from the silent agent terminal, converting the silent reply text into corresponding human-assisted voice data, and sending it to the user terminal.
  14. The computer device according to claim 10, wherein the performing speech recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data comprises:
    invoking a pre-stored N-gram model to perform speech recognition on the user initial voice data to obtain the corresponding initial voice text data.
  15. The computer device according to claim 10, wherein the natural language understanding model and the natural language generation model are both stored in a blockchain network.
  16. The computer device according to claim 10, wherein the natural language understanding model is a Transformer model;
    the invoking a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data comprises:
    subjecting the initial voice text data in sequence to word embedding, addition with positional encoding, encoder encoding, and decoder decoding, to obtain the current reply text corresponding to the initial voice text data.
  17. The computer device according to claim 10, wherein the performing speech recognition on the user reply voice data to obtain corresponding current reply voice text data comprises:
    performing speech recognition on the user reply voice data through an N-gram model to obtain the corresponding current reply voice text data.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    if user initial voice data sent by a user terminal is received, performing speech recognition on the user initial voice data to obtain initial voice text data corresponding to the user initial voice data;
    invoking a pre-trained natural language understanding model to obtain a current reply text corresponding to the initial voice text data, invoking a pre-trained natural language generation model to generate a current reply voice corresponding to the current reply text, and sending the current reply voice to the user terminal;
    if user reply voice data corresponding to the current reply voice and sent by the user terminal is received, performing speech recognition on the user reply voice data to obtain corresponding current reply voice text data;
    determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword;
    if the current reply voice text data includes an affirmative reply keyword or a negative reply keyword, invoking a locally stored target word slot corresponding to the current reply text, wherein the target word slot includes a target word slot name, a target NLP model code, and a target word slot fixed-script model; and
    if first item-handling voice data of the user is detected as received, obtaining a corresponding target NLP model according to the target NLP model code, and recognizing the first item-handling voice data through the target NLP model to obtain a corresponding first recognition result.
  19. The computer-readable storage medium according to claim 18, wherein after the determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword, the following is further performed:
    if the current reply voice text data includes a transfer-to-human-service keyword, acquiring an agent terminal whose current status is idle and sending a connection request to the user terminal.
  20. The computer-readable storage medium according to claim 18, wherein after the determining whether the current reply voice text data includes an affirmative reply keyword, a negative reply keyword, or a transfer-to-human-service keyword, the following is further performed:
    if the current reply voice text data includes none of the affirmative reply keyword, the negative reply keyword, and the transfer-to-human-service keyword, invoking locally stored item flow data corresponding to the current reply voice text data.
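Several of the claims above invoke a pre-stored N-gram model for speech recognition. As illustrative background, an N-gram language model scores candidate transcriptions by the product of conditional word probabilities, letting a recognizer prefer the more probable word sequence. The toy corpus and add-one smoothing below are assumptions for the sketch, not the patent's model:

```python
from collections import Counter

# Toy training corpus (an assumption for illustration only).
corpus = ["i want to check my balance",
          "i want to open an account",
          "please check my balance"]

# Count unigrams and bigrams with a sentence-start marker.
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab = len(unigrams)

def score(sentence: str) -> float:
    """Bigram probability of a sentence with add-one smoothing."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)
    return p

# A decoder would keep the higher-scoring candidate transcript.
a = score("i want to check my balance")
b = score("i want two check my balance")
print(a > b)  # True: the in-corpus wording scores higher
```

In a full recognizer this language-model score is combined with an acoustic-model score; only the language-model side is sketched here.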
PCT/CN2020/123205 2020-06-05 2020-10-23 Voice intent recognition method and apparatus, computer device, and storage medium WO2021135548A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010507190.1A CN111710336B (zh) 2020-06-05 2020-06-05 Voice intent recognition method and apparatus, computer device, and storage medium
CN202010507190.1 2020-06-05

Publications (1)

Publication Number Publication Date
WO2021135548A1 true WO2021135548A1 (zh) 2021-07-08

Family

ID=72539507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/123205 WO2021135548A1 (zh) 2020-06-05 2020-10-23 Voice intent recognition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111710336B (zh)
WO (1) WO2021135548A1 (zh)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642334A (zh) * 2021-08-11 2021-11-12 科大讯飞股份有限公司 Intent recognition method and apparatus, electronic device, and storage medium
CN113689862A (zh) * 2021-08-23 2021-11-23 南京优飞保科信息技术有限公司 Quality inspection method and system for customer service agent voice data
CN113727051A (zh) * 2021-08-31 2021-11-30 深圳市思迪信息技术股份有限公司 Virtual-agent-based two-way video method, system, device, and storage medium
CN113794808A (zh) * 2021-09-01 2021-12-14 北京亿心宜行汽车技术开发服务有限公司 Designated-driving telephone order placement method and system
CN113849604A (zh) * 2021-09-27 2021-12-28 广东纬德信息科技股份有限公司 NLP-based power grid regulation and control method, system, device, and storage medium
CN113988803A (zh) * 2021-10-28 2022-01-28 国网福建省电力有限公司 Human-machine interaction engine for power supply service command supporting visual configuration of service flows
CN114781401A (zh) * 2022-05-06 2022-07-22 马上消费金融股份有限公司 Data processing method, apparatus, device, and storage medium
CN115936011A (zh) * 2022-12-28 2023-04-07 南京易米云通网络科技有限公司 Multi-intent semantic recognition method in intelligent dialogue
CN116450799A (zh) * 2023-06-16 2023-07-18 浪潮智慧科技有限公司 Intelligent dialogue method and device applied to traffic management services
CN116664078A (zh) * 2023-07-24 2023-08-29 杭州所思互连科技有限公司 RPA object recognition method based on semantic feature vectors
CN117149983A (zh) * 2023-10-30 2023-12-01 山东高速信息集团有限公司 Method, apparatus, and device for intelligent dialogue based on expressway services
CN117594038A (zh) * 2024-01-19 2024-02-23 壹药网科技(上海)股份有限公司 Voice service improvement method and system

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111710336B (zh) * 2020-06-05 2023-05-26 平安科技(深圳)有限公司 Voice intent recognition method and apparatus, computer device, and storage medium
CN112365894B (zh) * 2020-11-09 2024-05-17 青岛易蓓教育科技有限公司 AI-based composite voice interaction method and apparatus, and computer device
CN112766990B (zh) * 2021-02-01 2023-07-28 北京邮电大学 Intelligent customer service assistance system and method based on improved multi-turn dialogue
CN113114851B (zh) * 2021-03-24 2022-06-21 北京百度网讯科技有限公司 Intelligent voice reply method and apparatus for incoming calls, electronic device, and storage medium
CN113160817A (zh) * 2021-04-22 2021-07-23 平安科技(深圳)有限公司 Voice interaction method and system based on intent recognition
CN113506573B (zh) * 2021-08-06 2022-03-18 百融云创科技股份有限公司 Method and apparatus for generating reply voice
CN114220432A (zh) * 2021-11-15 2022-03-22 交通运输部南海航海保障中心广州通信中心 Automatic monitoring method and system based on maritime single-sideband voice, and storage medium
CN115643229A (zh) * 2022-09-29 2023-01-24 深圳市毅光信电子有限公司 Call item handling method, apparatus, system, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829744A (zh) * 2018-12-15 2019-05-31 深圳壹账通智能科技有限公司 Consultation method and apparatus based on natural language processing, electronic device, and medium
CN109961780A (zh) * 2017-12-22 2019-07-02 深圳市优必选科技有限公司 Human-machine interaction method, apparatus, server, and storage medium
CN110060663A (zh) * 2019-04-28 2019-07-26 北京云迹科技有限公司 Response service method, apparatus, and system
WO2019207597A1 (en) * 2018-04-23 2019-10-31 Zubair Ahmed System and method of operating open ended interactive voice response in any spoken languages
CN110491383A (zh) * 2019-09-25 2019-11-22 北京声智科技有限公司 Voice interaction method, apparatus, system, storage medium, and processor
CN111710336A (zh) * 2020-06-05 2020-09-25 平安科技(深圳)有限公司 Voice intent recognition method and apparatus, computer device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109688281A (zh) * 2018-12-03 2019-04-26 复旦大学 Intelligent voice interaction method and system
CN109829036A (zh) * 2019-02-12 2019-05-31 浙江核新同花顺网络信息股份有限公司 Dialogue management method and related apparatus
CN109981910B (zh) * 2019-02-22 2021-08-10 中国联合网络通信集团有限公司 Service recommendation method and device
CN110377716B (zh) * 2019-07-23 2022-07-12 百度在线网络技术(北京)有限公司 Dialogue interaction method and apparatus, and computer-readable storage medium
CN110827816A (zh) * 2019-11-08 2020-02-21 杭州依图医疗技术有限公司 Voice instruction recognition method and apparatus, electronic device, and storage medium



Also Published As

Publication number Publication date
CN111710336A (zh) 2020-09-25
CN111710336B (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2021135548A1 (zh) Voice intent recognition method and apparatus, computer device, and storage medium
KR102297394B1 (ko) Automated assistant invocation of an appropriate agent
US7873149B2 (en) Systems and methods for gathering information
US9088652B2 (en) System and method for speech-enabled call routing
US7983399B2 (en) Remote notification system and method and intelligent agent therefor
CN109087639B (zh) Method and apparatus for speech recognition, electronic device, and computer-readable medium
CN106537494B (zh) Speech recognition device and speech recognition method
US20030105634A1 (en) Method for dialog management
WO2016054110A1 (en) Pattern-controlled automated messaging system
US20090088215A1 (en) Method and apparatus for secure electronic business card exchange
EP2206329A2 (en) Method and apparatus for identification of conference call participants
KR20100065317A (ko) Voice-to-text transcription for a personal communication device
WO2009042105A1 (en) Method and apparatus for mapping of conference call participants using positional presence
US8954317B1 (en) Method and apparatus of processing user text input information
US20120020577A1 (en) System and method for efficient unified messaging system support for speech-to-text service
US10897368B2 (en) Integrating an interactive virtual assistant into a meeting environment
US7139717B1 (en) System for dialog management
US11734515B2 (en) Adaptive natural language steganography and watermarking for virtual assistants
JP2024520659A (ja) Method, apparatus, and system for dynamically navigating an interactive communication system
US8085927B2 (en) Interactive voice response system with prioritized call monitoring
CN111711523B (zh) Item handling authority authorization method and apparatus, computer device, and storage medium
CN112786041B (zh) Voice processing method and related device
CN107111613A (zh) Computer-based translation system and method
EP4272111A1 (en) Systems and methods for conducting an automated dialogue
US20240169152A1 (en) Selection system for contextual prediction processing versus classical prediction processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20909387

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20909387

Country of ref document: EP

Kind code of ref document: A1