WO2023097745A1 - Deep learning-based intelligent human-computer interaction method and system, and terminal - Google Patents

Deep learning-based intelligent human-computer interaction method and system, and terminal Download PDF

Info

Publication number
WO2023097745A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
deep learning
language
intonation
Prior art date
Application number
PCT/CN2021/136927
Other languages
French (fr)
Chinese (zh)
Inventor
张庆茂
刘培刚
Original Assignee
山东远联信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东远联信息科技有限公司
Publication of WO2023097745A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00: Commerce
    • G06Q 30/01: Customer relationship services
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63: Querying
    • G06F 16/632: Query formulation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification
    • G10L 17/26: Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of artificial intelligence interaction technology, and in particular to an intelligent interaction method, system and terminal based on deep learning.
  • Artificial intelligence is a branch of computer science that seeks to understand the essence of intelligence and to build intelligent machines that can respond in ways similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, expert systems, and more. Since the birth of artificial intelligence, its theory and technology have matured steadily and its fields of application have kept expanding; it is conceivable that the technological products it brings will one day serve as "containers" of human wisdom. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is not human intelligence, but it can think as humans do and may even surpass human intelligence.
  • In particular, speech recognition and natural language processing are widely used in smart terminals and online customer service in the service industry, for example by operators such as China Mobile, China Unicom and China Telecom, as well as in government service hotlines.
  • Artificial-intelligence dialogue in traditional systems generally relies on a fixed dialogue template: when a user connects, the intelligent customer service uses guiding prompts to steer the user into phrasing the request in templated language, and after identifying the request it gives the corresponding response.
  • Although traditional intelligent customer service can perform basic speech recognition, if the user asks in a dialect or does not phrase the inquiry in the templated language, the intelligent customer service falls into an endless loop, repeatedly asking for the user's needs, which lowers user satisfaction.
  • In a first aspect, an embodiment of the present application provides a deep learning-based intelligent interaction method, including: acquiring voice feature information of an accessing user; inputting the voice feature information into a trained deep learning neural network to determine a response strategy; and answering the user according to the response strategy.
  • With this implementation, the intelligent customer service abandons the traditional templated language during a session and lets the user state the request first. The utterance describing the request is then analyzed to obtain a response strategy, ensuring that the reply addresses the user's request; the user's needs need not be asked for repeatedly, which improves user satisfaction.
  • Acquiring the voice feature information of the accessing user includes: matching the language of the user's voice to determine language information; and determining the semantics and intonation meaning of the voice according to the language information and the corresponding language database.
  • Determining the semantics and intonation meaning of the voice according to the language information and the corresponding language database includes: determining each word in the user's sentence according to the language database and the voiceprint information of the voice; combining the determined words and then performing part-of-speech division to determine the semantics of the user's voice; and determining the user's intonation meaning by combining the voice's intonation with the intonation feature information of the current language.
  • Inputting the speech feature information into the trained deep learning neural network to obtain a response strategy includes: the deep learning neural network determines the user's emotional characteristics according to the intonation meaning; if the emotional characteristics indicate that the user is emotionally stable, a corresponding response utterance is selected from the response database according to the semantics of the user's voice; or, if the emotional characteristics indicate that the user is emotionally anxious, the call is transferred to manual service.
  • If the agents are busy when transferring to a manual agent, a relay intelligent customer service is temporarily established; the relay intelligent customer service imitates the state of a manual-agent connection, and when a manual agent becomes idle the call is switched directly to that agent.
  • In a second aspect, an embodiment of the present application provides a deep learning-based intelligent interaction system, including: an acquisition module for acquiring voice feature information of an accessing user; a determination module for inputting the voice feature information into a trained deep learning neural network to determine a response strategy; and a response module for answering the user according to the response strategy.
  • The acquisition module includes: a first determining unit, configured to match the language of the user's voice and determine language information; and a second determining unit, configured to determine the semantics and intonation meaning of the voice according to the language information and the corresponding language database.
  • The second determining unit includes: a first determining subunit, configured to determine each word in the user's sentence according to the language database and voiceprint information; a second determining subunit, configured to combine the determined words and then perform part-of-speech division to determine the semantics of the user's voice; and a third determining subunit, configured to determine the user's intonation meaning by combining the voice's intonation with the intonation feature information of the current language.
  • The determination module includes: a third determining unit, used by the deep learning neural network to determine the user's emotional characteristics according to the intonation meaning; and a processing unit, configured to select a corresponding response utterance from the response database according to the semantics of the user's voice if the emotional characteristics indicate that the user is emotionally stable, or to transfer the call to manual service if the emotional characteristics indicate that the user is emotionally anxious.
  • In a third aspect, an embodiment of the present application provides a terminal, including: a processor; and a memory for storing computer-executable instructions; when the processor executes the computer-executable instructions, the processor performs the method of the first aspect or any possible implementation of the first aspect, realizing intelligent voice interaction.
  • FIG. 1 is a schematic flow diagram of a deep learning-based intelligent interaction method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of an intelligent interactive system based on deep learning provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a terminal provided by an embodiment of the present application.
  • Fig. 1 is a schematic flowchart of a deep learning-based intelligent interaction method provided by the embodiment of the present application.
  • the deep learning-based intelligent interaction method provided by the embodiment of the present application includes:
  • Intelligent voice interaction in traditional systems generally supports communication only in fixed language types, for example mobile-operator hotlines or public convenience service hotlines.
  • Accessing users are generally required to state their requests in Mandarin, and the intelligent customer service determines the response content based on its analysis of the user's voice.
  • However, if the user speaks non-Mandarin or a language outside the fixed set, the intelligent customer service cannot answer.
  • For the above reasons, after the user's voice is received in this embodiment, the language of the user's voice is first matched to determine the language information; this requires access to databases of multiple languages and to pronunciation databases of local dialects.
  • Once the corresponding language information is matched, the semantics and intonation meaning of the user's voice are determined in combination with the corresponding language database.
  • Evidently, the semantics of the voice capture what the user means, while the intonation meaning captures the tone and mood with which the user speaks.
  • the intonation meaning of the user is determined by combining the intonation of the voice and the intonation feature information of the current language.
  • Determining the intonation meaning of the user's voice is particularly important, because the intonation meaning reveals the user's current emotional characteristics. Taking Mandarin as an example, a user who is emotionally excited or anxious will typically show intonation features such as fast or loud speech. For some languages, however, fast speech and a loud voice are normal intonation features, so the emotion must be judged from other cues.
  • After the semantics and intonation meaning of the user's speech are determined in S101, they are input into the trained deep learning neural network, which first determines the user's emotional characteristics from the intonation meaning. If the emotional characteristics indicate that the user is emotionally stable, a corresponding response utterance is selected from the response database according to the semantics of the user's voice. If the emotional characteristics indicate that the user is emotionally anxious, however, the call is transferred to manual service: at that point, continuing the interaction through the intelligent customer service may fail to resolve the user's request and may even cause dissatisfaction.
  • The corresponding response sentence is retrieved from the response database according to the semantics of the user's voice to answer the user; if a transfer to a human agent is needed, the service is handled manually.
  • If the human agents are busy at this time, a relay intelligent customer service is temporarily set up.
  • The relay intelligent customer service imitates the state of a human-agent connection, and when a human agent becomes idle, the call is switched directly to that agent.
  • The present application also provides an embodiment of a deep learning-based intelligent interaction system.
  • The deep learning-based intelligent interaction system 20 includes: an acquisition module 201, a determination module 202 and a response module 203.
  • the acquiring module 201 is configured to acquire voice feature information of an access user.
  • the determining module 202 is configured to input the speech feature information into the trained deep learning neural network to determine the response strategy.
  • The response module 203 is configured to answer the user according to the response strategy.
  • the acquiring module 201 includes: a first determining unit and a second determining unit.
  • the first determination unit is configured to match the language of the user's voice and determine the language information;
  • the second determination unit is configured to determine the semantics and intonation meaning of the voice according to the language information and the corresponding language library.
  • the second determination unit includes: a first determination subunit, a second determination subunit and a third determination subunit.
  • the first determining subunit is configured to determine each word in the user sentence according to the language library and voiceprint information.
  • the second determining subunit is used to combine the determined words and then perform part-of-speech division to determine the semantics of the user's voice.
  • The third determining subunit is configured to determine the user's intonation meaning by combining the voice's intonation with the intonation feature information of the current language.
  • the determining module 202 includes: a third determining unit and a processing unit.
  • the third determination unit is used for the deep learning neural network to determine the user's emotional characteristics according to the meaning of the intonation.
  • The processing unit is configured to select a corresponding response utterance from the response database according to the semantics of the user's voice if the emotional feature indicates that the user is emotionally stable, or to transfer the call to manual service if the emotional feature indicates that the user is emotionally anxious.
  • The terminal 30 includes: a processor 301, a memory 302 and a communication interface 303.
  • The processor 301, the memory 302 and the communication interface 303 can be connected to each other through a bus; the bus can be divided into an address bus, a data bus, a control bus, and the like.
  • For ease of illustration, only one thick line is used in FIG. 3, but this does not mean that there is only one bus or only one type of bus.
  • The processor 301 usually controls the overall functions of the terminal 30, such as starting the terminal 30 and, after startup, acquiring the voice feature information of the accessing user, inputting the voice feature information into the trained deep learning neural network to determine the response policy, and answering the user according to the response policy.
  • The processor 301 may be a general-purpose processor, for example a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
  • The processor may also be a microcontroller unit (MCU).
  • The processor may also include a hardware chip.
  • the aforementioned hardware chip may be an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD) or a combination thereof.
  • The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or the like.
  • The memory 302 is configured to store computer-executable instructions to support the operation of the terminal 30.
  • The memory 302 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • When the processor 301 and the memory 302 are powered on, the processor 301 reads and executes the computer-executable instructions stored in the memory 302 to complete all or part of the steps of the above embodiments of the deep learning-based intelligent interaction method.
  • the communication interface 303 is used for the terminal 30 to transmit data, such as realizing communication with network devices and servers.
  • the communication interface 303 includes a wired communication interface, and may also include a wireless communication interface.
  • the wired communication interface includes a USB interface, a Micro USB interface, and may also include an Ethernet interface.
  • the wireless communication interface may be a WLAN interface, a cellular network communication interface or a combination thereof.
  • the terminal 30 provided in the embodiment of the present application further includes a power supply component, which provides power for various components of the terminal 30 .
  • Power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to terminal 30 .
  • The terminal 30 may further include a communication component configured to facilitate wired or wireless communication between the terminal 30 and other devices.
  • the terminal 30 can access a wireless network based on communication standards, such as WiFi, 4G or 5G, or a combination thereof.
  • the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component also includes a Near Field Communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • In an exemplary embodiment, the terminal 30 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), or other electronic components.
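By way of illustration, the intonation-based emotion screening described in the embodiments above (fast or loud speech signals anxiety only relative to the detected language's own baseline) could be sketched as follows. The baseline table, threshold multipliers, and feature values here are invented assumptions for the sketch, not values taken from the application.

```python
# Hypothetical sketch of intonation-based emotion screening.
# Fast speech or high loudness signals anxiety only relative to the
# baseline of the detected language/dialect (all values illustrative).

BASELINES = {
    # language: (typical syllables per second, typical loudness on a 0..1 scale)
    "mandarin": (4.0, 0.5),
    "cantonese": (5.5, 0.7),  # a faster, louder delivery is normal here
}

def emotion_from_intonation(language, syllables_per_sec, loudness):
    """Return 'anxious' or 'stable' by comparing against the language baseline."""
    base_rate, base_loud = BASELINES.get(language, (4.0, 0.5))
    if syllables_per_sec > 1.5 * base_rate or loudness > base_loud + 0.3:
        return "anxious"
    return "stable"
```

Note how the same raw features (fast, fairly loud speech) classify as anxious for one language but stable for another, which is the point the description makes about language-specific intonation features.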

Abstract

A deep learning-based intelligent human-computer interaction method and system, and a terminal. The method comprises: acquiring voice feature information of a calling user (S101); inputting the voice feature information into a trained deep learning neural network, and determining a response policy (S102); and responding to the user according to the response policy (S103). In a session with a user, intelligent customer service does not use the traditional template language, and prioritizes letting the user describe their problem. Then, the words of the description of the problem are analyzed and a response policy is obtained, thereby ensuring that the response addresses the user's problem, avoiding repeatedly inquiring about the user's needs, and improving user satisfaction.

Description

An intelligent interaction method, system and terminal based on deep learning

Technical Field
This application relates to the field of artificial intelligence interaction technology, and in particular to an intelligent interaction method, system and terminal based on deep learning.
Background Art
Artificial intelligence is a branch of computer science that seeks to understand the essence of intelligence and to build intelligent machines that can respond in ways similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, expert systems, and more. Since the birth of artificial intelligence, its theory and technology have matured steadily and its fields of application have kept expanding; it is conceivable that the technological products it brings will one day serve as "containers" of human wisdom. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is not human intelligence, but it can think as humans do and may even surpass human intelligence.
In particular, speech recognition and natural language processing are widely used in smart terminals and online customer service in the service industry, for example by operators such as China Mobile, China Unicom and China Telecom, as well as in government service hotlines. Artificial-intelligence dialogue in traditional systems generally relies on a fixed dialogue template: when a user connects, the intelligent customer service uses guiding prompts to steer the user into phrasing the request in templated language, and after identifying the request it gives the corresponding response.
Although traditional intelligent customer service can perform basic speech recognition, if the user asks in a dialect or does not phrase the inquiry in the templated language, the intelligent customer service falls into an endless loop, repeatedly asking for the user's needs, which lowers user satisfaction.
Summary of the Invention
To solve the above technical problems, this application proposes the following technical solutions:
In a first aspect, an embodiment of the present application provides a deep learning-based intelligent interaction method, including: acquiring voice feature information of an accessing user; inputting the voice feature information into a trained deep learning neural network to determine a response strategy; and answering the user according to the response strategy.
With this implementation, the intelligent customer service abandons the traditional templated language during a session and lets the user state the request first. The utterance describing the request is then analyzed to obtain a response strategy, ensuring that the reply addresses the user's request; the user's needs need not be asked for repeatedly, which improves user satisfaction.
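The three-step loop described above (acquire voice features, let a trained network pick a strategy, answer accordingly) can be outlined as below. The application does not specify the speech front end, network, or response database, so `extract_features`, `choose_strategy`, and `respond` are placeholder stand-ins invented for this sketch.

```python
# Illustrative skeleton of the claimed three-step interaction loop.
# All three components are toy stand-ins for unspecified parts of the
# system (speech front end, trained network, response database).

def extract_features(audio):
    """Stand-in for the speech front end: yields semantics plus intonation."""
    return {"semantics": audio["text"], "intonation": audio["tone"]}

def choose_strategy(features):
    """Stand-in for the trained network: routes on the intonation meaning."""
    if features["intonation"] == "anxious":
        return ("transfer_to_human", None)
    return ("auto_reply", features["semantics"])

def respond(strategy, response_db):
    """Answer per the strategy: canned reply, or hand off to a human agent."""
    action, key = strategy
    if action == "transfer_to_human":
        return "Transferring you to a human agent."
    return response_db.get(key, "Could you describe your request?")

def interact(audio, response_db):
    # S101 -> S102 -> S103 in sequence
    return respond(choose_strategy(extract_features(audio)), response_db)
```

The key property of the claimed method survives even in this toy form: the user speaks first, and the strategy is derived from that utterance rather than from a guided template.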
With reference to the first aspect, in a first possible implementation of the first aspect, acquiring the voice feature information of the accessing user includes: matching the language of the user's voice to determine language information; and determining the semantics and intonation meaning of the voice according to the language information and the corresponding language database.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, determining the semantics and intonation meaning of the voice according to the language information and the corresponding language database includes: determining each word in the user's sentence according to the language database and the voiceprint information of the voice; combining the determined words and then performing part-of-speech division to determine the semantics of the user's voice; and determining the user's intonation meaning by combining the voice's intonation with the intonation feature information of the current language.
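The word-combination and part-of-speech step above could look like the following greedy dictionary-based sketch. The tiny lexicon and the longest-match strategy are invented for illustration; the application does not specify how words are combined or tagged.

```python
# Illustrative greedy longest-match combination of recognized characters
# into words, with a toy part-of-speech lexicon (invented entries).

LEXICON = {"话费": "noun", "查询": "verb", "我": "pronoun", "想": "verb"}

def combine_words(chars):
    """Merge per-character recognition output into the longest lexicon words,
    tagging each resulting word with its part of speech."""
    words, i = [], 0
    while i < len(chars):
        # try the longest candidate first, fall back to a single character
        for j in range(len(chars), i, -1):
            cand = "".join(chars[i:j])
            if cand in LEXICON or j == i + 1:
                words.append((cand, LEXICON.get(cand, "unknown")))
                i = j
                break
    return words
```

For the utterance 我想查询话费 ("I want to check my phone bill"), the sketch merges 查 and 询 into the verb 查询 and 话 and 费 into the noun 话费, which is the "combine then divide by part of speech" behavior the implementation describes.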
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, inputting the speech feature information into the trained deep learning neural network to obtain a response strategy includes: the deep learning neural network determines the user's emotional characteristics according to the intonation meaning; if the emotional characteristics indicate that the user is emotionally stable, a corresponding response utterance is selected from the response database according to the semantics of the user's voice; or, if the emotional characteristics indicate that the user is emotionally anxious, the call is transferred to manual service.
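The "select a response utterance by semantics" branch might be modeled as a keyword-overlap lookup against the response database, as sketched below. The sample entries and the overlap scoring are toy assumptions; the application leaves the selection mechanism unspecified.

```python
# Illustrative response selection: pick the database entry whose keywords
# best overlap the user's recognized semantics (entries are invented).

RESPONSES = {
    frozenset({"bill", "query"}): "Your current balance is shown in the app.",
    frozenset({"network", "slow"}): "Let me run a line check for you.",
}

def select_response(semantic_tokens):
    """Return the canned reply with the largest keyword overlap,
    or a clarifying question when nothing matches at all."""
    tokens = set(semantic_tokens)
    best = max(RESPONSES, key=lambda keywords: len(keywords & tokens))
    if not (best & tokens):
        return "Could you describe your request in more detail?"
    return RESPONSES[best]
```

The zero-overlap fallback matters: rather than looping on "please restate your need", a real system would want a single clarifying prompt, which is the satisfaction problem the background section describes.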
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, if all agents are busy when transferring to a manual agent, a relay intelligent customer service is temporarily established; the relay intelligent customer service imitates the state of a manual-agent connection, and when a manual agent becomes idle the call is switched directly to that agent.
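The busy-agent fallback above, where a temporary relay bot holds the caller's place and hands the call to the first agent that frees up, might be modeled as follows. The class and method names are invented for the sketch; the application does not describe the queueing mechanics.

```python
# Illustrative model of the relay ("中转") customer service: when every
# human agent is busy, a placeholder bot holds the session and hands it
# to the first agent that becomes free.
from collections import deque

class Switchboard:
    def __init__(self, agents):
        self.free_agents = deque(agents)
        self.waiting = deque()  # sessions currently held by the relay bot

    def transfer(self, session):
        """Route a session to a free agent, or park it with the relay bot."""
        if self.free_agents:
            return ("human", self.free_agents.popleft())
        self.waiting.append(session)  # relay bot takes over the session
        return ("relay_bot", session)

    def agent_freed(self, agent):
        """A human agent became idle: hand over the oldest held session."""
        if self.waiting:
            return ("human", agent, self.waiting.popleft())
        self.free_agents.append(agent)
        return None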
In a second aspect, an embodiment of the present application provides a deep learning-based intelligent interaction system, including: an acquisition module for acquiring voice feature information of an accessing user; a determination module for inputting the voice feature information into a trained deep learning neural network to determine a response strategy; and a response module for answering the user according to the response strategy.
With reference to the second aspect, in a first possible implementation of the second aspect, the acquisition module includes: a first determining unit, configured to match the language of the user's voice and determine language information; and a second determining unit, configured to determine the semantics and intonation meaning of the voice according to the language information and the corresponding language database.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the second determining unit includes: a first determining subunit, configured to determine each word in the user's sentence according to the language database and voiceprint information; a second determining subunit, configured to combine the determined words and then perform part-of-speech division to determine the semantics of the user's voice; and a third determining subunit, configured to determine the user's intonation meaning by combining the voice's intonation with the intonation feature information of the current language.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the determination module includes: a third determining unit, used by the deep learning neural network to determine the user's emotional characteristics according to the intonation meaning; and a processing unit, configured to select a corresponding response utterance from the response database according to the semantics of the user's voice if the emotional characteristics indicate that the user is emotionally stable, or to transfer the call to manual service if the emotional characteristics indicate that the user is emotionally anxious.
In a third aspect, an embodiment of the present application provides a terminal, including: a processor; and a memory for storing computer-executable instructions; when the processor executes the computer-executable instructions, the processor performs the method of the first aspect or any possible implementation of the first aspect, realizing intelligent voice interaction.
附图说明Description of drawings
图1为本申请实施例提供的一种基于深度学习的智能交互方法的流程示意图;FIG. 1 is a schematic flow diagram of a deep learning-based intelligent interaction method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a deep learning-based intelligent interaction system provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a terminal provided by an embodiment of the present application.
Detailed Description
The solution is described below with reference to the accompanying drawings and specific implementations.
FIG. 1 is a schematic flowchart of a deep learning-based intelligent interaction method provided by an embodiment of the present application. Referring to FIG. 1, the deep learning-based intelligent interaction method provided by the embodiment of the present application includes:
S101: Acquire voice feature information of an accessing user.
Intelligent voice interaction in conventional systems generally supports communication only in a fixed set of languages, for example mobile-operator hotlines or public-service hotlines. Accessing users are typically required to state their requests in Mandarin, and the intelligent customer service determines the response content based on analysis of the user's speech. However, if the user speaks non-Mandarin or a language outside the fixed set, the intelligent customer service cannot respond.
For the above reasons, in the embodiments of the present application, after the user's speech is received, the language of the speech is first matched to determine the language information. To realize this function, databases of multiple languages and pronunciation databases of regional dialects need to be accessed. Once the corresponding language information is matched, the semantics and the intonation meaning of the user's speech are determined in combination with the corresponding language library. Clearly, the semantics capture what the user means, while the intonation meaning captures the tone and mood with which the user speaks.
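The language-matching step above can be sketched as follows. This is an illustrative assumption only: the scoring function, feature sets, and library names are invented for the sketch, since the patent does not specify a concrete matching algorithm.

```python
# Hypothetical sketch of matching a user's speech against per-language
# pronunciation libraries. The overlap score stands in for whatever
# acoustic matching the embodiment actually uses.

def match_language(utterance_features, language_libraries):
    """Return the language/dialect label whose library best matches
    the observed pronunciation features of the utterance."""
    def score(lib):
        # toy score: fraction of the library's features seen in the utterance
        return len(utterance_features & lib) / max(len(lib), 1)
    return max(language_libraries, key=lambda lang: score(language_libraries[lang]))

# illustrative pronunciation libraries
libraries = {
    "mandarin": {"ma", "ni", "hao", "cha"},
    "sichuan_dialect": {"ma", "nyi", "hao", "za"},
}

print(match_language({"ni", "hao", "cha"}, libraries))  # → mandarin
```

In practice the matching would operate on acoustic features rather than symbolic tokens, but the control flow — score every library, pick the best, then hand the chosen library to the semantic stage — is what the paragraph describes.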
In this embodiment, to determine the semantics and intonation meaning of the user's speech, each character in the user's utterance is first identified according to the language library and the voiceprint information; the identified characters are then combined and segmented by part of speech to determine the semantics of the user's speech. When determining the semantics, the characters must be segmented accurately according to the characteristics of the corresponding language, so that the semantics match the meaning the user intends to express. After the semantics of the user's speech are determined, the user's intonation meaning is determined by combining the speech intonation with the intonation feature information of the current language. In this embodiment, determining the meaning of the user's intonation is particularly important, because the intonation meaning reveals the user's current emotional state. Taking Mandarin as an example, if the user is agitated or anxious, the speech will exhibit intonation features such as a fast speaking rate or a loud voice. In some languages, however, fast speech and a loud voice are normal intonation features of the language itself, and the emotional state must then be determined from other cues.
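The two sub-steps above — combining recognized characters into words, then judging intonation against the current language's own baseline — can be sketched as below. The greedy segmenter, the thresholds, and the baseline values are assumptions for illustration, not details from the patent.

```python
# Sketch of the semantic/intonation stage: characters are combined into
# words using the language library, and prosody is compared with the
# language's normal intonation (a fast, loud delivery is normal in some
# languages, as the text notes). All data is illustrative.

def segment(chars, word_library):
    """Greedy longest-match combination of recognized characters into words."""
    words, i = [], 0
    while i < len(chars):
        for j in range(len(chars), i, -1):  # try the longest candidate first
            cand = "".join(chars[i:j])
            if cand in word_library or j == i + 1:
                words.append(cand)
                i = j
                break
    return words

def intonation_meaning(rate, volume, baseline_rate, baseline_volume):
    """Judge tone relative to the current language's baseline prosody."""
    if rate > 1.5 * baseline_rate or volume > 1.3 * baseline_volume:
        return "agitated"
    return "calm"

print(segment(list("查询话费"), {"查询", "话费"}))  # → ['查询', '话费']
print(intonation_meaning(6.5, 60, 4.0, 65))        # fast vs. the baseline
```

The key design point mirrored from the text is that `intonation_meaning` takes per-language baselines as inputs rather than using fixed global thresholds.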
S102: Input the voice feature information into a trained deep learning neural network to determine a response strategy.
After the semantics and intonation meaning of the user's speech are determined in S101, they are input into the trained deep learning neural network, which first determines the user's emotional characteristics from the intonation meaning. If the emotional characteristics indicate that the user is emotionally stable, a corresponding response utterance is selected from the response database according to the semantics of the user's speech. If the emotional characteristics indicate that the user is anxious, however, the call is transferred to a human agent; in that case, interacting with the user through the intelligent customer service might fail to resolve the user's request and could even cause dissatisfaction.
For example, a user raising a request may be in a hurry, as is often the case with complaints. If an artificial-intelligence customer service repeatedly asks "What is your complaint about?", as current systems do, the user will become dissatisfied. If such users are instead transferred directly to a human agent, the agent can provide targeted, personalized service and resolve the user's request to the greatest extent possible.
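The branching in S102 can be sketched as a small decision function. The response database entries and the fallback utterance are invented for illustration; in the embodiment, the emotion label would come from the trained network rather than being passed in directly.

```python
# Minimal sketch of the S102 response strategy, with a stand-in emotion
# label in place of the patent's trained deep learning network.

RESPONSE_DB = {"查询话费": "您本月话费为 50 元。"}  # illustrative entries

def decide_strategy(emotion, semantics):
    """Stable users get an automated reply; anxious users are escalated."""
    if emotion == "stable":
        reply = RESPONSE_DB.get(semantics, "抱歉，请您再描述一下问题。")
        return ("auto_reply", reply)
    return ("transfer_to_human", None)

print(decide_strategy("stable", "查询话费"))
print(decide_strategy("anxious", "投诉"))  # complaint + anxiety → human agent
```

This makes the asymmetry explicit: semantics only matter on the automated path, while the anxious path escalates unconditionally, which is the behavior the complaint example argues for.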
S103: Respond to the user according to the response strategy.
According to the response strategy determined in S102, if the intelligent customer service is used, a corresponding response utterance is retrieved from the database according to the semantics of the user's speech to answer the user. If a transfer is required, the service is provided by a human agent.
It should be noted that if all human agents are busy when the call is transferred, a temporary relay intelligent customer service is set up. The relay imitates the state of a connected human agent, and as soon as a human agent becomes free, the call is switched directly to that agent.
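The relay behavior can be sketched with a simple queue model. The class, its method names, and the FIFO hand-over policy are assumptions made for the sketch; the patent only specifies that the relay holds the call and hands it over when an agent frees up.

```python
# Hedged sketch of the relay described above: when no agent is free, a
# stand-in bot parks the call and hands it over once an agent frees up.

from collections import deque

class CallCenter:
    def __init__(self, free_agents):
        self.free_agents = deque(free_agents)
        self.held_by_relay = deque()  # calls parked with the relay bot

    def transfer(self, call_id):
        """Route a call to a free agent, or park it with the relay."""
        if self.free_agents:
            return ("human", self.free_agents.popleft(), call_id)
        self.held_by_relay.append(call_id)
        return ("relay", None, call_id)

    def agent_freed(self, agent):
        """Give the longest-waiting relayed call to the freed agent first."""
        if self.held_by_relay:
            return ("human", agent, self.held_by_relay.popleft())
        self.free_agents.append(agent)
        return None

cc = CallCenter(free_agents=["agent-1"])
print(cc.transfer("call-A"))      # agent available → human
print(cc.transfer("call-B"))      # agents busy → relay holds the call
print(cc.agent_freed("agent-1"))  # freed agent takes the parked call
```

From the caller's perspective the relay stage is invisible: the call is "connected" immediately, and the later switch to the freed agent happens without re-queuing.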
Corresponding to the deep learning-based intelligent interaction method provided in the above embodiments, the present application further provides an embodiment of a deep learning-based intelligent interaction system. Referring to FIG. 2, the deep learning-based intelligent interaction system 20 includes an acquisition module 201, a determination module 202, and a response module 203.
The acquisition module 201 is configured to acquire voice feature information of an accessing user. The determination module 202 is configured to input the voice feature information into a trained deep learning neural network to determine a response strategy. The response module 203 is configured to respond to the user according to the response strategy.
In this embodiment, the acquisition module 201 includes a first determination unit and a second determination unit. The first determination unit is configured to match the language of the user's speech and determine the language information; the second determination unit is configured to determine the semantics and intonation meaning of the speech according to the language information and the corresponding language library.
Further, the second determination unit includes a first determination subunit, a second determination subunit, and a third determination subunit. The first determination subunit is configured to determine each character in the user's utterance according to the language library and the voiceprint information. The second determination subunit is configured to combine the determined characters and then perform part-of-speech segmentation to determine the semantics of the user's speech. The third determination subunit is configured to determine the user's intonation meaning by combining the speech intonation with the intonation feature information of the current language.
The determination module 202 includes a third determination unit and a processing unit. The third determination unit is configured for the deep learning neural network to determine the user's emotional characteristics according to the intonation meaning. The processing unit is configured to select a corresponding response utterance from the response database according to the semantics of the user's speech if the emotional characteristics indicate that the user is emotionally stable, or to transfer the call to a human agent if the emotional characteristics indicate that the user is anxious.
The present application further provides an embodiment of a terminal. Referring to FIG. 3, the terminal 30 includes a processor 301, a memory 302, and a communication interface 303.
In FIG. 3, the processor 301, the memory 302, and the communication interface 303 may be connected to one another through a bus; the bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in FIG. 3, but this does not mean that there is only one bus or one type of bus.
The processor 301 generally controls the overall functions of the terminal 30, for example starting the terminal 30 and, after startup, acquiring the voice feature information of the accessing user, inputting the voice feature information into the trained deep learning neural network to determine the response strategy, and responding to the user according to the response strategy.
The processor 301 may be a general-purpose processor, for example a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. The processor may also be a microcontroller unit (MCU). The processor may further include a hardware chip, which may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or the like.
The memory 302 is configured to store computer-executable instructions to support the operation of the terminal 30. The memory 302 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
After the terminal 30 is started, the processor 301 and the memory 302 are powered on, and the processor 301 reads and executes the computer-executable instructions stored in the memory 302 to complete all or part of the steps of the deep learning-based intelligent interaction method embodiments described above.
The communication interface 303 is used by the terminal 30 to transmit data, for example to communicate with network devices and servers. The communication interface 303 includes a wired communication interface and may further include a wireless communication interface. The wired communication interface includes a USB interface or a Micro USB interface and may also include an Ethernet interface. The wireless communication interface may be a WLAN interface, a cellular network communication interface, or a combination thereof.
In an exemplary embodiment, the terminal 30 provided by the embodiments of the present application further includes a power supply component, which supplies power to the various components of the terminal 30. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the terminal 30.
The terminal 30 may also include a communication component configured to facilitate wired or wireless communication between the terminal 30 and other devices. The terminal 30 can access a wireless network based on a communication standard, such as WiFi, 4G, or 5G, or a combination thereof. The communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. The communication component further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In an exemplary embodiment, the terminal 30 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), or other electronic components.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

Claims (10)

  1. A deep learning-based intelligent interaction method, characterized by comprising:
    acquiring voice feature information of an accessing user;
    inputting the voice feature information into a trained deep learning neural network to determine a response strategy; and
    responding to the user according to the response strategy.
  2. The deep learning-based intelligent interaction method according to claim 1, characterized in that acquiring voice feature information of an accessing user comprises:
    matching the language of the user's speech to determine language information; and
    determining the semantics and intonation meaning of the speech according to the language information and a corresponding language library.
  3. The deep learning-based intelligent interaction method according to claim 2, characterized in that determining the semantics and intonation meaning of the speech according to the language information and the corresponding language library comprises:
    determining each character in the user's utterance according to the language library and voiceprint information;
    combining the determined characters and then performing part-of-speech segmentation to determine the semantics of the user's speech; and
    determining the user's intonation meaning by combining the speech intonation with intonation feature information of the current language.
  4. The deep learning-based intelligent interaction method according to claim 3, characterized in that inputting the voice feature information into the trained deep learning neural network to obtain a response strategy comprises:
    determining, by the deep learning neural network, the user's emotional characteristics according to the intonation meaning; and
    if the emotional characteristics indicate that the user is emotionally stable, selecting a corresponding response utterance from a response database according to the semantics of the user's speech;
    or, if the emotional characteristics indicate that the user is anxious, transferring the call to a human agent.
  5. The deep learning-based intelligent interaction method according to claim 4, characterized in that if all human agents are busy when the call is transferred, a temporary relay intelligent customer service is established, wherein the relay imitates the state of a connected human agent, and when a human agent becomes free, the call is switched directly to the human agent.
  6. A deep learning-based intelligent interaction system, characterized by comprising:
    an acquisition module, configured to acquire voice feature information of an accessing user;
    a determination module, configured to input the voice feature information into a trained deep learning neural network to determine a response strategy; and
    a response module, configured to respond to the user according to the response strategy.
  7. The deep learning-based intelligent interaction system according to claim 6, characterized in that the acquisition module comprises:
    a first determination unit, configured to match the language of the user's speech and determine language information; and
    a second determination unit, configured to determine the semantics and intonation meaning of the speech according to the language information and a corresponding language library.
  8. The deep learning-based intelligent interaction system according to claim 7, characterized in that the second determination unit comprises:
    a first determination subunit, configured to determine each character in the user's utterance according to the language library and voiceprint information;
    a second determination subunit, configured to combine the determined characters and then perform part-of-speech segmentation to determine the semantics of the user's speech; and
    a third determination subunit, configured to determine the user's intonation meaning by combining the speech intonation with intonation feature information of the current language.
  9. The deep learning-based intelligent interaction system according to claim 8, characterized in that the determination module comprises:
    a third determination unit, configured for the deep learning neural network to determine the user's emotional characteristics according to the intonation meaning; and
    a processing unit, configured to select a corresponding response utterance from a response database according to the semantics of the user's speech if the emotional characteristics indicate that the user is emotionally stable,
    or to transfer the call to a human agent if the emotional characteristics indicate that the user is anxious.
  10. A terminal, characterized by comprising:
    a processor; and
    a memory configured to store computer-executable instructions,
    wherein, when the processor executes the computer-executable instructions, the processor performs the method according to any one of claims 1 to 5 to realize intelligent voice interaction.
PCT/CN2021/136927 2021-12-03 2021-12-10 Deep learning-based intelligent human-computer interaction method and system, and terminal WO2023097745A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111464680.9A CN114240454A (en) 2021-12-03 2021-12-03 Intelligent interaction method, system and terminal based on deep learning
CN202111464680.9 2021-12-03

Publications (1)

Publication Number Publication Date
WO2023097745A1

Family

ID=80752869

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/136927 WO2023097745A1 (en) 2021-12-03 2021-12-10 Deep learning-based intelligent human-computer interaction method and system, and terminal

Country Status (2)

Country Link
CN (1) CN114240454A (en)
WO (1) WO2023097745A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202301A (en) * 2016-07-01 2016-12-07 武汉泰迪智慧科技有限公司 A kind of intelligent response system based on degree of depth study
CN108090218A (en) * 2017-12-29 2018-05-29 北京百度网讯科技有限公司 Conversational system generation method and device based on deeply study
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
CN110149450A (en) * 2019-05-22 2019-08-20 欧冶云商股份有限公司 Intelligent customer service answer method and system
CN110427472A (en) * 2019-08-02 2019-11-08 深圳追一科技有限公司 The matched method, apparatus of intelligent customer service, terminal device and storage medium
CN111739516A (en) * 2020-06-19 2020-10-02 中国—东盟信息港股份有限公司 Speech recognition system for intelligent customer service call
CN112148849A (en) * 2020-09-08 2020-12-29 北京百度网讯科技有限公司 Dynamic interaction method, server, electronic device and storage medium
CN112148850A (en) * 2020-09-08 2020-12-29 北京百度网讯科技有限公司 Dynamic interaction method, server, electronic device and storage medium
JP2021022928A (en) * 2019-07-24 2021-02-18 ネイバー コーポレーションNAVER Corporation Artificial intelligence-based automatic response method and system

Also Published As

Publication number Publication date
CN114240454A (en) 2022-03-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21966170

Country of ref document: EP

Kind code of ref document: A1