CN111752523A - Human-computer interaction method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111752523A
CN111752523A
Authority
CN
China
Prior art keywords
target
voice
information
voice information
target user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010400842.1A
Other languages
Chinese (zh)
Inventor
陈百灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd filed Critical Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202010400842.1A
Publication of CN111752523A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Manipulator (AREA)

Abstract

The application relates to a human-computer interaction method and apparatus, a computer device, and a storage medium. While playing target voice data, a telephone robot detects voice information in the environment where a target user is located. If the voice information belongs to the target user, playback of the target voice data is paused and the playing position at the moment of pausing is recorded; the semantics of the target user's voice information are then recognized, and a response operation corresponding to those semantics is executed. After the response operation finishes, the remaining content of the target voice data is played from the recorded pause position. The method avoids the inefficiency and incomplete information delivery caused by repeating a voice broadcast from the beginning after a user interruption, and improves the flexibility with which the robot handles interruptions, making human-computer interaction more efficient and the robot more flexible and intelligent.

Description

Human-computer interaction method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a human-computer interaction method and apparatus, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, intelligent robots have been applied to various fields, for example, various types of services such as automatic customer service, intelligent marketing, content navigation, intelligent voice control, entertainment chat, and the like are provided in various fields such as telecom operators, financial services, e-government affairs, e-commerce, various intelligent terminals, and personal internet information services.
Most intelligent robots can carry out basic communication with users during service. However, when a robot is playing its part of the dialogue, loud noise in its environment, such as other people speaking near the user, can wrongly interrupt the robot's speech; alternatively, after being interrupted by the user several times in succession, the robot may mechanically replay the dialogue from the beginning.
Therefore, the existing robot has the problems of insufficient flexibility and intelligence in human-computer interaction.
Disclosure of Invention
In view of the above, there is a need to provide a human-computer interaction method, an apparatus, a computer device and a storage medium, which can make human-computer interaction more flexible and intelligent.
In a first aspect, a human-computer interaction method is provided, and the method includes:
detecting voice information in the environment where a target user is located when target voice data are played;
if the voice information is the voice information of the target user, pausing playing the target voice data, and recording the playing position of the target voice data at the pausing moment;
recognizing the semantics of the voice information of the target user, and executing response operation corresponding to the semantics;
after the response operation is finished, the remaining content of the target voice data is played from the play position at the pause time.
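The four steps above can be sketched as a minimal control loop. The segment-based playback, the `is_target_user` check, and the `respond` callback below are illustrative stand-ins for the detection model and response logic described later in the embodiments, not part of the patent itself.

```python
def handle_playback(segments, events, is_target_user, respond):
    """Play segments in order; on target-user speech, pause, record the
    play position, respond, then resume from that position (a sketch)."""
    position = 0
    transcript = []
    while position < len(segments):
        speech = events.pop(position, None)     # voice detected at this point?
        if speech is not None and is_target_user(speech):
            paused_at = position                # record play position at pause
            transcript.append(respond(speech))  # response operation
            position = paused_at                # resume from the pause position
        transcript.append(segments[position])
        position += 1
    return transcript
```

Non-target speech is consumed without a response, so playback simply continues, matching the behavior described for non-target users below.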
In one embodiment, the recognizing the semantics of the voice information of the target user and performing a response operation corresponding to the semantics includes:
determining the information type of the voice information according to the semantics of the voice information of the target user;
and according to the information type, executing response operation corresponding to the semantics.
In one embodiment, the performing, according to the information type, a response operation corresponding to the semantics includes: if the information type is the inquiry type, matching reply content corresponding to the semantics of the target user's voice information from a preset knowledge base and playing the reply content, where the knowledge base includes reply contents corresponding to various information types; or,
if the information type is the attachment type (i.e., a backchannel acknowledgement such as "mm-hm"), waiting for a preset duration and then playing the remaining content of the target voice data from the playing position at the pause time.
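A minimal dispatch over the two information types described above might look as follows; the type labels, the default wait of 1.0 s, and the fallback reply are illustrative assumptions, with "attachment" taken to mean a backchannel acknowledgement such as "mm-hm".

```python
def respond_by_type(info_type, semantics, knowledge_base, wait_seconds=1.0):
    """Dispatch the response operation on the information type (a sketch)."""
    if info_type == "inquiry":
        # match reply content for the semantics from the preset knowledge base
        reply = knowledge_base.get(semantics, "Sorry, could you rephrase?")
        return ("play_reply", reply)
    if info_type == "attachment":
        # backchannel such as "mm-hm": wait briefly, then resume playback
        return ("wait_then_resume", wait_seconds)
    return ("ignore", None)
```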
In one embodiment, the recording the playing position of the target voice data at the pause time includes:
determining the clause in the target voice data corresponding to the pause time; and
determining the position of that clause in the target voice data as the playing position of the target voice data at the pause time.
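Determining the clause being played at the pause time can be sketched with a binary search over clause start times; representing clauses by start offsets in seconds is an illustrative assumption.

```python
from bisect import bisect_right

def clause_position(clause_starts, pause_time):
    """Return the index of the clause being played at pause_time,
    given each clause's start offset in seconds (a sketch)."""
    # bisect_right counts how many clause starts lie at or before pause_time
    return max(bisect_right(clause_starts, pause_time) - 1, 0)
```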
In one embodiment, after detecting the voice information in the environment where the target user is located while the target voice data is played, the method further includes:
if the voice information is the voice information of a non-target user, making no response to the voice information and continuing to play the target voice data.
In one embodiment, the method further comprises:
and identifying whether the voice information in the environment where the target user is located is the voice information of the target user through a preset sound characteristic detection model.
In one embodiment, the training process of the sound feature detection model includes:
acquiring a plurality of sample voice characteristics and user identifications corresponding to the sample voice characteristics;
and learning the corresponding relation between each voice feature and the corresponding user identification through a deep learning algorithm so as to carry out iterative training on the initial voice feature detection model until the variation amplitude of the loss function of the voice feature detection model is smaller than a preset threshold value, thereby obtaining the voice feature detection model.
In a second aspect, the present application provides a human-computer interaction device, comprising:
the detection module is used for detecting the voice information in the environment where the target user is located when the target voice data are played;
the stopping module is used for pausing the playing of the target voice data and recording the playing position of the target voice data at the pausing moment if the voice information is the voice information of the target user;
the response module is used for identifying the semantics of the voice information of the target user and executing response operation corresponding to the semantics;
and the playing module is used for playing the residual content of the target voice data from the playing position of the pause time after the response operation is finished.
In one embodiment, the response module includes:
the type determining unit is used for determining the information type of the voice information according to the semantic meaning of the voice information of the target user;
and the execution response unit is used for executing response operation corresponding to the semantics according to the information type.
In an embodiment, the execution response unit is specifically configured to: if the information type is the inquiry type, match reply content corresponding to the semantics of the target user's voice information from a preset knowledge base and play the reply content, where the knowledge base includes reply contents corresponding to various information types; or, if the information type is the attachment type (i.e., a backchannel acknowledgement), pause for a preset duration and then continue playing the remaining content of the target voice data from the playing position at the pause time.
In one embodiment, the stopping module includes:
the clause determining unit is used for determining a clause corresponding to the pause time in the target voice data;
and the position determining unit is used for determining the position of the clause corresponding to the pause time in the target voice data as the playing position of the target voice data at the pause time.
In an embodiment, the response module is further specifically configured to, if the voice information is the voice information of a non-target user, make no response to the voice information and continue playing the target voice data.
In an embodiment, the detection module is specifically configured to identify whether the speech information in the environment where the target user is located is the speech information of the target user through a preset sound feature detection model.
In one embodiment, the apparatus further comprises:
a sample obtaining module, configured to obtain multiple sample voice features and the user identifier corresponding to each sample voice feature; and
And the model training module is used for learning the corresponding relation between each voice feature and the corresponding user identifier through a deep learning algorithm so as to carry out iterative training on the initial voice feature detection model until the variation amplitude of the loss function value of the voice feature detection model is smaller than a preset threshold value, thereby obtaining the voice feature detection model.
In a third aspect, the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the human-computer interaction method in any one of the embodiments of the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the human-computer interaction method of any one of the embodiments of the first aspect described above.
According to the human-computer interaction method and apparatus, the computer device, and the storage medium, the telephone robot detects voice information in the environment where the target user is located while playing target voice data. If the voice information belongs to the target user, playback of the target voice data is paused and the playing position at the pause time is recorded; the semantics of the target user's voice information are then recognized and a response operation corresponding to those semantics is executed; after the response operation finishes, the remaining content of the target voice data is played from the recorded pause position. In this method, when the telephone robot detects the target user's voice information during playback, it first pauses the target voice data, responds to the target user according to the semantics of the voice information the user has just produced, and then continues playing the remaining target voice data from the paused position. In other words, after being interrupted by the target user, the robot first responds according to the user's interruption intent, and after the response it need not replay the target voice data from the beginning but plays only the remaining content. This avoids the inefficiency and incomplete information delivery caused by repeating a voice broadcast after a user interruption, improves the flexibility with which the robot handles interruptions, makes human-computer interaction more efficient, and makes the robot more flexible and intelligent.
Drawings
FIG. 1 is a diagram of an application environment of a human-computer interaction method according to an embodiment;
FIG. 1a is a diagram illustrating an internal structure of a phone robot according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating a human-computer interaction method according to an embodiment;
FIG. 3 is a flowchart illustrating a human-computer interaction method according to another embodiment;
FIG. 4 is a flowchart illustrating a human-computer interaction method according to another embodiment;
FIG. 5 is a diagram illustrating a human-computer interaction method, according to an embodiment;
FIG. 6 is a block diagram of a human-computer interaction device according to an embodiment;
FIG. 7 is a block diagram of a human-computer interaction device according to another embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, the present application provides an application environment of a human-computer interaction method, where a telephone robot 01 may perform voice interaction with a user, and the interaction scenario includes, but is not limited to, service scenarios such as online question answering, consultation, instruction execution, and the like. The telephone robot 01 includes, but is not limited to, a calling robot, a chat robot, an intelligent customer service, an intelligent assistant, and other robots of various service types. The internal structure of the telephone robot can be seen in fig. 1a, and the telephone robot comprises a processor, a memory, a network interface and a database which are connected through a system bus. Wherein the processor is configured to provide computational and control capabilities. The memory comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database is used for storing data of a human-computer interaction method. The network interface is used for communicating with an external terminal through network connection. The computer program is executed by a processor to implement a human-computer interaction method. It is understood that the internal structure of the intelligent robot shown in fig. 1a is only an example and is not intended to be limiting.
The embodiments of the present application provide a human-computer interaction method and apparatus, a computer device, and a storage medium, so that human-computer interaction is more flexible and intelligent. The technical solutions of the present application, and how they solve the above technical problems, are described in detail below through embodiments and with reference to the drawings. The following specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. It should be noted that, in the human-computer interaction method provided by the present application, the execution subject of fig. 2 to 5 is a telephone robot. The execution subject in fig. 2 to 5 may also be a human-computer interaction apparatus, which may be implemented as part or all of a telephone robot by software, hardware, or a combination of the two.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
In an embodiment, fig. 2 provides a human-computer interaction method. This embodiment relates to the specific process in which, while the telephone robot plays target voice data, it pauses playback if the target user's voice information is detected, performs the operation corresponding to the semantics of that voice information, and, after the operation finishes, continues playing the target voice data from the position where it was paused. As shown in fig. 2, the method includes:
s101, voice information in the environment where the target user is located is detected when the target voice data are played.
The target voice data refers to the voice data being played by the telephone robot at the current time, for example, the telephone robot is playing an important information notification or playing marketing information in a marketing scene.
A telephone robot is an intelligent voice service device that accurately understands the user's intent and questions through artificial intelligence technologies such as speech recognition, speech synthesis, semantic understanding, and natural language generation, and provides services such as autonomous online question answering, consultation, and instruction execution through natural, smooth human-computer interaction. In practical applications, therefore, the telephone robot plays the target voice data in a scenario where it is interacting with the target user. In such a scenario, the robot can detect the voice information in the environment where the target user is located through voice detection, for example detecting whether the voice information was produced by the target user, and whether it was directed at the telephone robot or at some other object, and so on.
The telephone robot needs to detect and collect voice information of an environment where a target user is located, and the main purpose of collecting the voice information of the environment where the target user is located is to detect whether the voice information of the target user exists in the collected voice information. It should be understood that, in practical applications, the voice information in the environment where the target user is located includes not only the voice information of the target user, but also the voice information of other people or objects may exist, and therefore, the telephone robot needs to detect whether the information of the target user exists in the collected voice information.
Optionally, in an embodiment, the telephone robot may recognize whether the voice information in the environment where the target user is located is the voice information of the target user through a preset sound feature detection model. In one embodiment, the training process of the sound feature detection model includes: acquiring a plurality of sample voice characteristics and user identifications corresponding to the sample voice characteristics; and learning the corresponding relation between each voice feature and the corresponding user identification through a deep learning algorithm so as to carry out iterative training on the initial voice feature detection model until the variation amplitude of the loss function of the voice feature detection model is smaller than a preset threshold value, thereby obtaining the voice feature detection model.
The sound feature detection model is a pre-trained model for detecting the features of different voices in a piece of speech, for example timbre and voiceprint. The telephone robot can therefore input the collected voice information from the target user's environment into the sound feature detection model and judge, from the model's output, whether the target user's voice information is present. During training, the model uses multiple sample voice features, together with the user identifier corresponding to each sample, as the training set. To keep the model's recognition accuracy high, the training set should be as diverse and rich as possible, so that the model can learn comprehensive voice features. During training, the value of the loss function guides the process: when the loss value becomes stable, i.e., its variation over several consecutive iterations is small (for example, less than 0.1), training of the sound feature detection model is considered complete and the trained model is obtained. The voice information of the target user can thus be detected quickly and accurately with this pre-trained sound feature detection model.
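The stopping rule described above, training until the loss variation stays small over several consecutive iterations, can be sketched as follows. The `train_step` callback, the window size, and the 0.1 threshold default are illustrative assumptions; a real model would be a deep speaker-identification network rather than this generic loop.

```python
def train_until_stable(train_step, threshold=0.1, window=3, max_iters=1000):
    """Run train_step (one update, returns current loss) until the loss
    change stays below threshold for window consecutive steps (a sketch)."""
    losses = [train_step()]
    stable = 0
    for _ in range(max_iters):
        losses.append(train_step())
        if abs(losses[-2] - losses[-1]) < threshold:
            stable += 1
            if stable >= window:
                break                 # loss has stabilized: training complete
        else:
            stable = 0                # variation too large: reset the streak
    return losses
```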
Optionally, the telephone robot may instead compare the voice information in the target user's environment against pre-stored voice features of the target user. That is, while interacting with the user, the robot stores the target user's voice features in advance; when detecting voice information, it can then compare the features of the environmental voice information with the stored features of the target user, and determine from the comparison whether the target user's voice information is present in the environment.
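This comparison-based check can be sketched as a cosine similarity between the pre-stored voiceprint of the target user and an observed feature vector; the vector representation and the 0.85 threshold are illustrative assumptions, not values from the patent.

```python
import math

def is_target_voice(stored_print, observed_print, threshold=0.85):
    """Cosine-similarity check of an observed voice feature vector against
    the target user's pre-stored voiceprint (threshold is illustrative)."""
    dot = sum(a * b for a, b in zip(stored_print, observed_print))
    norm = (math.sqrt(sum(a * a for a in stored_print))
            * math.sqrt(sum(b * b for b in observed_print)))
    return norm > 0 and dot / norm >= threshold
```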
S102, if the voice information is the voice information of the target user, the playing of the target voice data is paused, and the playing position of the target voice data at the pause time is recorded.
The scenario in the practical application corresponding to this step is that when the telephone robot is playing a piece of voice data, the telephone robot is suddenly interrupted by the voice information input by the target user, that is, the telephone robot detects that the voice information in the environment where the target user is located is the voice information of the target user, or the voice information in the environment where the target user is located includes the voice information of the target user.
Then, when interrupted in this way by voice information from the target user, the telephone robot pauses the target voice data currently being played, and records the playing position of the target voice data at the moment of pausing. For example, the target voice data played by the telephone robot is a piece of insurance information, and while it is playing "this 10-year insurance only requires 5 years of payment; each month you pay", the robot detects the target user's voice information; the playing position at the pause time is then the position between "only requires 5 years of payment" and "each month you pay".
S103, identifying the semantics of the voice information of the target user and executing response operation corresponding to the semantics.
After the telephone robot pauses the target voice data being played and records the pause position, the telephone robot needs to recognize the voice information semantics of the target user, namely needs to recognize the real intention of the target user to input the voice information, so as to execute the corresponding response operation according to the real intention of the target user.
It should be noted that the process of recognizing the semantics of the target user's voice information and the process of pausing the played target voice data and recording the pause position may be performed simultaneously or sequentially. In practice, completing semantic recognition and recording the pause position quickly, whichever ordering is used, ensures a smooth response by the telephone robot, prevents the user from perceiving the robot's pause, and improves the human-computer interaction experience.
For example, the telephone robot may recognize the semantics of the target user's voice information using a pre-trained semantic recognition network model. This model may be a separate model, independent of the sound feature detection model, with each model responsible for a different detection function; alternatively, a single model may serve both purposes, i.e., the sound feature detection model may detect not only voice features but also the specific semantic content of the target user's voice information.
After recognizing the semantics of the voice information of the target user, the telephone robot needs to perform a response operation corresponding to the semantics. For example, the telephone robot may match a corresponding response reply from a preset knowledge base according to the semantic meaning of the voice message of the target user, and execute the response reply, or may adopt other manners, which is not limited in this embodiment.
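Matching a response from the preset knowledge base can be sketched as a toy keyword-overlap matcher; a production system would use the trained semantic models described above, and all names and data here are illustrative assumptions.

```python
def match_reply(semantics, knowledge_base):
    """Toy matcher: pick the knowledge-base entry whose question shares
    the most words with the recognized semantics (a sketch)."""
    words = set(semantics.lower().split())
    best_reply, best_score = None, 0
    for question, reply in knowledge_base.items():
        score = len(words & set(question.lower().split()))
        if score > best_score:
            best_reply, best_score = reply, score
    return best_reply
```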
And S104, after the response operation is finished, playing the residual content of the target voice data from the playing position of the pause time.
After the response operation finishes, the telephone robot retrieves the just-recorded playing position of the target voice data at the pause time and continues playing the remaining content of the target voice data from that position. For example, if the recorded position is between "only requires 5 years of payment" and "each month you pay", the robot resumes from there, playing "each month you pay 200 yuan; is that acceptable?". In this way, the remaining content continues from the paused position rather than starting over from the beginning, improving human-computer interaction efficiency.
In the human-computer interaction method provided by this embodiment, the telephone robot detects voice information in the environment where the target user is located while playing target voice data; if the voice information belongs to the target user, it pauses playback of the target voice data and records the playing position at the pause time, then recognizes the semantics of the target user's voice information and executes the corresponding response operation, and after the response operation finishes, it plays the remaining content of the target voice data from the recorded pause position. After being interrupted by the target user, the robot thus first responds according to the user's interruption intent; after responding, it need not replay the target voice data from the beginning but plays only the remaining content. This avoids the inefficiency and incomplete information delivery caused by repeated voice broadcasts after user interruptions, improves the flexibility with which the robot handles interruptions, makes human-computer interaction more efficient, and makes the robot more flexible and intelligent.
The above embodiment describes the case where the voice information in the environment of the target user is the voice information of the target user. In practical applications, however, the voice information in the environment may well contain voice information of non-target users, so the case where the voice information in the environment is the voice information of a non-target user is described below by way of an embodiment. In one embodiment, if the voice information is the voice information of a non-target user, the target voice data continues to be played without responding to the voice information.
In this embodiment, if the voice information detected by the telephone robot in the environment of the target user is the voice information of a non-target user (for example, the sound feature detection model of the above embodiment determines that the voice information belongs to a person other than the target user, or was emitted by another object in the environment), the telephone robot may ignore the voice information, make no response, and continue to play the target voice data. This prevents the telephone robot from being falsely interrupted by noise or other sounds in the environment and improves the intelligence of the telephone robot in human-computer interaction.
As for the process by which the telephone robot recognizes the semantics of the voice information of the target user and executes the response operation corresponding to those semantics, the present application provides a specific embodiment for detailed description. In one embodiment, as shown in fig. 3, the step S103 includes:
S201, determining the information type of the voice information according to the semantics of the voice information of the target user.
The information type of the voice information is a classification of the target user's voice information according to the intention it expresses; for example, the information type may be an inquiry type, an acknowledgment type, a dispute type, a praise type, and the like.
In practical applications, the telephone robot may determine the actual intention of the voice information according to its semantics and then classify the information type according to that intention. For example, correspondences between various semantics and information types may be preset; after determining the semantics of the target user's voice information, the telephone robot matches the determined semantics against the preset correspondences and thereby determines the information type corresponding to the semantics. This embodiment does not specifically limit the manner of classification.
S202, executing the response operation corresponding to the semantics according to the information type.
After the specific information type is determined, the telephone robot executes the corresponding response operation; different information types correspond to different response operations, so the telephone robot responds to the target user flexibly, which improves the intelligence of human-computer interaction.
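Steps S201 and S202 can be illustrated with a minimal keyword-based classifier. The type names and keyword lists below are assumptions made for illustration; the patent only requires that preset correspondences between semantics and information types exist, not this particular rule.

```python
# Illustrative mapping from recognized semantics to an information type.
# The keyword lists and type names are assumptions, not the patent's method.

INTENT_TYPES = {
    "inquiry": ["what", "which", "how much", "why"],
    "acknowledgment": ["yes", "okay", "uh-huh", "got it"],
}

def classify(utterance: str) -> str:
    """Return the information type matched from the preset correspondences."""
    text = utterance.lower()
    for info_type, keywords in INTENT_TYPES.items():
        if any(k in text for k in keywords):
            return info_type
    return "other"   # unmatched semantics fall through to a default type
```

A real system would classify a full semantic representation rather than raw keywords, but the dispatch structure (type lookup, then a type-specific response) is the same.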
Response operations corresponding to two specific information types are provided below. Optionally, in an embodiment, the performing, according to the information type, a response operation corresponding to the semantics includes: if the information type is the inquiry type, matching reply content corresponding to the semantics of the voice information of the target user from a preset knowledge base, and playing the reply content; the knowledge base comprises reply contents corresponding to various information types.
An information type of inquiry indicates that the current target user has a question about the voice content played by the telephone robot and has therefore uttered a query. After determining the specific question from the semantics of the target user's voice information, the telephone robot matches the reply content corresponding to those semantics from a pre-established knowledge base and plays the reply content. The knowledge base is established in advance and stores reply contents corresponding to the semantics of a large number of voice information items across different information types.
For example, while the telephone robot is playing the target voice data "this insurance pays out for 10 years but only needs 5 years of premiums, paid monthly", it suddenly detects that the target user has uttered voice information, specifically "What is the name of the insurance you just mentioned?". After determining the specific semantics of this voice information, the telephone robot matches the reply "This insurance is Healthy Life Insurance" from the knowledge base and plays it in response.
In this embodiment, when the telephone robot plays a piece of voice content, the telephone robot receives the inquiry-type voice of the target user, matches the reply corresponding to the intention of the target user in the knowledge base in time, and responds to the target user according to the matched reply, so that the user can quickly obtain the desired answer, and the human-computer interaction efficiency is improved.
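A minimal sketch of the inquiry-type lookup: reply content is matched from a preset knowledge base by keyword overlap with the recognized semantics. The knowledge-base entries, the `match_reply` helper, and the overlap rule are illustrative assumptions; the patent does not specify the matching algorithm.

```python
# Toy knowledge base keyed by sets of semantic key terms. The entries and
# the overlap-based matching rule are assumptions for illustration only.

KNOWLEDGE_BASE = {
    frozenset({"name", "insurance"}): "This insurance is Healthy Life Insurance.",
    frozenset({"monthly", "payment"}): "The payment is 200 yuan per month.",
}

def match_reply(semantic_keywords: set) -> str:
    """Return the reply whose key terms best overlap the recognized semantics."""
    best, best_overlap = None, 0
    for keys, reply in KNOWLEDGE_BASE.items():
        overlap = len(keys & semantic_keywords)
        if overlap > best_overlap:
            best, best_overlap = reply, overlap
    # Fall back to a clarification prompt when nothing in the base matches.
    return best or "Sorry, could you rephrase the question?"
```

The fallback reply is also an assumption; a production system might instead escalate to a human agent when no knowledge-base entry matches.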
Optionally, in another embodiment, the performing, according to the information type, a response operation corresponding to the semantics includes: if the information type is an acknowledgment type, pausing for a preset time length and then continuing to play the remaining content of the target voice data from the playing position at the pause time.
The information type in this embodiment is the acknowledgment type, which means that the voice information uttered by the target user is only a simple acknowledgment, for example "uh-huh", "okay", "yes", "got it", and similar acknowledgment sentences.
If the telephone robot detects the target user's voice information and recognizes that it is only a simple acknowledgment, the telephone robot does not need to answer with any content; it only needs to pause for a preset time length and then continue to play the remaining content of the target voice data from the playing position at the pause time, for example continuing the voice content being played after 2 s. For instance, while the telephone robot is playing the target voice data "this insurance pays out for 10 years but only needs 5 years of premiums, with a monthly payment of", it suddenly detects voice information from the target user, specifically "uh-huh"; the telephone robot then pauses for 2 s and continues to play the subsequent content from the pause position.
Of course, in practical applications the telephone robot may also give some simple reply, for example replying "okay" after the target user utters the voice information "yes". This embodiment does not limit this, as long as the telephone robot gives only a simple reply after detecting the target user's acknowledgment information, without performing excessive reply operations, which avoids wasting time or leaving the target voice data incompletely played.
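The acknowledgment-type handling described above reduces to waiting a preset duration and then resuming, which can be sketched as below. The `respond_to_acknowledgment` helper and the use of `time.sleep` to stand in for the pause are illustrative assumptions.

```python
import time

def respond_to_acknowledgment(remaining_clauses, pause_seconds=2.0):
    """Pause for the preset time length, then hand back the remaining content
    so playback continues from the recorded position. No reply is generated,
    matching the acknowledgment-type behavior described in the text."""
    time.sleep(pause_seconds)
    return remaining_clauses

# A short pause is used here so the example runs quickly; the text above
# uses 2 s as its example duration.
remaining = respond_to_acknowledgment(
    ["200 yuan per month", "would you like to apply?"], pause_seconds=0.01)
```

The key design point is that the acknowledgment branch never consults the knowledge base, so the cheap case (a backchannel like "uh-huh") costs only the preset pause.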
As to the process of recording the pause position of the target voice data by the telephone robot, an embodiment is provided, as shown in fig. 4, and in an embodiment, the step S102 includes:
s301, clauses corresponding to pause time in the target voice data are determined.
The target voice data is a section of voice content being played by the telephone robot, and this section is composed of a plurality of clauses. After the target voice data is paused, the telephone robot therefore determines the clause being played at the pause time. For example, if the telephone robot pauses while playing "this insurance pays out for 10 years but only needs 5 years of premiums, paid monthly", the clause at the pause time is "paid monthly".
S302, determining the position of the clause corresponding to the pause time in the target voice data as the playing position of the target voice data at the pause time.
After the clause corresponding to the pause time in the target voice data is determined, the telephone robot determines the position of the clause in the target voice data as the playing position of the target voice data at the pause time.
For example, if the clause at the pause time is "paid monthly", then the position of this clause in the target voice data, namely immediately after "this insurance pays out for 10 years but only needs 5 years of premiums", is the playing position of the target voice data at the pause time.
In this embodiment, the clause at the pause time is determined, and the position of that clause in the target voice data is taken as the playing position of the target voice data at the pause time. Using the clause as the basis for determining the position improves the accuracy of recording the pause position.
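Steps S301 and S302 can be sketched by splitting the voice text into clauses and locating the clause that contains the pause point. Splitting on punctuation and measuring playback progress in characters are assumptions made so the example is self-contained; a real system would work on audio timestamps.

```python
import re

def clause_position(voice_text: str, elapsed_chars: int) -> int:
    """Return the index of the clause being played at the pause time.

    elapsed_chars approximates how far playback had progressed; it is an
    illustrative stand-in for an audio timestamp."""
    clauses = [c for c in re.split(r"[,.!?]", voice_text) if c.strip()]
    consumed = 0
    for i, clause in enumerate(clauses):
        consumed += len(clause) + 1   # +1 for the splitting delimiter
        if elapsed_chars < consumed:
            return i
    return len(clauses) - 1           # pause at the very end of the data

text = "only 5 years of payment, 200 yuan per month, ok?"
```

Recording a clause index rather than a raw offset is what gives the scheme its accuracy: resumption always restarts at a clause boundary, so no sentence is resumed mid-word.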
In addition, the present application also provides a human-computer interaction method, as shown in fig. 5, the embodiment includes:
S1, playing the target voice data and simultaneously detecting the voice information in the environment where the target user is located;
S2, detecting whether the voice information is the voice information of the target user; if yes, executing S3; otherwise, executing S4;
S3, pausing the playing of the target voice data and recording the playing position of the target voice data at the pause time;
S4, making no response to the voice information and continuing to play the target voice data;
S5, recognizing the semantics of the voice information of the target user;
S6, determining the information type of the voice information according to the semantics of the voice information of the target user, and executing S7 or S8 according to the information type;
S7, if the information type is an acknowledgment type, pausing for a preset time length and continuing to play the remaining content of the target voice data from the playing position at the pause time;
S8, if the information type is an inquiry type, matching reply content corresponding to the semantics of the voice information of the target user from a preset knowledge base and playing the reply content, wherein the knowledge base includes reply contents corresponding to various information types;
S9, after the response operation is finished, playing the remaining content of the target voice data from the playing position at the pause time.
In all the steps of the human-computer interaction method provided by the above embodiment, the implementation principle and technical effect are similar to those in the previous human-computer interaction method embodiments, and are not described herein again.
It should be understood that although the steps in the flowcharts of fig. 2-5 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, there is no strict ordering restriction on the performance of these steps, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-5 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a human-computer interaction device, including: a detection module 10, a stop module 11, a response module 12 and a play module 13, wherein:
the detection module 10 is used for detecting voice information in the environment where the target user is located when the target voice data is played;
the stopping module 11 is configured to pause playing of the target voice data and record a playing position of the target voice data at a pause time if the voice information is the voice information of the target user;
the response module 12 is used for identifying the semantics of the voice information of the target user and executing response operation corresponding to the semantics;
and the playing module 13 is configured to play the remaining content of the target voice data from the playing position at the pause time after the response operation is finished.
In this embodiment, while the telephone robot is playing a section of target voice data, if the detection module detects that voice information exists in the environment of the target user and the voice information is the voice of the target user, the stopping module pauses the target voice data being played and records the playing position of the target voice data at the pause time. After the stopping module pauses the target voice data, the response module recognizes the semantics of the target user's voice information and executes the corresponding response operation according to the analyzed semantics; whichever response operation is adopted, after the response operation is finished the playing module plays the remaining content of the target voice data from the playing position at the pause time. In this way, after being interrupted by the target user, the telephone robot first responds according to the target user's interruption intention; once the response is complete, it does not need to replay the target voice data from the beginning but plays only the remaining content. This avoids the inefficiency and incomplete information delivery caused by repeating the voice broadcast after a user interruption, and improves the robot's ability to handle user interruptions flexibly, so that human-computer interaction is more efficient, more flexible, and more intelligent.
In one embodiment, as shown in fig. 7, the response module 12 includes: a type determining unit 121 and an execution responding unit 122;
a type determining unit 121, configured to determine an information type of the voice information according to semantics of the voice information of the target user;
and an execution response unit 122, configured to execute a response operation corresponding to the semantics according to the information type.
In this embodiment, after determining the actual intention of the voice information according to its semantics, the type determining unit in the response module classifies the information type according to that intention; for example, the information type may be an inquiry type, an acknowledgment type, a dispute type, a praise type, and the like. After the specific information type is determined, the telephone robot executes the corresponding response operation, where different information types correspond to different response operations, so that the telephone robot responds flexibly to the target user, which improves the intelligence of human-computer interaction.
In an embodiment, the execution response unit is specifically configured to: if the information type is an inquiry type, match reply content corresponding to the semantics of the target user's voice information from a preset knowledge base and play the reply content, the knowledge base including reply contents corresponding to various information types; or, if the information type is an acknowledgment type, pause for a preset time length and then continue to play the remaining content of the target voice data from the playing position at the pause time.
This embodiment provides the response operations corresponding to two specific information types. An information type of inquiry indicates that the current target user has a question about the voice content played by the telephone robot, and the telephone robot needs to answer the question, that is, match the reply content corresponding to the semantics of the target user's voice information from the preset knowledge base and play it, so that the user quickly obtains the desired answer and human-computer interaction efficiency is improved. If the voice information uttered by the target user is only a simple acknowledgment, for example "yes", "got it", and similar acknowledgment sentences, the telephone robot does not need to answer with any content; it only needs to pause for a preset time length and then continue to play the remaining content of the target voice data from the playing position at the pause time, responding simply without excessive reply operations, which avoids wasting time or leaving the target voice data incompletely played.
In one embodiment, the stopping module 11 includes:
the clause determining unit is used for determining a clause corresponding to the pause time in the target voice data;
and the position determining unit is used for determining the position of the clause corresponding to the pause time in the target voice data as the playing position of the target voice data at the pause time.
In this embodiment, after the telephone robot pauses the playing of the target voice data, the clause being played at the pause time is determined; for example, if the telephone robot pauses while playing "this insurance pays out for 10 years but only needs 5 years of premiums, paid monthly", the clause at the pause time is "paid monthly". The position of this clause in the target voice data is then determined as the playing position of the target voice data at the pause time. Using the clause as the basis for determining the position improves the accuracy of recording the pause position.
In an embodiment, the response module 12 is further specifically configured to, if the voice information is the voice information of the non-target user, continue to play the target voice data without responding to the voice information.
In this embodiment, if the voice information is the voice information of a non-target user, the telephone robot may ignore the voice information, make no response, and continue to play the target voice data, which prevents the telephone robot from being falsely interrupted by noise or other sounds in the environment and improves the intelligence of the telephone robot in human-computer interaction.
In an embodiment, the detection module 10 is specifically configured to identify whether the speech information in the environment where the target user is located is the speech information of the target user through a preset sound feature detection model.
In this embodiment, the voice feature detection model is a pre-trained model for detecting features of different voices in a piece of voice information, for example, the features of the voices are timbre, voiceprint, and the like, and the voice information of the target user can be quickly and accurately detected through the pre-trained voice feature detection model.
In one embodiment, the apparatus further comprises:
a sample obtaining module, configured to obtain a plurality of sample voice features and a user identifier corresponding to each sample voice feature;
And the model training module is used for learning the corresponding relation between each voice feature and the corresponding user identifier through a deep learning algorithm so as to carry out iterative training on the initial voice feature detection model until the variation amplitude of the loss function value of the voice feature detection model is smaller than a preset threshold value, thereby obtaining the voice feature detection model.
In this embodiment, the voice feature detection model is trained using the plurality of sample voice features and the user identifiers corresponding to the sample voice features as the training set; during training, the value of the loss function is used to guide the direction of the iterative training, so that the trained voice feature detection model is more robust and accurate.
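The stopping criterion described above (iterate until the change in the loss value falls below a preset threshold) can be sketched with a synthetic loss sequence. The decaying loss values here are a stand-in for a real voiceprint model trained on sample voice features and user-identifier labels; nothing below is the patent's actual training code.

```python
# Toy sketch of the convergence rule: keep training until the variation
# amplitude of the loss value drops below a preset threshold.

def train_until_converged(next_loss, threshold=1e-3):
    """Run 'training' iterations until the loss change is below the threshold.

    next_loss: callable returning the loss value of the next iteration.
    Returns the number of iterations performed."""
    prev = next_loss()
    iterations = 1
    while True:
        cur = next_loss()
        iterations += 1
        if abs(prev - cur) < threshold:   # variation amplitude small enough
            break
        prev = cur
    return iterations

# Synthetic, decaying losses standing in for real training steps.
losses = iter([1.0, 0.5, 0.25, 0.125, 0.1249, 0.12485])
n = train_until_converged(lambda: next(losses), threshold=1e-3)
```

The only design decision encoded here is that convergence is judged on the change in loss between consecutive iterations, as the text states, rather than on the absolute loss value.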
The implementation principle and technical effect of all the human-computer interaction devices provided by the above embodiments are similar to those of the human-computer interaction method embodiments, and are not described herein again.
For specific limitations of the human-computer interaction device, reference may be made to the above limitations of the human-computer interaction method, which are not described herein again. All or part of each module in the man-machine interaction device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a human-computer interaction method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the above-described architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the present solution, and does not constitute a limitation on the computing devices to which the present solution applies, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
detecting voice information in the environment where a target user is located when target voice data are played;
if the voice information is the voice information of the target user, pausing playing the target voice data, and recording the playing position of the target voice data at the pausing moment;
recognizing the semantics of the voice information of the target user, and executing response operation corresponding to the semantics;
after the response operation is finished, the remaining content of the target voice data is played from the play position at the pause time.
The implementation principle and technical effect of the computer device provided by the above embodiment are similar to those of the above method embodiments, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
detecting voice information in the environment where a target user is located when target voice data are played;
if the voice information is the voice information of the target user, pausing playing the target voice data, and recording the playing position of the target voice data at the pausing moment;
recognizing the semantics of the voice information of the target user, and executing response operation corresponding to the semantics;
after the response operation is finished, the remaining content of the target voice data is played from the play position at the pause time.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A human-computer interaction method, characterized in that the method comprises:
detecting voice information in the environment where a target user is located when target voice data are played;
if the voice information is the voice information of the target user, pausing playing the target voice data, and recording the playing position of the target voice data at the pausing moment;
recognizing the semantics of the voice information of the target user, and executing response operation corresponding to the semantics;
and after the response operation is finished, playing the residual content of the target voice data from the playing position of the pause time.
2. The method of claim 1, wherein the recognizing the semantics of the voice information of the target user, and performing a response operation corresponding to the semantics, comprises:
determining the information type of the voice information according to the semantics of the voice information of the target user;
and executing response operation corresponding to the semantics according to the information type.
3. The method of claim 2, wherein performing a response operation corresponding to the semantics according to the information type comprises:
if the information type is an inquiry type, matching reply content corresponding to the semantics of the voice information of the target user from a preset knowledge base, and playing the reply content; the knowledge base comprises reply contents corresponding to various information types; or,
and if the information type is an acknowledgment type, pausing for a preset time length and continuing to play the remaining content of the target voice data from the playing position at the pause time.
4. The method according to any one of claims 1 to 3, wherein the recording of the play position of the target speech data at the pause time comprises:
determining clauses corresponding to pause time in the target voice data;
and determining the position of the clause corresponding to the pause time in the target voice data as the playing position of the target voice data at the pause time.
5. The method according to any one of claims 1 to 3, wherein after detecting the voice information in the environment of the target user while playing the target voice data, the method comprises:
and if the voice information is the voice information of the non-target user, no response is made to the voice information, and the target voice data is continuously played.
6. The method according to any one of claims 1-3, further comprising:
and identifying whether the voice information in the environment where the target user is located is the voice information of the target user through a preset voice feature detection model.
7. The method of claim 6, wherein the training process of the voice feature detection model comprises:
acquiring a plurality of sample voice characteristics and user identifications corresponding to the sample voice characteristics;
and learning the corresponding relation between each voice feature and the corresponding user identification through a deep learning algorithm so as to carry out iterative training on the initial voice feature detection model until the variation amplitude of the loss function of the voice feature detection model is smaller than a preset threshold value, thereby obtaining the voice feature detection model.
8. A human-computer interaction device, characterized in that the device comprises:
the detection module is used for detecting the voice information in the environment where the target user is located when the target voice data are played;
the stopping module is used for pausing the playing of the target voice data and recording the playing position of the target voice data at the pausing moment if the voice information is the voice information of the target user;
the response module is used for identifying the semantics of the voice information of the target user and executing response operation corresponding to the semantics;
and the playing module is used for playing the residual content of the target voice data from the playing position of the pause moment after the response operation is finished.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202010400842.1A 2020-05-13 2020-05-13 Human-computer interaction method and device, computer equipment and storage medium Pending CN111752523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010400842.1A CN111752523A (en) 2020-05-13 2020-05-13 Human-computer interaction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111752523A true CN111752523A (en) 2020-10-09

Family

ID=72673738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010400842.1A Pending CN111752523A (en) 2020-05-13 2020-05-13 Human-computer interaction method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111752523A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
CN108307238A (en) * 2018-01-23 2018-07-20 北京中企智达知识产权代理有限公司 A kind of video playing control method, system and equipment
CN109168017A (en) * 2018-10-16 2019-01-08 深圳市三叶虫科技有限公司 A kind of net cast interaction systems and living broadcast interactive mode based on intelligent glasses
CN109509471A (en) * 2018-12-28 2019-03-22 浙江百应科技有限公司 A method of the dialogue of intelligent sound robot is interrupted based on vad algorithm
CN110427460A (en) * 2019-08-06 2019-11-08 北京百度网讯科技有限公司 Method and device for interactive information
CN110661927A (en) * 2019-09-18 2020-01-07 平安科技(深圳)有限公司 Voice interaction method and device, computer equipment and storage medium
CN110853638A (en) * 2019-10-23 2020-02-28 吴杰 Method and equipment for interrupting voice robot in real time in voice interaction process

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111970409A (en) * 2020-10-21 2020-11-20 深圳追一科技有限公司 Voice processing method, device, equipment and storage medium based on man-machine interaction
CN112489642A (en) * 2020-10-21 2021-03-12 深圳追一科技有限公司 Method, apparatus, device and storage medium for controlling voice robot response
CN112489642B (en) * 2020-10-21 2024-05-03 深圳追一科技有限公司 Method, device, equipment and storage medium for controlling voice robot response
CN112037799A (en) * 2020-11-04 2020-12-04 深圳追一科技有限公司 Voice interrupt processing method and device, computer equipment and storage medium
CN112037799B (en) * 2020-11-04 2021-04-06 深圳追一科技有限公司 Voice interrupt processing method and device, computer equipment and storage medium
CN112714058A (en) * 2020-12-21 2021-04-27 浙江百应科技有限公司 Method, system and electronic equipment for instantly interrupting AI voice
CN112596694A (en) * 2020-12-23 2021-04-02 北京城市网邻信息技术有限公司 Method and device for processing house source information
CN112596694B (en) * 2020-12-23 2022-02-11 北京城市网邻信息技术有限公司 Method and device for processing house source information
CN113345437A (en) * 2021-08-06 2021-09-03 百融云创科技股份有限公司 Voice interruption method and device
CN113345437B (en) * 2021-08-06 2021-10-29 百融云创科技股份有限公司 Voice interruption method and device
CN117198293A (en) * 2023-11-08 2023-12-08 北京烽火万家科技有限公司 Digital human voice interaction method, device, computer equipment and storage medium
CN117198293B (en) * 2023-11-08 2024-01-26 北京烽火万家科技有限公司 Digital human voice interaction method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111752523A (en) Human-computer interaction method and device, computer equipment and storage medium
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111105782B (en) Session interaction processing method and device, computer equipment and storage medium
US8914294B2 (en) System and method of providing an automated data-collection in spoken dialog systems
CN112037799B (en) Voice interrupt processing method and device, computer equipment and storage medium
CN107818798A (en) Customer service quality evaluating method, device, equipment and storage medium
CN104538043A (en) Real-time emotion reminder for call
CN111429899A (en) Speech response processing method, device, equipment and medium based on artificial intelligence
CN111145733A (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN111209380B (en) Control method and device for conversation robot, computer equipment and storage medium
CN111462726B (en) Method, device, equipment and medium for answering out call
CN115982400A (en) Multi-mode-based emotion image generation method and server
CN101460994A (en) Speech differentiation
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN112102807A (en) Speech synthesis method, apparatus, computer device and storage medium
KR20200119035A (en) Dialogue system, electronic apparatus and method for controlling the dialogue system
EP4181124A1 (en) Communication system and related methods
CN112101046B (en) Conversation analysis method, device and system based on conversation behavior
CN112738344B (en) Method and device for identifying user identity, storage medium and electronic equipment
KR20200058612A (en) Artificial intelligence speaker and talk progress method using the artificial intelligence speaker
CN112148864B (en) Voice interaction method and device, computer equipment and storage medium
US20220375468A1 (en) System method and apparatus for combining words and behaviors
CN115393077B (en) Data processing method based on loan transaction man-machine conversation system and related device
CN117975951A (en) Man-machine interaction method, system, terminal and storage medium
CN114724587A (en) Voice response method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination