WO2020253128A1 - Voice recognition-based communication service method, apparatus, computer device, and storage medium - Google Patents

Voice recognition-based communication service method, apparatus, computer device, and storage medium

Info

Publication number
WO2020253128A1
Authority
WO
WIPO (PCT)
Prior art keywords
call
data
caller
scene
audio
Prior art date
Application number
PCT/CN2019/122167
Other languages
French (fr)
Chinese (zh)
Inventor
杨一凡
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020253128A1 publication Critical patent/WO2020253128A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M1/72454 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484 - User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M2250/00 - Details of telephonic subscriber devices
    • H04M2250/12 - Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion

Definitions

  • This application relates to the field of data analysis technology, and in particular to a communication service method, device, computer equipment and storage medium based on voice recognition.
  • the embodiments of the present application provide a communication service method, device, computer equipment, and storage medium based on voice recognition, which can inject intervention in a timely and accurate manner while callers are on a call, so as to guide the callers toward a better conversation.
  • this application provides a communication service method based on voice recognition, the method including:
  • According to the type data of the call scene and the emotion data of the second caller, second prompt information is generated and sent to the first call terminal for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions.
  • the present application provides a communication service device based on voice recognition, the device including:
  • An audio acquisition module configured to obtain a first call audio corresponding to the first call terminal and a second call audio corresponding to the second call terminal if the call between the first call terminal and the second call terminal is connected;
  • a voice recognition module configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data
  • a scene recognition module configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene
  • the emotion recognition module is configured to recognize at least one of the first call audio, the second call audio, and the dialogue text data based on a pre-built emotion recognition model, to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal;
  • the first prompting module is configured to generate and send, to the first call terminal according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions;
  • the second prompting module is configured to generate and send, to the first call terminal according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the present application provides a computer device that includes a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and, when executing it, to implement the above-mentioned communication service method based on voice recognition.
  • this application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the above-mentioned voice recognition-based communication service method is implemented.
  • This application discloses a communication service method, device, equipment and storage medium based on voice recognition.
  • the corresponding audio is obtained during a call between a first call terminal and a second call terminal; the dialogue text is then obtained through voice recognition, the call scene is recognized based on the dialogue text, and the callers' emotions are recognized from the acquired audio; prompts are then given to the caller according to the call scene and the emotions, so that intervention is injected in a timely and accurate manner during the call to guide the caller toward a better conversation.
  • FIG. 1 is a schematic diagram of a usage scenario of a communication service method based on voice recognition according to an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a communication service method based on voice recognition according to an embodiment of this application;
  • FIG. 3 is a schematic diagram of a sub-process of obtaining dialogue text data through voice recognition;
  • FIG. 4 is a schematic flowchart of a communication service method based on voice recognition according to another embodiment of this application;
  • FIG. 5 is a schematic diagram of a sub-process of obtaining type data of a call scene;
  • FIG. 6 is a schematic diagram of a sub-process of extracting text features;
  • FIG. 7 is a schematic diagram of a sub-process of extracting text features based on a bag-of-words model;
  • FIG. 8 is a schematic diagram of a sub-process of obtaining emotion data of the first caller;
  • FIG. 9 is a schematic diagram of a sub-process in which the emotion recognition model recognizes and acquires emotion data;
  • FIG. 10 is a schematic flowchart of a communication service method based on voice recognition according to still another embodiment of this application;
  • FIG. 11 is a schematic flowchart of a communication service method based on voice recognition according to yet another embodiment of this application;
  • FIG. 12 is a schematic structural diagram of a communication service device based on voice recognition provided by an embodiment of this application;
  • FIG. 13 is a schematic structural diagram of a communication service device based on voice recognition provided by another embodiment of this application;
  • FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of this application.
  • the embodiments of the present application provide a voice recognition-based communication service method, device, computer equipment, and computer-readable storage medium.
  • the communication service method can be applied to a terminal or a server, so as to intervene in the communication between the callers when needed.
  • In some embodiments, the first call terminal and the second call terminal conduct a call, and the communication service method based on voice recognition is applied to at least one of the first call terminal and the second call terminal.
  • In other embodiments, the first call terminal and the second call terminal conduct a call, the server provides support for the call between them, and the voice recognition-based communication service method is applied to the server.
  • FIG. 1 is a schematic diagram of an application scenario of a communication service method based on voice recognition provided by an embodiment of the present application.
  • the application scenario includes a server, a first call terminal, and a second call terminal.
  • the call terminal can be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable device, a smart speaker, and other electronic devices;
  • the server can be an independent server or a server cluster.
  • FIG. 2 is a schematic flowchart of a communication service method based on voice recognition provided by an embodiment of the present application.
  • the communication service method based on voice recognition includes the following steps S110 to S160.
  • Step S110 If the call between the first call terminal and the second call terminal is connected, obtain the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal.
  • Specifically, the first caller uses the first call terminal to make a call to the second caller, and the second caller uses the second call terminal to answer; the call between the first call terminal and the second call terminal is then connected.
  • When the call between the first call terminal and the second call terminal is connected and the first caller is talking with the second caller, the server provides support for the call between the two terminals.
  • The server collects the audio of the first caller, that is, the first call audio corresponding to the first call terminal, and sends it to the second call terminal so that the speaker of the second call terminal plays it for the second caller; the server likewise collects the audio of the second caller, that is, the second call audio corresponding to the second call terminal, and sends it to the first call terminal so that the speaker of the first call terminal plays it for the first caller.
  • Step S120 Perform voice recognition on the first call audio and the second call audio to obtain dialog text data.
  • the server converts the first call audio and the second call audio into text by means of voice recognition to obtain dialog text data.
  • step S120 performs voice recognition on the first call audio and the second call audio to obtain dialog text data, which specifically includes step S121 to step S123.
  • Step S121 Perform voice recognition on the first call audio to obtain a first text corresponding to the first caller.
  • When collecting the first call audio corresponding to the first call terminal, the server performs voice recognition on the collected first call audio and marks the recognized text as the first text.
  • Step S122 Perform voice recognition on the second call audio to obtain a second text corresponding to the second caller.
  • When collecting the second call audio corresponding to the second call terminal, the server performs voice recognition on the collected second call audio and marks the recognized text as the second text.
  • Step S123 Sort the first text and the second text according to a preset sorting rule to obtain dialogue text data.
  • Specifically, the first texts and the second texts are sorted according to the preset sorting rule to obtain the dialogue text data.
  • The dialogue text data then includes a plurality of first texts and second texts arranged alternately; one possible sorting rule is sketched below.
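A minimal sketch, in Python, of one possible sorting rule: each recognized utterance carries a start timestamp, and the two transcripts are interleaved by time. The Utterance type, its fields, and the speaker labels are assumptions for the example; the patent does not fix the sorting rule.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str       # "first_caller" or "second_caller" (assumed labels)
    start_time: float  # seconds since the call was connected (assumed field)
    text: str

def merge_dialogue(first_texts, second_texts):
    # Interleave the two recognized transcripts by start time so the
    # dialogue text data alternates between the two callers.
    return sorted(first_texts + second_texts, key=lambda u: u.start_time)

dialogue = merge_dialogue(
    [Utterance("first_caller", 0.0, "Hello, Mr. Wang, I am XX")],
    [Utterance("second_caller", 2.5, "Hello, who is this?")],
)
```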
  • Step S130 Recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene.
  • the scene recognition model stores or learns several scene recognition rules, and the scene recognition model recognizes the call scene corresponding to the dialogue text data based on the scene recognition rules.
  • step S130 recognizes the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene, including step S131.
  • Step S131 Based on the scene rule engine with built-in scene judgment rules, analyze the conversation text data to obtain type data of the call scene.
  • the scene rule engine is a rule engine with built-in scene judgment rules, such as the Drools rule engine.
  • Rule engines originated from rule-based expert systems, which are a branch of expert systems. Expert systems belong to the field of artificial intelligence: they imitate human reasoning, reason with tentative (heuristic) methods, and explain and justify their conclusions in human-understandable terms.
  • the rule engine is a core technical component designed to respond to and process complex business rules. By introducing the rule engine, it is possible to dynamically define and adjust scene judgment rules in a timely manner through flexible configuration.
  • The built-in scene judgment rules of the scene rule engine are rule sets based on practical experience, and this embodiment does not limit how the scene judgment rules are set. For example, if the dialogue text data includes "Hello, Mr. Wang, I am XX", the scene recognition model recognizes the type of the call scene corresponding to the dialogue text data as a stranger call based on a matching scene judgment rule (a simplified sketch follows).
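As a rough illustration only, the stand-in below expresses scene judgment rules as plain Python predicates; a production system would use a real rule engine such as Drools, and the second rule is an invented example.

```python
# Simplified stand-in for the scene rule engine; not a real rule engine.
SCENE_RULES = [
    # Mirrors the example above: a formal greeting suggests a stranger call.
    (lambda text: "Hello, Mr." in text and "I am" in text, "stranger call"),
    # Hypothetical extra rule, purely for illustration.
    (lambda text: "Mom" in text or "Dad" in text, "family call"),
]

def judge_scene(dialogue_text: str) -> str:
    for condition, scene_type in SCENE_RULES:
        if condition(dialogue_text):
            return scene_type
    return "unknown"

print(judge_scene("Hello, Mr. Wang, I am XX"))  # -> "stranger call"
```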
  • the construction of the scene rule engine includes: first, obtaining a number of scene judgment rules matching a preset rule modification template; then precompiling and testing the scene judgment rules and, after the tests pass, generating a script file from them; and finally storing the script file on the server and associating it with the rule calling interface of the scene rule engine, so that the engine can call the corresponding scene judgment rules.
  • the rule modification template is a visual rule modification template.
  • By visualizing the rule modification template, relevant personnel can edit the template directly to generate scene judgment rules; personnel who understand the call scene judgment rules can thus modify them through the template without knowing the implementation behind it.
  • This further lowers the threshold for using the rule engine and helps improve the accuracy of the scene rule engine's recognition of call scenes.
  • the scene recognition model may be constructed in the following manner: the scene recognition model is obtained by learning from a set of scene training samples through a machine learning algorithm.
  • step S130 recognizes the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, including step S132 and step S133.
  • Step S132 Extract text features in the dialogue text data.
  • Specifically, feature words are extracted from the dialogue text data and quantified to represent the text information, that is, the text features of the dialogue text data; this abstracts the dialogue text data into a mathematical model that describes and stands in for it.
  • In some embodiments, text features are extracted from the dialogue text data based on a bag-of-words (BOW) model.
  • step S132 extracts the text features in the dialogue text data, including step S1321 and step S1322.
  • Step S1321 filter out noisy characters in the dialogue text data according to a preset filtering rule.
  • the stop words in the dialogue text data are deleted or replaced with preset symbols.
  • Specifically, filler particles and other noise characters and invalid words can be designated as stop words according to the call scene, so as to build a stop word database that is saved in the form of a configuration file.
  • the server calls the stop word database when needed.
  • Each stop word in the stop word database is looked up in the dialogue text data; if it appears, it is deleted from the dialogue text data. Alternatively, each stop word found in the dialogue text data is replaced with a preset symbol, such as a space, to preserve the structure of the dialogue text data to some extent (a sketch follows).
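A minimal sketch of this filtering step, assuming the stop word database has already been loaded from its configuration file (the entries below are placeholders):

```python
STOP_WORDS = {"uh", "um", "you know"}  # placeholder entries; real ones come from the config file

def filter_noise(dialogue_text: str, replace_with_space: bool = True) -> str:
    # Either delete each stop word, or replace it with a space to
    # preserve the structure of the dialogue text to some extent.
    for word in STOP_WORDS:
        dialogue_text = dialogue_text.replace(word, " " if replace_with_space else "")
    return dialogue_text
```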
  • Step S1322 based on the bag-of-words model, extract text features from the dialogue text data with noise characters filtered out.
  • A bag-of-words (BOW) model is a representation of text that describes the occurrence of words within a document.
  • The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It involves two aspects: a vocabulary of known words, and a measure of the presence of those known words.
  • the bag-of-words model includes a dictionary, and the dictionary includes several words.
  • The bag-of-words model splits the noise-filtered dialogue text data into words and, as if putting all of the words into a bag, ignores word order, grammar, syntax, and other elements, treating the text as just a collection of words; each word's appearance in the dialogue text data is independent of whether other words appear.
  • The text features that the bag-of-words model extracts from the noise-filtered dialogue text data include bag-of-words feature vectors.
  • step S1322 is based on the bag-of-words model to extract text features from the dialogue text data with noise characters filtered out, including steps S1301-step S1303.
  • Step S1301 initialize the all-zero bag-of-words feature vector.
  • the elements in the bag-of-words feature vector correspond one-to-one with words in the dictionary of the bag-of-words model.
  • Step S1302 Count the number of occurrences of each word in the dictionary in the dialogue text data from which the noise character is filtered out.
  • Step S1303 Assign a value to the corresponding element in the bag of words feature vector according to the number of times the word appears in the dialogue text data.
  • For example, suppose the dictionary of the bag-of-words model contains the words {Xiao Ming, likes, watching, movies, also, playing, football}. If the dialogue text data with noise characters removed is "Xiao Ming likes watching movies", the bag-of-words feature vector is [1, 1, 1, 1, 0, 0, 0]; if it is "Xiao Ming likes watching movies and Xiao Ming also likes playing football", the bag-of-words feature vector is [2, 2, 1, 1, 1, 1, 1]. The sketch below reproduces this counting.
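A minimal sketch of steps S1301 to S1303, using the dictionary implied by the example above (the dictionary order is an assumption made to match the vectors):

```python
DICTIONARY = ["Xiao Ming", "likes", "watching", "movies", "also", "playing", "football"]

def bow_vector(text: str) -> list:
    vector = [0] * len(DICTIONARY)       # S1301: initialize an all-zero vector
    for i, word in enumerate(DICTIONARY):
        vector[i] = text.count(word)     # S1302/S1303: count occurrences and assign
    return vector

print(bow_vector("Xiao Ming likes watching movies"))
# [1, 1, 1, 1, 0, 0, 0]
print(bow_vector("Xiao Ming likes watching movies and Xiao Ming also likes playing football"))
# [2, 2, 1, 1, 1, 1, 1]
```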
  • Step S133 Based on the trained machine learning model, the type data of the call scene is identified according to the text features in the dialogue text data.
  • the text features in the dialogue text data are used as the input of the trained machine learning model, and the output of the machine learning model is used as the type data of the identified call scene.
  • the scene training sample set used to train the machine learning model includes several scene training samples.
  • the scene training sample includes historical dialogue text data and scene type data corresponding to the historical dialogue text data. Text features can be extracted from historical dialogue text data.
  • the scene type data is the annotation data of the historical dialogue text data.
  • During model training, the text features corresponding to the historical dialogue text data are used as input data and the scene type data as output data; a selected machine learning model learns from a scene training sample set containing a large number of scene training samples to obtain the trained machine learning model.
  • The trained machine learning model can be set as a model that recognizes only a single type of call scene; in that case, the type data of the call scene obtained by recognizing the dialogue text data with the pre-built scene recognition model reflects whether the first caller and the second caller are in that specific call scene.
  • The trained machine learning model can also be set as a model that recognizes multiple types of call scenes; in that case, the type data of the call scene obtained by recognizing the dialogue text data with the pre-built scene recognition model reflects the probability that the first caller and the second caller are in each of several specific call scenes. A training sketch follows.
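A minimal training sketch under these descriptions, assuming scikit-learn and toy data; the patent does not name a model family, and logistic regression is used here only because it yields both a single label (via predict) and per-scene probabilities (via predict_proba).

```python
from sklearn.linear_model import LogisticRegression

# Toy bag-of-words features of historical dialogue texts (input data)
X_train = [[1, 1, 1, 1, 0, 0, 0],
           [0, 0, 0, 0, 1, 1, 1]]
# Annotated scene type data (output data)
y_train = ["stranger call", "family call"]

scene_model = LogisticRegression().fit(X_train, y_train)
scene_type = scene_model.predict([[1, 1, 1, 1, 0, 0, 0]])[0]      # single scene label
scene_probs = scene_model.predict_proba([[1, 1, 1, 1, 0, 0, 0]])  # probabilities per scene
```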
  • Step S140 Recognize the first call audio and the second call audio based on the pre-built emotion recognition model, to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
  • Specifically, the server recognizes the first call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller, and recognizes the second call audio based on the same model to obtain the emotion data of the second caller.
  • a machine learning algorithm is used to obtain the emotion recognition model from a set of emotion training samples.
  • the emotion training sample set includes several emotion training samples.
  • the emotion training sample includes historical audio data and emotion type data corresponding to the historical audio data.
  • From the historical audio data, feature data such as volume features, speech rate features, smooth features, and pause features can be extracted.
  • The emotion type data is the annotation data of the historical audio data and is used during model training.
  • During training, the feature data corresponding to the historical audio data is used as input data and the emotion type data as output data; a selected machine learning model learns from the emotion training sample set, which includes several emotion training samples, to obtain the emotion recognition model.
  • In some embodiments, the first call audio is first processed to obtain a smooth feature reflecting the smoothness of the first caller's voice and a pause feature reflecting the duration of pauses. Specifically, the smooth feature is identified by detecting and evaluating the jitter frequency of the first caller's voice, and the pause feature is obtained by starting a timer when the voices of the first caller and the second caller stop.
  • the trained emotion recognition model can recognize the emotion data of the first caller based on smooth features, pause features, volume features, and/or speech rate features.
  • the emotion recognition model can recognize the second call audio to obtain the emotion data of the second caller.
  • For example, the emotion recognition model recognizes the emotion data of the first caller corresponding to the first call terminal as "excited"; or the emotion recognition model recognizes the emotion data of the first caller corresponding to the first call terminal as "nervous".
  • In some embodiments, the emotion recognition model recognizes the dialogue text data to obtain text features, and can also identify the emotion data of the first caller or the second caller based on those text features. For example, if the second text in the dialogue text data includes the sentence "You need to be calm and not excited" spoken by the second caller, the emotion recognition model can recognize the emotion of the first caller as "excited"; if the second text includes the sentence "you this **" spoken by the second caller, the model can identify the emotion of the second caller as "excited" or "angry".
  • In some embodiments, step S140, recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, specifically includes step S141 and step S142.
  • Step S141 Recognizing the first call audio and dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
  • Specifically, the volume feature, speech rate feature, smooth feature, and/or pause feature extracted from the first call audio and the text feature extracted from the dialogue text data are merged as the input of the emotion recognition model, and the emotion recognition model recognizes them to obtain the emotion data of the first caller; this further improves the accuracy of model recognition.
  • Step S142 Recognizing the second call audio and dialogue text data based on the pre-built emotion recognition model to obtain the second caller's emotion data corresponding to the second call terminal.
  • Specifically, the volume feature, speech rate feature, smooth feature, and/or pause feature extracted from the second call audio and the text feature extracted from the dialogue text data are merged as the input of the emotion recognition model, and the emotion recognition model recognizes them to obtain the emotion data of the second caller; this further improves the accuracy of model recognition.
  • In some embodiments, step S141, recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, specifically includes steps S1411 to S1413.
  • Step S1411 extract at least one of a volume feature, a speech rate feature, a smooth feature, and a pause feature from the first call audio.
  • The volume feature reflects the amplitude of the first call audio; the speech rate feature is obtained by calculating the rate of change of the energy envelope of the first call audio in the time domain; the smooth feature is obtained by detecting and evaluating the jitter frequency of the first caller's voice; and the pause feature is obtained by starting a timer when the voices of the first caller and the second caller stop. A sketch of two of these features follows.
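As one possible reading of these feature definitions, the sketch below computes a volume feature (RMS amplitude) and a pause feature with NumPy; the frame length and silence threshold are assumptions, and counting low-energy frames stands in for the literal timer described above.

```python
import numpy as np

def volume_feature(samples: np.ndarray) -> float:
    # Volume as the root-mean-square amplitude of the call audio.
    return float(np.sqrt(np.mean(samples ** 2)))

def pause_feature(samples: np.ndarray, sample_rate: int,
                  frame_ms: int = 20, threshold: float = 0.01) -> float:
    # Approximate total pause duration (seconds) by counting low-energy frames.
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    silent = np.sqrt(np.mean(frames ** 2, axis=1)) < threshold
    return float(silent.sum() * frame_ms / 1000.0)
```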
  • Step S1412 extract text features from the dialogue text data.
  • the text features of the dialog text data extracted in step S132 can be reused.
  • Step S1413 Based on the pre-built emotion recognition model, process the text feature and at least one of the volume feature, speech rate feature, smooth feature, and pause feature, to obtain the emotion data of the first caller corresponding to the first call terminal. This embodiment does not limit the specific form of the pre-built emotion recognition model.
  • Specifically, the text feature and the volume feature, speech rate feature, smooth feature, and pause feature are fused, for example by splicing them together, and used as the input of the emotion recognition model; the model then recognizes the emotion data of the first caller, further improving the accuracy of model recognition (see the sketch below).
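The splicing mentioned above amounts to simple concatenation; a one-line sketch, where the ordering of the audio features is an assumption:

```python
import numpy as np

def fuse_features(text_features: np.ndarray, audio_features: np.ndarray) -> np.ndarray:
    # Splice text and audio features into a single input vector
    # for the emotion recognition model.
    return np.concatenate([text_features, audio_features])

fused = fuse_features(
    np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]),  # text (bag-of-words) features
    np.array([0.32, 4.1, 0.8, 1.5]),  # volume, speech rate, smooth, pause (assumed order)
)
```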
  • the emotion training sample set includes several emotion training samples.
  • the emotion training sample includes historical audio data, corresponding dialogue text data and corresponding emotion type data.
  • From the historical audio data, volume features, speech rate features, smooth features, pause features, and so on can be extracted, and text features can be obtained from the corresponding dialogue text data.
  • The emotion type data is the annotation data of the historical audio data and is used during model training.
  • During training, the volume feature, speech rate feature, smooth feature, pause feature, and text feature corresponding to the historical audio data are used as input data and the emotion type data as output data; a selected machine learning model learns from the emotion training sample set, which includes several emotion training samples, to obtain the emotion recognition model.
  • Step S150 Generate and send first prompt information for prompting the first caller to adjust emotions to the first call terminal according to the type data of the call scene and the emotion data of the first caller.
  • For example, the first prompt information generated and sent to the first call terminal includes prompts such as "your emotion is too excited".
  • the first prompt information may be provided to the first caller using the first call terminal in a manner of display or sound.
  • In some embodiments, step S150, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, includes step S151:
  • Step S151 Based on the prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and send the first prompt information to the first call terminal to prompt the first caller to adjust emotions.
  • the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine.
  • For example, the prompt rule engine includes a prompt rule: if the type of the call scene is a father-son call and the emotion data of the first caller is "excited", first prompt information indicating the excited emotion is generated (a simplified sketch follows).
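A toy stand-in for such a prompt rule, expressed as a lookup table rather than a real rule engine; the rule key and message follow the example above.

```python
# Simplified stand-in for the prompt rule engine; not a real rule engine.
FIRST_PROMPT_RULES = {
    ("father and son", "excited"): "Your emotion is too excited",
}

def first_prompt(scene_type: str, first_caller_emotion: str):
    # Returns the first prompt information, or None if no rule matches.
    return FIRST_PROMPT_RULES.get((scene_type, first_caller_emotion))
```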
  • In other embodiments, step S150, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, includes step S152:
  • Step S152 Based on the pre-trained first prompt model, generate and send to the first call terminal according to the type data of the call scene, the emotional data of the first caller, and the dialog text data for prompting The first prompt message for adjusting the emotion of the first caller.
  • the first prompt model may be constructed in the following manner: a machine learning algorithm is used to obtain the first prompt model from the first prompt training sample set.
  • the first prompt training sample set includes a plurality of first prompt training samples.
  • Each first prompt training sample includes type data of the historical call scene, historical emotion data corresponding to the first caller, text features corresponding to the historical dialogue text data, and prompt information corresponding to the training sample.
  • The prompt information is the annotation data of the training sample. During model training, the type data of the historical call scene, the historical emotion data corresponding to the first caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data; a selected machine learning model learns from the first prompt training sample set, which includes the first prompt training samples, to obtain the first prompt model.
  • the first prompt model can learn the verbal rules in the call based on the historical dialogue text data, and can provide prompts including the verbal information when generating and prompting information.
  • For example, the first prompt information generated and sent to the first call terminal includes prompts such as "You are excited, try talking about the weather" (see the sketch below).
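A minimal sketch of such a first prompt model, treating prompt generation as classification over a set of candidate prompt messages; the model family, the encoding of scene and emotion as integer ids, and the toy data are all assumptions, since the patent leaves the model open.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [scene-type id, first-caller emotion id, text features...] (toy encoding)
X_train = [[0, 1, 1, 0],
           [1, 0, 0, 1]]
# Annotated prompt information used as output data
y_train = ["You are excited, try talking about the weather",
           "Slow down and let the other side finish speaking"]

prompt_model = DecisionTreeClassifier().fit(X_train, y_train)
first_prompt_info = prompt_model.predict([[0, 1, 1, 0]])[0]
```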
  • Step S160 Generate and send to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • For example, if the type of the call scene is a call between mother and child and the emotion data of the second caller is "tired", a second prompt message including "your mother has been exhausted recently" is generated and sent to the first call terminal; or, if the type of the call scene is a conversation between lovers and the emotion data of the second caller is "acting like a baby", a second prompt message including "your girlfriend is acting like a baby" is generated and sent to the first call terminal; or, if the type of the call scene is a call between friends and the emotion data of the second caller is "angry", a second prompt message including "your friend is angry" is generated and sent to the first call terminal.
  • the second prompt information may be provided to the first caller using the first call terminal in a manner of display or sound.
  • In some embodiments, step S160, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller, includes step S161:
  • Step S161 Based on the prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine.
  • For example, the prompt rule engine includes a prompt rule: if the type of the call scene is a conversation between lovers and the emotion data of the second caller is "acting like a baby", a second prompt message including "your girlfriend is acting like a baby" is generated.
  • In other embodiments, step S160, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller, includes step S162:
  • Step S162 Based on the pre-trained second prompt model, generate and send to the first call terminal according to the type data of the call scene, the emotional data of the second caller, and the dialog text data for prompting The first caller adjusts the dialogue strategy to deal with the second prompt message of the emotion of the second caller.
  • the second prompt model may be constructed in the following manner: the second prompt model is obtained by learning from the second prompt training sample set through a machine learning algorithm.
  • the second prompt training sample set includes a plurality of second prompt training samples.
  • Each second prompt training sample includes type data of the historical call scene, historical emotion data corresponding to the second caller, text features corresponding to the historical dialogue text data, and prompt information corresponding to the training sample.
  • The prompt information is the annotation data of the training sample. During model training, the type data of the historical call scene, the historical emotion data corresponding to the second caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data; a selected machine learning model learns from the second prompt training sample set, which includes the second prompt training samples, to obtain the second prompt model.
  • the second prompt model can learn the verbal rules in the call based on the historical dialogue text data, and can provide prompts including verbal information when generating and prompting information.
  • For example, if the type of the call scene is a call between mother and child and the emotion data of the second caller is "tired", the second prompt message generated and sent to the first call terminal includes "Your mother has been tired recently; ask about her daily life"; or, if the type of the call scene is a conversation between lovers and the emotion data of the second caller is "acting like a baby", the second prompt message includes "Your girlfriend is acting like a baby; tenderly call her baby"; or, if the type of the call scene is a call between friends and the emotion data of the second caller is "angry", the second prompt message includes "Your friend is angry; try talking about the weather".
  • In other embodiments, the first prompt model in step S152 and the second prompt model in step S162 can be integrated into one prompt model. Specifically, an identifier in each prompt training sample can indicate the intended prompt target; the prompt model running on the server can then generate the corresponding prompt information, predict the prompt target for that information, and send the prompt information to that target, for example to the first call terminal or to the second call terminal.
  • In some embodiments, when the first prompt information for prompting the first caller to adjust emotions is sent to the first call terminal in step S150, the sending of the first call audio corresponding to the first call terminal to the second call terminal is suspended, so as to shield the first prompt information from the second caller.
  • Similarly, when the second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions is sent to the first call terminal in step S160, the sending of the first call audio corresponding to the first call terminal to the second call terminal is suspended, so as to shield the second prompt information from the second caller.
  • Specifically, when the server sends the corresponding prompt information to the first call terminal, the first call terminal prompts the first caller by means of a voice prompt. At this time, the server can pause collecting the audio picked up by the microphone of the first call terminal, that is, the first call audio, for example by switching the call mode of the first call terminal to mute; it thereby stops sending the first call audio containing the voice prompt to the second call terminal, so neither the first prompt information nor the second prompt information is heard by the second caller. A sketch of this bridging logic follows.
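A hypothetical sketch of this server-side bridging logic: forwarding of the first terminal's audio is gated by a mute flag that is set while a voice prompt plays on the first terminal. All names here are invented for illustration.

```python
class CallBridge:
    def __init__(self):
        self.muted = False  # set while a prompt is being played on the first terminal

    async def forward_first_audio(self, chunk: bytes, send_to_second):
        if not self.muted:           # pause forwarding while prompting
            await send_to_second(chunk)

    async def play_prompt(self, prompt: str, play_on_first):
        self.muted = True            # shield the prompt from the second caller
        try:
            await play_on_first(prompt)
        finally:
            self.muted = False       # resume normal forwarding
```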
  • In the embodiments above, the corresponding audio is obtained during a call between the first call terminal and the second call terminal; the dialogue text is then obtained through voice recognition, the call scene is recognized from the dialogue text, and the callers' emotions are recognized from the acquired audio; corresponding prompts are then given to the caller according to the call scene and the callers' emotions, so that intervention is injected in a timely and accurate manner during the call to guide the caller toward a better conversation.
  • FIG. 12 is a schematic structural diagram of a voice recognition-based communication service device provided by an embodiment of the present application.
  • the voice recognition-based communication service device may be configured in a server for performing the aforementioned voice recognition-based communication service method.
  • the communication service device based on voice recognition includes: an audio acquisition module 110, a voice recognition module 120, a scene recognition module 130, an emotion recognition module 140, a first prompt module 150, and a second prompt module 160.
  • the audio obtaining module 110 is configured to obtain the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal if the call between the first call terminal and the second call terminal is connected .
  • the voice recognition module 120 is configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data.
  • the speech recognition module 120 includes:
  • the first voice sub-module 121 is configured to perform voice recognition on the first call audio to obtain the first text corresponding to the first caller;
  • the second voice submodule 122 is configured to perform voice recognition on the second call audio to obtain the second text corresponding to the second caller;
  • the text sorting sub-module 123 is used to sort the first text and the second text according to a preset sorting rule to obtain dialogue text data.
  • the scene recognition module 130 is configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene.
  • the scene recognition module 130 includes:
  • the scene rule sub-module 131 is used to analyze the dialogue text data, based on the scene rule engine with built-in scene judgment rules, to obtain the type data of the call scene.
  • the scene recognition module 130 includes:
  • the feature extraction sub-module 132 is used to extract text features in the dialogue text data
  • the scene recognition sub-module 133 is used to identify the type data of the call scene according to the text features in the conversation text data based on the trained machine learning model.
  • the emotion recognition module 140 is configured to recognize the first call audio and the second call audio based on a pre-built emotion recognition model, to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
  • the emotion recognition module 140 includes:
  • the first emotion recognition sub-module 141 is configured to recognize the first call audio and dialog text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
  • the first emotion recognition sub-module 141 includes:
  • An audio feature extraction sub-module for extracting at least one of a volume feature, a speech rate feature, a smooth feature, and a pause feature from the first call audio;
  • the emotion data acquisition sub-module is used to process the text feature and at least one of the volume feature, speech rate feature, smooth feature, and pause feature based on the pre-built emotion recognition model to obtain the first Emotional data of the first caller corresponding to the call terminal.
  • the second emotion recognition sub-module 142 is configured to recognize the second call audio and conversation text data based on a pre-built emotion recognition model to obtain the second caller's emotion data corresponding to the second call terminal.
  • the first prompt module 150 is configured to generate and send to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions.
  • the first prompting module 150 includes:
  • the first prompt rule sub-module 151 is used to analyze, based on the prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and to send the first prompt information to the first call terminal to prompt the first caller to adjust emotions.
  • the first prompting module 150 includes:
  • the first prompt generation sub-module 152 is configured to generate and send to the first call terminal, based on the pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust emotions.
  • the second prompting module 160 is configured to generate and send to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the second prompting module 160 includes:
  • the second prompt rule sub-module 161 is configured to analyze, based on the prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and to send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the second prompting module 160 includes:
  • the second prompt generation sub-module 162 is configured to generate and send to the first call terminal, based on the pre-trained second prompt model and according to the type data of the call scene, the emotion data of the second caller, and the dialogue text data, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the method and device of this application can be used in many general or special computing system environments or configurations.
  • the above-mentioned method and apparatus may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 14.
  • FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any communication service method based on voice recognition.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any communication service method based on voice recognition.
  • the network interface is used for network communication, such as sending assigned tasks.
  • The structure of the computer device shown is only a block diagram of the parts related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or use a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU); the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in a memory to implement the following steps:
  • According to the type data of the call scene and the emotion data of the second caller, second prompt information is generated and sent to the first call terminal for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions.
  • the processor when the processor implements voice recognition on the first call audio and the second call audio to obtain dialog text data, it is specifically implemented: perform voice recognition on the first call audio to obtain the first call The first text corresponding to the person; perform voice recognition on the second call audio to obtain the second text corresponding to the second caller; sort the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  • the processor realizes the recognition of the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, it is specifically realized: the scene rule engine based on the built-in scene judgment rule is used for the dialogue The text data is analyzed to obtain the type data of the call scene.
  • the processor realizes the recognition of the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene, it is specifically implemented: extracting text features in the dialogue text data; based on the trained The machine learning model recognizes the type data of the call scene according to the text features in the dialog text data.
  • When the processor implements recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, it specifically implements: recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and recognizing the second call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
  • When the processor implements recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the specific implementation is: extract at least one of the volume feature, speech rate feature, smooth feature, and pause feature from the first call audio; extract the text feature from the dialogue text data; and, based on the pre-built emotion recognition model, process the text feature and at least one of the volume feature, speech rate feature, smooth feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  • When the processor implements generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, the specific implementation is: based on the prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and send the first prompt information to the first call terminal to prompt the first caller to adjust emotions; or, based on a pre-trained first prompt model, generate and send to the first call terminal, according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust emotions.
  • When the processor implements sending the first prompt information or the second prompt information to the first call terminal, it also implements: suspending the sending of the first call audio corresponding to the first call terminal to the second call terminal, so as to shield the first prompt information or the second prompt information from the second caller.
  • An embodiment of this application further provides a computer-readable storage medium that stores a computer program; the computer program includes program instructions, and a processor executes the program instructions to implement any voice recognition-based communication service method provided in the embodiments of this application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, such as the hard disk or memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Environmental & Geological Engineering (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

Provided are a voice recognition-based communication service method, apparatus, computer device, and storage medium, the method comprising: obtaining first audio of a first terminal and second audio of a second terminal, and recognizing them to obtain a call scene; recognizing the first audio to obtain the emotion of a first caller, and recognizing the second audio to obtain the emotion of a second caller; sending prompt information to the first terminal according to the call scene and the emotion of the first caller; and sending prompt information to the first terminal according to the call scene and the emotion of the second caller.

Description

Voice recognition-based communication service method, apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 5, 2019, with application number 201910605732.6 and the invention title "Voice recognition-based communication service method, apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of data analysis, and in particular to a voice recognition-based communication service method, apparatus, computer device, and storage medium.
Background
People can make calls through existing telecom operators or other social platforms, but the services these provide are fairly limited. For example, the communication between callers sometimes needs some intervention to better achieve its purpose, but existing communication service platforms cannot inject such intervention in a timely and accurate manner while callers are on a call, so as to guide them toward a better conversation.
Summary of the Invention
The embodiments of this application provide a voice recognition-based communication service method, apparatus, computer device, and storage medium, which can inject timely and accurate intervention while callers are on a call, so as to guide the callers toward a better conversation.
In a first aspect, this application provides a voice recognition-based communication service method, the method including:
if a call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, and sending it to the first call terminal;
generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and sending it to the first call terminal.
In a second aspect, this application provides a voice recognition-based communication service apparatus, the apparatus including:
an audio acquisition module, configured to acquire first call audio corresponding to a first call terminal and second call audio corresponding to a second call terminal if a call between the first call terminal and the second call terminal is connected;
a voice recognition module, configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data;
a scene recognition module, configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
an emotion recognition module, configured to recognize at least one of the first call audio, the second call audio, and the dialogue text data based on a pre-built emotion recognition model, to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
a first prompt module, configured to generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, and send it to the first call terminal;
a second prompt module, configured to generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and send it to the first call terminal.
In a third aspect, this application provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, implement the above voice recognition-based communication service method.
In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above voice recognition-based communication service method.
This application discloses a voice recognition-based communication service method, apparatus, device, and storage medium. Corresponding audio is acquired during a call between a first call terminal and a second call terminal; dialogue text is then obtained through voice recognition, the call scene is recognized from the dialogue text, and the callers' emotions are recognized from the acquired audio; corresponding prompts are then given to a caller according to the call scene and the emotions, thereby injecting timely and accurate intervention while the callers are on a call, so as to guide them toward a better conversation.
Description of the Drawings
In order to explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative work.
FIG. 1 is a schematic diagram of a usage scenario of a voice recognition-based communication service method according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a voice recognition-based communication service method according to an embodiment of this application;
FIG. 3 is a schematic diagram of the sub-process of obtaining dialogue text data through voice recognition;
FIG. 4 is a schematic flowchart of a voice recognition-based communication service method according to another embodiment of this application;
FIG. 5 is a schematic diagram of the sub-process of obtaining the type data of the call scene;
FIG. 6 is a schematic diagram of the sub-process of extracting text features;
FIG. 7 is a schematic diagram of the sub-process of extracting text features based on a bag-of-words model;
FIG. 8 is a schematic diagram of the sub-process of obtaining the emotion data of the first caller;
FIG. 9 is a schematic diagram of the sub-process in which the emotion recognition model obtains emotion data;
FIG. 10 is a schematic flowchart of a voice recognition-based communication service method according to still another embodiment of this application;
FIG. 11 is a schematic flowchart of a voice recognition-based communication service method according to yet another embodiment of this application;
FIG. 12 is a schematic structural diagram of a voice recognition-based communication service apparatus provided by an embodiment of this application;
FIG. 13 is a schematic structural diagram of a voice recognition-based communication service apparatus provided by another embodiment of this application;
FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The flowcharts shown in the drawings are merely illustrations and need not include all contents and operations/steps, nor be executed in the described order. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to actual conditions. In addition, although the functional modules are divided in the schematic diagrams of the apparatus, in some cases they may be divided differently from the schematic diagrams.
The embodiments of this application provide a voice recognition-based communication service method, apparatus, computer device, and computer-readable storage medium. The communication service method can be applied to a terminal or a server to intervene in the communication between callers when needed.
In some embodiments, a first call terminal and a second call terminal conduct a call, and the voice recognition-based communication service method is applied to at least one of the first call terminal and the second call terminal. In other embodiments, a first call terminal and a second call terminal conduct a call, a server provides support for the call between them, and the voice recognition-based communication service method can be applied to that server. Please refer to FIG. 1, which is a schematic diagram of an application scenario of the voice recognition-based communication service method provided by an embodiment of this application. The application scenario includes a server, a first call terminal, and a second call terminal.
The call terminals may be electronic devices such as mobile phones, tablet computers, notebook computers, desktop computers, personal digital assistants, wearable devices, and smart speakers; the server may be an independent server or a server cluster.
For ease of understanding, however, the following embodiments describe in detail the voice recognition-based communication service method as applied to a server.
Some embodiments of this application are described in detail below with reference to the drawings. Where there is no conflict, the following embodiments and the features in them can be combined with each other.
Please refer to FIG. 2, which is a schematic flowchart of a voice recognition-based communication service method provided by an embodiment of this application.
As shown in FIG. 2, the voice recognition-based communication service method includes the following steps S110 to S160.
Step S110: if a call between a first call terminal and a second call terminal is connected, acquire first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal.
In some implementations, a first caller uses the first call terminal to dial a second caller, and the second caller uses the second call terminal to answer the call; the call between the first call terminal and the second call terminal is then connected.
When the call between the first call terminal and the second call terminal is connected and the first caller is talking with the second caller, the server provides support for the call between the two terminals. Exemplarily, the server collects the first caller's audio, i.e. the first call audio corresponding to the first call terminal, and sends it to the second call terminal so that the speaker of the second call terminal plays the audio to the second caller; the server also collects the second caller's audio, i.e. the second call audio corresponding to the second call terminal, and sends it to the first call terminal so that the speaker of the first call terminal plays the audio to the first caller. Therefore, when the server detects that the call between the first call terminal and the second call terminal is connected, it can acquire the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal.
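To make the relay role above concrete, the following is a minimal sketch; it is not part of the original disclosure, and the queue-based interface and all names are illustrative assumptions standing in for real media streams:

```python
import queue

def relay_once(first_mic: queue.Queue, second_mic: queue.Queue,
               first_speaker: queue.Queue, second_speaker: queue.Queue,
               recorder: list) -> None:
    """One pass of the server-side relay: forward each side's pending
    audio chunks to the other side, keeping a copy for recognition."""
    while not first_mic.empty():
        chunk = first_mic.get()            # first call audio
        recorder.append(("first", chunk))  # retained for step S120
        second_speaker.put(chunk)          # played to the second caller
    while not second_mic.empty():
        chunk = second_mic.get()           # second call audio
        recorder.append(("second", chunk))
        first_speaker.put(chunk)
```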
Step S120: perform voice recognition on the first call audio and the second call audio to obtain dialogue text data.
Specifically, the server converts the first call audio and the second call audio into text by means of voice recognition to obtain the dialogue text data.
In some implementations, as shown in FIG. 3, step S120 of performing voice recognition on the first call audio and the second call audio to obtain dialogue text data specifically includes steps S121 to S123.
Step S121: perform voice recognition on the first call audio to obtain first text corresponding to the first caller.
Exemplarily, while collecting the first call audio corresponding to the first call terminal, the server performs voice recognition on the collected audio and marks the recognized text as first text.
Step S122: perform voice recognition on the second call audio to obtain second text corresponding to the second caller.
Exemplarily, while collecting the second call audio corresponding to the second call terminal, the server performs voice recognition on the collected audio and marks the recognized text as second text.
Step S123: sort the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
Exemplarily, the first text and the second text are sorted by recording time to obtain the dialogue text data.
Exemplarily, the dialogue text data includes multiple first texts and second texts arranged in alternation.
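The interleaving by recording time can be sketched as follows; this is a minimal illustration, and the `Utterance` structure and its timestamp field are assumptions, since the original only specifies sorting by recording time:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str        # "first" or "second"
    record_time: float  # recording time in seconds (assumed available)
    text: str

def build_dialogue_text(first_texts, second_texts):
    """Sort the first and second texts by recording time (step S123),
    yielding dialogue text data with the two sides interleaved."""
    return sorted(first_texts + second_texts, key=lambda u: u.record_time)

dialogue = build_dialogue_text(
    [Utterance("first", 0.0, "Hello, Mr. Wang, I am so-and-so.")],
    [Utterance("second", 2.3, "Hello, what is this about?")],
)
```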
Step S130: recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene.
In some implementations, the scene recognition model stores or has learned several scene recognition rules, based on which it recognizes the call scene corresponding to the dialogue text data.
In some implementations, as shown in FIG. 4, step S130 of recognizing the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene includes step S131.
Step S131: based on a scene rule engine with built-in scene judgment rules, analyze the dialogue text data to obtain the type data of the call scene.
Exemplarily, the scene rule engine is a rule engine with built-in scene judgment rules, such as the Drools rule engine. Rule engines originated from rule-based expert systems, which are a branch of expert systems. Expert systems belong to the field of artificial intelligence; they imitate human reasoning, reason using heuristic methods, and explain and justify their conclusions in terms humans can understand. A rule engine is a core technical component designed to respond to and process complex business rules; by introducing a rule engine, scene judgment rules can be defined and adjusted dynamically and promptly through flexible configuration.
Exemplarily, the scene judgment rules built into the scene rule engine are rules set based on human practical experience, and this embodiment does not limit how the preset scene judgment rules are configured. For example, if the dialogue text data includes "Hello, Mr. Wang, I am so-and-so", the scene recognition model recognizes, based on a certain scene judgment rule, that the type of the call scene corresponding to the dialogue text data is a call between strangers.
The construction of the scene rule engine includes: first, obtaining several scene judgment rules matching a preset rule modification template; then precompiling and testing the scene judgment rules, and generating a script file from them after the tests pass; and finally storing the script file on the server and associating it with the rule invocation interface of the scene rule engine, so that the scene rule engine can invoke the corresponding scene judgment rules.
In some implementations, the rule modification template is a visual rule modification template. Visualizing the template makes it easier for relevant personnel to edit it directly to generate scene judgment rules; personnel who understand how call scenes are judged can thus modify the rules through the template without understanding the implementation behind it, further lowering the threshold for using the rule engine and helping improve the accuracy of the scene rule engine's recognition of call scenes.
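As a rough Python stand-in for such a rule engine (the original uses Drools; the rule format, trigger phrase, and scene label below are illustrative assumptions only):

```python
# Each rule pairs a predicate over the dialogue text with a scene type.
SCENE_RULES = [
    (lambda text: "Hello, Mr." in text and "I am" in text, "stranger call"),
]

def judge_scene(dialogue_text: str, default: str = "unknown") -> str:
    """Return the scene type of the first matching scene judgment rule."""
    for predicate, scene_type in SCENE_RULES:
        if predicate(dialogue_text):
            return scene_type
    return default

assert judge_scene("Hello, Mr. Wang, I am so-and-so.") == "stranger call"
```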
In other implementations, the scene recognition model can be constructed as follows: the scene recognition model is obtained by learning from a scene training sample set through a machine learning algorithm.
As shown in FIG. 5, step S130 of recognizing the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene then includes steps S132 and S133.
Step S132: extract text features from the dialogue text data.
When recognizing the call scene corresponding to the dialogue text data, features must be extracted from the dialogue text data so as to keep only the information valuable for recognition; using all the words would cause the curse of dimensionality.
Exemplarily, feature words are extracted from the dialogue text data and quantified to represent the text information, i.e. the text features of the dialogue text data, thereby scientifically abstracting the dialogue text data and establishing a mathematical model of it to describe and stand in for the dialogue text data.
Exemplarily, text features are extracted from the dialogue text data based on a bag-of-words (BOW) model.
In some implementations, as shown in FIG. 6, step S132 of extracting the text features from the dialogue text data includes steps S1321 and S1322.
Step S1321: filter out noise characters from the dialogue text data according to a preset filtering rule.
Exemplarily, according to a preset stop-word database including several stop words, the stop words in the dialogue text data are deleted or replaced with a preset symbol.
Specifically, certain special words, such as the noise characters "的" and "得" and other invalid words, can be designated as stop words according to the call scene, so as to build a stop-word database that is saved as a configuration file. The server loads the stop-word database when needed.
Specifically, each stop word in the stop-word database is looked up to see whether it appears in the dialogue text data, and if so it is deleted; alternatively, each occurrence of a stop word in the dialogue text data is replaced with a preset symbol, such as a space, so as to preserve the structure of the dialogue text data to a certain extent.
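A minimal sketch of this filtering step (the stop words shown mirror the examples above; loading them from a configuration file is omitted):

```python
STOP_WORDS = {"的", "得"}  # in practice loaded from a configuration file

def filter_noise(dialogue_text: str, replacement: str = " ") -> str:
    """Replace each stop word with a preset symbol (a space here), which
    preserves the structure of the dialogue text to a certain extent;
    pass replacement="" to delete the stop words instead."""
    for stop_word in STOP_WORDS:
        dialogue_text = dialogue_text.replace(stop_word, replacement)
    return dialogue_text
```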
Step S1322: based on the bag-of-words model, extract text features from the dialogue text data with noise characters filtered out.
A bag of words (BOW) is a representation of text that describes the occurrence of word elements in a document. The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It involves two aspects: a collection of known words, and testing for the presence of those known words.
Specifically, the bag-of-words model includes a dictionary containing several words. The model divides the noise-filtered dialogue text data into individual words; imagine putting all the words into a bag, ignoring word order, grammar, syntax, and other elements, and treating the text merely as a collection of words. The occurrence of each word in the dialogue text data is independent and does not depend on whether other words appear. The text features the bag-of-words model extracts from the noise-filtered dialogue text data include a bag-of-words feature vector.
Exemplarily, as shown in FIG. 7, step S1322 of extracting text features from the noise-filtered dialogue text data based on the bag-of-words model includes steps S1301 to S1303.
Step S1301: initialize an all-zero bag-of-words feature vector.
The elements of the bag-of-words feature vector correspond one-to-one with the words in the dictionary of the bag-of-words model.
Exemplarily, given the bag-of-words dictionary {1: "小明", 2: "喜欢", 3: "看", 4: "电影", 5: "也", 6: "踢", 7: "足球"}, the all-zero bag-of-words feature vector is initialized as [0, 0, 0, 0, 0, 0, 0].
Step S1302: count the number of times each word in the dictionary occurs in the dialogue text data from which the noise characters have been filtered out.
Step S1303: assign values to the corresponding elements of the bag-of-words feature vector according to the number of times each word occurs in the dialogue text data.
Exemplarily, if the noise-filtered dialogue text data is "小明喜欢看电影" (Xiao Ming likes watching movies), the bag-of-words feature vector is [1, 1, 1, 1, 0, 0, 0]. If the noise-filtered dialogue text data is "小明喜欢看电影小明也喜欢踢足球" (Xiao Ming likes watching movies and Xiao Ming also likes playing football), the bag-of-words feature vector is [2, 2, 1, 1, 1, 1, 1].
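Steps S1301 to S1303 and the example above can be reproduced in a few lines; the tokenization into dictionary words is assumed to have been done already:

```python
DICTIONARY = ["小明", "喜欢", "看", "电影", "也", "踢", "足球"]

def bag_of_words(tokens):
    """Initialize an all-zero vector (S1301), count dictionary words
    (S1302), and assign the counts to the vector elements (S1303)."""
    vector = [0] * len(DICTIONARY)
    for token in tokens:
        if token in DICTIONARY:
            vector[DICTIONARY.index(token)] += 1
    return vector

assert bag_of_words(["小明", "喜欢", "看", "电影"]) == [1, 1, 1, 1, 0, 0, 0]
assert bag_of_words(["小明", "喜欢", "看", "电影",
                     "小明", "也", "喜欢", "踢", "足球"]) == [2, 2, 1, 1, 1, 1, 1]
```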
Step S133: based on a trained machine learning model, identify the type data of the call scene according to the text features of the dialogue text data.
Specifically, the text features of the dialogue text data are used as the input of the trained machine learning model, and the model's output is the type data of the recognized call scene.
In some implementations, the scene training sample set used to train the machine learning model includes several scene training samples. Each scene training sample includes two kinds of information: historical dialogue text data, and the scene type data corresponding to that historical dialogue text data. Text features can be extracted from the historical dialogue text data, and the scene type data is the annotation of the historical dialogue text data. During model training, the text features corresponding to the historical dialogue text data are used as input data and the scene type data as output data, and a selected machine learning model learns from a scene training sample set containing a large number of scene training samples to obtain the trained machine learning model.
In some implementations, the trained machine learning model may be configured to recognize only a single call scene type, in which case the type data obtained by recognizing the dialogue text data with the pre-built scene recognition model reflects whether the conversation between the first and second callers belongs to a particular call scene. In other implementations, the trained model may be configured to recognize multiple scene types at once, in which case the type data reflects the probabilities that the conversation between the first and second callers belongs to each of several particular call scenes. For example, in one embodiment, the type data obtained by recognizing the dialogue text data with the pre-built scene recognition model gives probabilities of 40% and 43% for the scene types "friend" and "borrowing money" respectively; both exceed the preset threshold of 30%, so the call scene types corresponding to the dialogue text data are "friend" and "borrowing money".
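The multi-scene case reduces to thresholding the model's per-scene probabilities, roughly as follows; the probability values are taken from the example above, while the dictionary interface is an assumption:

```python
def scene_types(probabilities: dict, threshold: float = 0.30) -> list:
    """Return every scene type whose probability exceeds the threshold."""
    return [scene for scene, p in probabilities.items() if p > threshold]

# the example above: both "friend" and "borrowing money" exceed 30%
assert scene_types({"friend": 0.40, "borrowing money": 0.43}) == [
    "friend", "borrowing money"]
```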
Step S140: recognize the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
In some implementations, the server recognizes the first call audio based on the pre-built emotion recognition model to obtain the first caller's emotion data, and recognizes the second call audio based on the pre-built emotion recognition model to obtain the second caller's emotion data.
Exemplarily, the emotion recognition model is obtained by learning from an emotion training sample set through a machine learning algorithm.
The emotion training sample set includes several emotion training samples. Each emotion training sample includes two kinds of information: historical audio data, and the emotion type data corresponding to that historical audio data. Feature data, such as volume features, speech-rate features, fluency features, and pause features, can be extracted from the historical audio data; the emotion type data is the annotation of the historical audio data. During model training, the feature data corresponding to the historical audio data are used as input data and the emotion type data as output data, and a selected machine learning model learns from an emotion training sample set containing several emotion training samples to obtain the emotion recognition model.
In some implementations, the first call audio is first processed to obtain a fluency feature reflecting the smoothness of the first caller's speech, and a pause feature reflecting the duration of pauses. Specifically, the fluency feature is obtained by detecting and evaluating the jitter frequency of the first caller's voice, and the pause feature is obtained by starting a timer when the first and second callers' voices stop. The trained emotion recognition model can recognize the first caller's emotion data based on the fluency feature, pause feature, volume feature, and/or speech-rate feature. Correspondingly, the emotion recognition model can recognize the second call audio to obtain the second caller's emotion data.
Exemplarily, when the volume of the first call audio is higher than a preset threshold, the emotion recognition model identifies the emotion data of the first caller corresponding to the first call terminal as "excited"; when the jitter frequency of the first caller's voice is higher than a preset frequency threshold, the model identifies the first caller's emotion data as "nervous".
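These threshold judgments can be sketched as follows; the concrete threshold values and the default label are illustrative assumptions, since the original only states that the thresholds are preset:

```python
def threshold_emotion(volume: float, jitter_frequency: float,
                      volume_threshold: float = 0.8,
                      jitter_threshold: float = 6.0) -> str:
    """Mirror the two example judgments above: high volume reads as
    'excited', high voice jitter frequency reads as 'nervous'."""
    if volume > volume_threshold:
        return "excited"
    if jitter_frequency > jitter_threshold:
        return "nervous"
    return "calm"  # assumed default when neither threshold is crossed
```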
In some implementations, the emotion recognition model recognizes the dialogue text data to obtain text features, and can also identify the emotion data of the first or second caller from those text features. For example, if the second text in the dialogue text data includes the sentence "You need to calm down, don't get worked up" from the second caller, the emotion recognition model can identify the first caller's emotion as "excited"; if the second text includes the sentence "You ***" from the second caller, the model can identify the second caller's emotion as "excited" or "angry".
In some implementations, as shown in FIG. 8, step S140 of recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal specifically includes steps S141 and S142.
Step S141: recognize the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
Specifically, the volume feature, speech-rate feature, fluency feature, and/or pause feature extracted from the first call audio are fused with the text features extracted from the dialogue text data and used as the input of the emotion recognition model, which then identifies the first caller's emotion data; this further improves the recognition accuracy of the model.
Step S142: recognize the second call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
Specifically, the volume feature, speech-rate feature, fluency feature, and/or pause feature extracted from the second call audio are fused with the text features extracted from the dialogue text data and used as the input of the emotion recognition model, which then identifies the second caller's emotion data; this further improves the recognition accuracy of the model.
Exemplarily, as shown in FIG. 9, step S141 of recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal specifically includes steps S1411 to S1413.
Step S1411: extract at least one of a volume feature, a speech-rate feature, a fluency feature, and a pause feature from the first call audio.
Specifically, the volume feature reflects the amplitude of the first call audio; the speech-rate feature is obtained by calculating the rate of change of the energy envelope of the first call audio in the time domain; the fluency feature is obtained by detecting and evaluating the jitter frequency of the first caller's voice; and the pause feature is obtained by starting a timer when the first and second callers' voices stop.
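A rough sketch of extracting these audio features (the frame size, silence threshold, and exact formulas are illustrative assumptions; the original only names the quantities involved):

```python
import numpy as np

def extract_audio_features(samples: np.ndarray, frame_size: int = 400) -> dict:
    """Volume from amplitude, speech rate from the change rate of the
    time-domain energy envelope, and pause from the share of silent frames."""
    usable = len(samples) // frame_size * frame_size
    frames = samples[:usable].reshape(-1, frame_size)
    energy_envelope = (frames ** 2).mean(axis=1)
    return {
        "volume": float(np.abs(samples).mean()),
        "speech_rate": float(np.abs(np.diff(energy_envelope)).mean()),
        "pause": float((energy_envelope < 1e-4).mean()),  # assumed threshold
    }
```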
Step S1412: extract text features from the dialogue text data.
Specifically, the text features of the dialogue text data extracted in step S132 can be reused.
Step S1413: based on the pre-built emotion recognition model, process the text features together with at least one of the volume feature, speech-rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
Specifically, the text features and the volume, speech-rate, fluency, and pause features are fused, for example by concatenation, and used as the input of the emotion recognition model, which identifies the first caller's emotion data; this further improves the recognition accuracy of the model.
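The fusion by concatenation amounts to, for example (a sketch; the specific feature values shown are placeholders):

```python
import numpy as np

def fuse_features(audio_features: np.ndarray,
                  text_features: np.ndarray) -> np.ndarray:
    """Splice the audio features and text features into one input vector
    for the emotion recognition model."""
    return np.concatenate([audio_features, text_features])

model_input = fuse_features(
    np.array([0.7, 3.2, 0.1]),        # volume, speech rate, pause
    np.array([2, 2, 1, 1, 1, 1, 1]),  # bag-of-words feature vector
)
```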
Here the emotion training sample set includes several emotion training samples, each including historical audio data, corresponding dialogue text data, and corresponding emotion type data. Volume, speech-rate, fluency, and pause features can be extracted from the historical audio data, and text features from the dialogue text data; the emotion type data is the annotation of the historical audio data. During model training, the volume, speech-rate, fluency, and pause features corresponding to the historical audio data, together with the text features, are used as input data and the emotion type data as output data, and a selected machine learning model learns from an emotion training sample set containing several emotion training samples to obtain the emotion recognition model.
Step S150: generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, and send it to the first call terminal.
Exemplarily, if the type of the call scene is a father-son call and the first caller's emotion data is "very agitated", the first prompt information generated and sent to the first call terminal includes "Your emotions are running too high", etc.
Exemplarily, the first prompt information may be provided to the first caller using the first call terminal by display or by sound.
In some embodiments, as shown in FIG. 10, step S150 of generating first prompt information for prompting the first caller to adjust emotions according to the type data of the call scene and the emotion data of the first caller, and sending it to the first call terminal, includes step S151:
Step S151: based on a prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and send the first prompt information to the first call terminal to prompt the first caller to adjust emotions.
Exemplarily, the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine. For example, the prompt rule engine includes a prompt rule: if the type of the call scene is father-son and the first caller's emotion data is "very agitated", generate first prompt information including "Your emotions are running too high".
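A minimal stand-in for such a prompt rule (the lookup-table structure is an assumption; the wording mirrors the example above):

```python
FIRST_PROMPT_RULES = {
    ("father-son call", "very agitated"): "Your emotions are running too high",
}

def first_prompt(scene_type: str, emotion: str):
    """Map (call scene type, first caller's emotion) to the first prompt
    information, or None when no rule matches."""
    return FIRST_PROMPT_RULES.get((scene_type, emotion))
```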
In other embodiments, as shown in FIG. 11, step S150 of generating first prompt information for prompting the first caller to adjust emotions according to the type data of the call scene and the emotion data of the first caller, and sending it to the first call terminal, includes step S152:
Step S152: based on a pre-trained first prompt model, generate, according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust emotions, and send it to the first call terminal.
In some implementations, the first prompt model can be constructed as follows: the first prompt model is obtained by learning from a first prompt training sample set through a machine learning algorithm.
The first prompt training sample set includes several first prompt training samples. Each first prompt training sample includes type data of a historical call scene, historical emotion data corresponding to the first caller, text features corresponding to historical dialogue text data, and the prompt information corresponding to the training sample. The prompt information is the annotation of the training sample; during model training, the type data of the historical call scene, the historical emotion data corresponding to the first caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data, and a selected machine learning model learns from a first prompt training sample set containing first prompt training samples to obtain the first prompt model.
The first prompt model can thus learn conversational phrasing rules from the historical dialogue text data and include suggested phrasing in the prompt information it generates.
Exemplarily, if the type of the call scene is a father-son call and the first caller's emotion data is "very agitated", the first prompt information generated and sent to the first call terminal includes "Your emotions are running too high; try talking about the weather", etc.
Step S160: generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and send it to the first call terminal.
Exemplarily, if the type of the call scene is a mother-child call and the second caller's emotion data is "exhausted", the second prompt information generated and sent to the first call terminal includes "Your mother has been rather tired lately"; if the type of the call scene is a call between lovers and the second caller's emotion data is "acting cute", the second prompt information includes "Your girlfriend is acting cute"; if the type of the call scene is a call between friends and the second caller's emotion data is "angry", the second prompt information includes "Your friend is angry", etc.
Exemplarily, the second prompt information may be provided to the first caller using the first call terminal by display or by sound.
In some embodiments, as shown in FIG. 10, step S160 of generating second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions according to the type data of the call scene and the emotion data of the second caller, and sending it to the first call terminal, includes step S161:
Step S161: based on a prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy to deal with the second caller's emotions.
Exemplarily, the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine. For example, the prompt rule engine includes a prompt rule: if the type of the call scene is a couple and the second caller's emotion data is "acting cute", generate second prompt information including "Your girlfriend is acting cute".
In other embodiments, as shown in FIG. 11, step S160 of generating second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions according to the type data of the call scene and the emotion data of the second caller, and sending it to the first call terminal, includes step S162:
Step S162: based on a pre-trained second prompt model, generate, according to the type data of the call scene, the emotion data of the second caller, and the dialogue text data, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and send it to the first call terminal.
In some implementations, the second prompt model can be constructed as follows: the second prompt model is obtained by learning from a second prompt training sample set through a machine learning algorithm.
The second prompt training sample set includes several second prompt training samples. Each second prompt training sample includes type data of a historical call scene, historical emotion data corresponding to the second caller, text features corresponding to historical dialogue text data, and the prompt information corresponding to the training sample. The prompt information is the annotation of the training sample; during model training, the type data of the historical call scene, the historical emotion data corresponding to the second caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data, and a selected machine learning model learns from a second prompt training sample set containing second prompt training samples to obtain the second prompt model.
The second prompt model can thus learn conversational phrasing rules from the historical dialogue text data and include suggested phrasing in the prompt information it generates.
Exemplarily, if the type of the call scene is a mother-child call and the second caller's emotion data is "exhausted", the second prompt information generated and sent to the first call terminal includes "Your mother has been rather tired lately; ask after her life"; if the type of the call scene is a call between lovers and the second caller's emotion data is "acting cute", the second prompt information includes "Your girlfriend is acting cute; gently call her sweetheart"; if the type of the call scene is a call between friends and the second caller's emotion data is "angry", the second prompt information includes "Your friend is angry; try talking about the weather", etc.
It should be understood that the terms "first" and "second" in the specification and drawings of this application are used to distinguish different objects, or different treatments of the same object, rather than to describe a particular order of objects; they should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
Exemplarily, corresponding prompt information for prompting the second caller to adjust emotions may also be generated according to the type data of the call scene and the emotion data of the second caller and sent to the second call terminal; likewise, corresponding prompt information for prompting the second caller to adjust the dialogue strategy to deal with the first caller's emotions may be generated according to the type data of the call scene and the emotion data of the first caller and sent to the second call terminal.
In some implementations, the first prompt model in step S152 and the second prompt model in step S162 may be combined into one prompt model. Specifically, an identifier indicating the prompt target can be placed in the prompt training samples, so that a prompt model running on the server, for example, can generate the corresponding prompt information, predict the prompt target for that information, and send the prompt information to that target, e.g. to the first call terminal or the second call terminal.
In some embodiments, when the first prompt information for prompting the first caller to adjust emotions is sent to the first call terminal in step S150, the sending of the first call audio corresponding to the first call terminal to the second call terminal is paused, so as to shield the first prompt information from the second caller.
In some embodiments, when the second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions is sent to the first call terminal in step S160, the sending of the first call audio corresponding to the first call terminal to the second call terminal is paused, so as to shield the second prompt information from the second caller.
Specifically, when the server sends the corresponding prompt information to the first call terminal, the first call terminal prompts the first caller by means of a voice prompt; at this point the server can pause collecting the audio acquired by the first call terminal's microphone, i.e. the first call audio, for example by setting the call mode of the first call terminal to mute, thereby stopping the first call audio containing the voice prompt from being sent to the second call terminal, so that the first prompt information and the second prompt information will not be heard by the second caller.
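The shielding logic can be sketched as follows; the terminal and relay objects and their method names are assumptions, since the original only specifies pausing the forwarding while the prompt is played:

```python
def deliver_prompt(prompt_text: str, first_terminal, relay) -> None:
    """Play a voice prompt to the first caller while the forwarding of
    the first call audio to the second call terminal is paused, so the
    second caller cannot hear the prompt."""
    relay.pause_first_to_second()         # e.g. set the call mode to mute
    try:
        first_terminal.play(prompt_text)  # voice prompt to the first caller
    finally:
        relay.resume_first_to_second()    # restore normal forwarding
```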
The voice recognition-based communication service method provided by the above embodiments acquires the corresponding audio during a call between the first call terminal and the second call terminal, obtains the dialogue text through voice recognition, recognizes the call scene from the dialogue text, and recognizes the callers' emotions from the acquired audio; it then gives the callers corresponding prompts according to the call scene and the emotions, thereby injecting timely and accurate intervention while the callers are on a call, so as to guide them toward a better conversation.
Please refer to FIG. 12, which is a schematic structural diagram of a voice recognition-based communication service apparatus provided by an embodiment of the present application. The apparatus may be configured in a server and used to perform the aforementioned voice recognition-based communication service method.
As shown in FIG. 12, the voice recognition-based communication service apparatus includes: an audio acquisition module 110, a voice recognition module 120, a scene recognition module 130, an emotion recognition module 140, a first prompt module 150, and a second prompt module 160.
The audio acquisition module 110 is configured to acquire, if the call between the first call terminal and the second call terminal is connected, the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal.
The voice recognition module 120 is configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data.
Specifically, as shown in FIG. 13, the voice recognition module 120 includes:
a first voice sub-module 121, configured to perform voice recognition on the first call audio to obtain the first text corresponding to the first caller;
a second voice sub-module 122, configured to perform voice recognition on the second call audio to obtain the second text corresponding to the second caller; and
a text sorting sub-module 123, configured to sort the first text and the second text according to a preset sorting rule to obtain the dialogue text data; one possible sorting rule is sketched below.
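The application does not fix the preset sorting rule; one plausible choice, given here as an assumption, is to interleave the two callers' recognized utterances by their start timestamps.

```python
# Each utterance is a (start_seconds, text) pair produced by the
# per-channel voice recognition for that caller.
def merge_dialogue(first_texts, second_texts):
    tagged = [(t, "caller 1", s) for t, s in first_texts] + \
             [(t, "caller 2", s) for t, s in second_texts]
    # Sorting by start time yields the dialogue text data in speaking order.
    return "\n".join(f"{who}: {text}" for t, who, text in sorted(tagged))

# Example:
# merge_dialogue([(0.0, "Hello, how can I help?")], [(1.2, "My order is late.")])
# -> "caller 1: Hello, how can I help?\ncaller 2: My order is late."
```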
The scene recognition module 130 is configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene.
In some embodiments, as shown in FIG. 13, the scene recognition module 130 includes:
a scene rule sub-module 131, configured to analyze the dialogue text data, based on a scene rule engine with built-in scene judgment rules, to obtain the type data of the call scene; an illustrative sketch follows.
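A sketch of this rule-engine branch, reduced for illustration to keyword matching over the dialogue text; the keywords and scene names below are invented, not rules from this application.

```python
# Built-in scene judgment rules: any matching keyword assigns the scene type.
SCENE_RULES = [
    ({"refund", "broken", "complain"}, "complaint"),
    ({"bill", "charge", "amount"}, "billing_inquiry"),
]

def scene_type(dialogue_text):
    words = set(dialogue_text.lower().split())
    for keywords, scene in SCENE_RULES:
        if words & keywords:      # a rule keyword appears in the dialogue
            return scene
    return "general"              # fallback when no rule fires
```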
In other embodiments, as shown in FIG. 13, the scene recognition module 130 includes:
a feature extraction sub-module 132, configured to extract text features from the dialogue text data; and
a scene recognition sub-module 133, configured to recognize the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model, as sketched below.
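A minimal sketch of this machine-learning branch, with TF-IDF text features and a linear classifier standing in for the trained scene recognition model; the training dialogues and scene labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

dialogues = [
    "my order has not arrived and no one is answering me",
    "I would like to check the amount of this month's bill",
]
scene_labels = ["complaint", "billing_inquiry"]

# Text features are extracted by the vectorizer; the classifier maps them
# to the type data of the call scene.
scene_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scene_model.fit(dialogues, scene_labels)

print(scene_model.predict(["the bill this month looks wrong"]))
```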
The emotion recognition module 140 is configured to recognize the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
Specifically, as shown in FIG. 13, the emotion recognition module 140 includes:
a first emotion recognition sub-module 141, configured to recognize the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
Exemplarily, the first emotion recognition sub-module 141 includes:
an audio feature extraction sub-module, configured to extract at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
a text feature extraction sub-module, configured to extract text features from the dialogue text data; and
an emotion data acquisition sub-module, configured to process, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal. A sketch of the acoustic side of this feature extraction follows.
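A sketch under stated assumptions: frame energy stands in for the volume feature and the silent/non-silent ratio for the pause feature. librosa is used for illustration; the top_db threshold and 16 kHz rate are invented, and the speech rate and fluency features (which need word timings from the recognizer) are omitted.

```python
import numpy as np
import librosa

def acoustic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    volume = float(np.mean(librosa.feature.rms(y=y)))    # average frame energy
    voiced = librosa.effects.split(y, top_db=30)         # non-silent intervals
    speech_seconds = sum(end - start for start, end in voiced) / sr
    pause_ratio = 1.0 - speech_seconds / (len(y) / sr)   # fraction of silence
    # These values would be concatenated with text features before being
    # fed to the emotion recognition model.
    return np.array([volume, pause_ratio])
```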
The second emotion recognition sub-module 142 is configured to recognize the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
The first prompt module 150 is configured to generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and to send it to the first call terminal.
In some embodiments, as shown in FIG. 13, the first prompt module 150 includes:
a first prompt rule sub-module 151, configured to analyze, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and to send the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; a sketch of such a rule engine follows.
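A sketch of the prompt rule engine, with built-in rules mapping a (scene type, emotion) pair to canned first prompt information; the rule table entries are invented examples, not rules from this application.

```python
PROMPT_RULES = {
    ("complaint", "angry"): "Take a breath and lower your speaking volume.",
    ("sales_call", "impatient"): "Slow down and let the customer finish speaking.",
}

def first_prompt_info(scene_type, emotion):
    # Returns None when no built-in rule matches, i.e. no prompt is needed.
    return PROMPT_RULES.get((scene_type, emotion))
```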
In other embodiments, as shown in FIG. 13, the first prompt module 150 includes:
a first prompt generation sub-module 152, configured to generate, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust his or her emotions, and to send it to the first call terminal.
The second prompt module 160 is configured to generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and to send it to the first call terminal.
In some embodiments, as shown in FIG. 13, the second prompt module 160 includes:
a second prompt rule sub-module 161, configured to analyze, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and to send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy in response to the second caller's emotions.
In other embodiments, as shown in FIG. 13, the second prompt module 160 includes:
a second prompt generation sub-module 162, configured to generate, based on a pre-trained second prompt model and according to the type data of the call scene, the emotion data of the second caller, and the dialogue text data, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and to send it to the first call terminal.
It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working processes of the apparatus, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The method and apparatus of the present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Exemplarily, the above method and apparatus may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 14.
Please refer to FIG. 14, which is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.
Referring to FIG. 14, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to perform any of the voice recognition-based communication service methods.
The processor is configured to provide computing and control capabilities and to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, the processor performs any of the voice recognition-based communication service methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the illustrated structure is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run the computer program stored in the memory to implement the following steps:
if the call between the first call terminal and the second call terminal is connected, acquiring the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal;
performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
recognizing the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene;
recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal;
generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal;
generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
Specifically, when performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data, the processor specifically implements: performing voice recognition on the first call audio to obtain the first text corresponding to the first caller; performing voice recognition on the second call audio to obtain the second text corresponding to the second caller; and sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
Specifically, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor specifically implements: analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene.
Alternatively, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor specifically implements: extracting text features from the dialogue text data; and recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
Specifically, when recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, the processor specifically implements: recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
Specifically, when recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the processor specifically implements: extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio; extracting text features from the dialogue text data; and processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
Specifically, when generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal, the processor specifically implements: analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or alternatively: generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
Specifically, when sending the first prompt information or the second prompt information to the first call terminal, the processor further implements: suspending the sending of the first call audio corresponding to the first call terminal to the second call terminal, so as to shield the first prompt information or the second prompt information from the second caller.
From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. Such a computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application or in certain parts thereof, for example:
a computer-readable storage medium storing a computer program, the computer program including program instructions, where a processor executes the program instructions to implement any of the voice recognition-based communication service methods provided by the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Within the technical scope disclosed in the present application, any person skilled in the art can readily conceive of various equivalent modifications or replacements, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voice recognition-based communication service method, comprising:
    if the call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal; and
    generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
  2. The communication service method according to claim 1, wherein performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data comprises:
    performing voice recognition on the first call audio to obtain a first text corresponding to the first caller;
    performing voice recognition on the second call audio to obtain a second text corresponding to the second caller; and
    sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  3. The communication service method according to claim 1, wherein recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene comprises:
    analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene; or
    recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene by:
    extracting text features from the dialogue text data; and
    recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
  4. The communication service method according to claim 1, wherein recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal comprises:
    recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and
    recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
  5. The communication service method according to claim 4, wherein recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal comprises:
    extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
    extracting text features from the dialogue text data; and
    processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  6. The communication service method according to claim 1, wherein generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal comprises:
    analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or
    generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
  7. The communication service method according to claim 1, wherein, when the first prompt information or the second prompt information is sent to the first call terminal, the sending of the first call audio corresponding to the first call terminal to the second call terminal is suspended so as to shield the first prompt information or the second prompt information from the second caller.
  8. A voice recognition-based communication service apparatus, comprising:
    an audio acquisition module, configured to acquire, if the call between a first call terminal and a second call terminal is connected, first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    a voice recognition module, configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    a scene recognition module, configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    an emotion recognition module, configured to recognize the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    a first prompt module, configured to generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and to send it to the first call terminal; and
    a second prompt module, configured to generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and to send it to the first call terminal.
  9. A computer device, comprising a memory and a processor;
    the memory being configured to store a computer program; and
    the processor being configured to execute the computer program and, when executing the computer program, to implement the following steps:
    if the call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal; and
    generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
  10. The computer device according to claim 9, wherein, when performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data, the processor is configured to implement the following steps:
    performing voice recognition on the first call audio to obtain a first text corresponding to the first caller;
    performing voice recognition on the second call audio to obtain a second text corresponding to the second caller; and
    sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  11. The computer device according to claim 9, wherein, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor is configured to implement the following steps:
    analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene; or
    recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene by:
    extracting text features from the dialogue text data; and
    recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
  12. The computer device according to claim 9, wherein, when recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, the processor is configured to implement the following steps:
    recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and
    recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
  13. The computer device according to claim 12, wherein, when recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the processor is configured to implement the following steps:
    extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
    extracting text features from the dialogue text data; and
    processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  14. The computer device according to claim 9, wherein, when generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal, the processor is configured to implement the following steps:
    analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or
    generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
  15. The computer device according to claim 9, wherein, when sending the first prompt information or the second prompt information to the first call terminal, the processor is configured to implement the following step: suspending the sending of the first call audio corresponding to the first call terminal to the second call terminal so as to shield the first prompt information or the second prompt information from the second caller.
  16. A computer-readable storage medium storing a computer program, wherein, if the computer program is executed by a processor, the following steps are implemented:
    if the call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal; and
    generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
  17. The storage medium according to claim 16, wherein, when performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data, the processor is configured to implement the following steps:
    performing voice recognition on the first call audio to obtain a first text corresponding to the first caller;
    performing voice recognition on the second call audio to obtain a second text corresponding to the second caller; and
    sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  18. The storage medium according to claim 16, wherein, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor is configured to implement the following steps:
    analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene; or
    recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene by:
    extracting text features from the dialogue text data; and
    recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
  19. The storage medium according to claim 16, wherein, when recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, the processor is configured to implement the following steps:
    recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and
    recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal;
    wherein, when recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the processor is configured to implement the following steps:
    extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
    extracting text features from the dialogue text data; and
    processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  20. The storage medium according to claim 16, wherein, when generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal, the processor is configured to implement the following steps:
    analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or
    generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
PCT/CN2019/122167 2019-06-17 2019-11-29 Voice recognition-based communication service method, apparatus, computer device, and storage medium WO2020253128A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201910523567 2019-06-17
CN201910523567.X 2019-06-17
CN201910605732.6 2019-07-05
CN201910605732.6A CN110444229A (en) 2019-06-17 2019-07-05 Communication service method, device, computer equipment and storage medium based on speech recognition

Publications (1)

Publication Number Publication Date
WO2020253128A1 true WO2020253128A1 (en) 2020-12-24

Family

ID=68429455

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122167 WO2020253128A1 (en) 2019-06-17 2019-11-29 Voice recognition-based communication service method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110444229A (en)
WO (1) WO2020253128A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444229A (en) * 2019-06-17 2019-11-12 深圳壹账通智能科技有限公司 Communication service method, device, computer equipment and storage medium based on speech recognition
CN111309715B (en) * 2020-01-15 2023-04-18 腾讯科技(深圳)有限公司 Call scene identification method and device
CN113316041B (en) * 2020-02-27 2023-08-01 阿里巴巴集团控股有限公司 Remote health detection system, method, device and equipment
CN111580773B (en) * 2020-04-15 2023-11-14 北京小米松果电子有限公司 Information processing method, device and storage medium
CN112995422A (en) * 2021-02-07 2021-06-18 成都薯片科技有限公司 Call control method and device, electronic equipment and storage medium
CN113037610B (en) * 2021-02-25 2022-08-19 腾讯科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN115204127B (en) * 2022-09-19 2023-01-06 深圳市北科瑞声科技股份有限公司 Form filling method, device, equipment and medium based on remote flow adjustment
CN116682414B (en) * 2023-06-06 2024-01-30 安徽迪科数金科技有限公司 Dialect voice recognition system based on big data
CN116631451B (en) * 2023-06-25 2024-02-06 安徽迪科数金科技有限公司 Voice emotion recognition system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270922A1 (en) * 2015-11-18 2017-09-21 Shenzhen Skyworth-Rgb Electronic Co., Ltd. Smart home control method based on emotion recognition and the system thereof
CN108536802A (en) * 2018-03-30 2018-09-14 百度在线网络技术(北京)有限公司 Exchange method based on children's mood and device
CN108962219A (en) * 2018-06-29 2018-12-07 百度在线网络技术(北京)有限公司 Method and apparatus for handling text
CN109587360A (en) * 2018-11-12 2019-04-05 平安科技(深圳)有限公司 Electronic device should talk with art recommended method and computer readable storage medium
CN110444229A (en) * 2019-06-17 2019-11-12 深圳壹账通智能科技有限公司 Communication service method, device, computer equipment and storage medium based on speech recognition

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105991849B (en) * 2015-02-13 2019-03-01 华为技术有限公司 One kind is attended a banquet method of servicing, apparatus and system
US10158758B2 (en) * 2016-11-02 2018-12-18 International Business Machines Corporation System and method for monitoring and visualizing emotions in call center dialogs at call centers
CN107423364B (en) * 2017-06-22 2024-01-26 百度在线网络技术(北京)有限公司 Method, device and storage medium for answering operation broadcasting based on artificial intelligence
CN108922564B (en) * 2018-06-29 2021-05-07 北京百度网讯科技有限公司 Emotion recognition method and device, computer equipment and storage medium
CN108962255B (en) * 2018-06-29 2020-12-08 北京百度网讯科技有限公司 Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN109767791B (en) * 2019-03-21 2021-03-30 中国—东盟信息港股份有限公司 Voice emotion recognition and application system for call center calls


Also Published As

Publication number Publication date
CN110444229A (en) 2019-11-12


Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19933410; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: PCT application non-entry in European phase (Ref document number: 19933410; Country of ref document: EP; Kind code of ref document: A1)

32PN Ep: public notification in the EP bulletin as the address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.08.2022))

122 Ep: PCT application non-entry in European phase (Ref document number: 19933410; Country of ref document: EP; Kind code of ref document: A1)