WO2020253128A1 - Voice recognition-based communication service method, apparatus, computer device, and storage medium - Google Patents

Voice recognition-based communication service method, apparatus, computer device, and storage medium

Info

Publication number
WO2020253128A1
Authority
WO
WIPO (PCT)
Prior art keywords
call
data
caller
scene
audio
Prior art date
Application number
PCT/CN2019/122167
Other languages
French (fr)
Chinese (zh)
Inventor
杨一凡
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2020253128A1 publication Critical patent/WO2020253128A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M1/72454 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M1/00 - Substation equipment, e.g. for use by subscribers
    • H04M1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484 - User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M2250/00 - Details of telephonic subscriber devices
    • H04M2250/12 - Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion

Definitions

  • This application relates to the field of data analysis technology, and in particular to a communication service method, device, computer equipment and storage medium based on voice recognition.
  • the embodiments of the present application provide a communication service method, device, computer equipment, and storage medium based on voice recognition, which can inject intervention in a timely and accurate manner while callers are on a call, so as to guide the callers toward a better conversation.
  • this application provides a communication service method based on voice recognition, the method including:
  • According to the type data of the call scene and the emotion data of the second caller, second prompt information is generated and sent to the first call terminal for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions.
  • the present application provides a communication service device based on voice recognition, the device including:
  • An audio acquisition module configured to obtain a first call audio corresponding to the first call terminal and a second call audio corresponding to the second call terminal if the call between the first call terminal and the second call terminal is connected;
  • a voice recognition module configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data
  • a scene recognition module configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene
  • the emotion recognition module is configured to recognize at least one of the first call audio, the second call audio, and the dialogue text data based on a pre-built emotion recognition model, to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal;
  • the first prompting module is configured to generate and send, to the first call terminal according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions;
  • the second prompting module is configured to generate and send, to the first call terminal according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the present application provides a computer device that includes a memory and a processor; the memory is used to store a computer program, and the processor is used to execute the computer program and, when executing it, to implement the above-mentioned communication service method based on voice recognition.
  • this application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the above-mentioned voice recognition-based communication service method is implemented.
  • This application discloses a communication service method, device, equipment and storage medium based on voice recognition.
  • the corresponding audio is obtained during a call between a first call terminal and a second call terminal; the dialogue text is then obtained through voice recognition, the call scene is recognized based on the dialogue text, and the callers' emotions are recognized from the acquired audio; prompts are then given to the caller according to the call scene and the emotions, so that intervention is injected in a timely and accurate manner during the call to guide the caller toward a better conversation.
  • FIG. 1 is a schematic diagram of a usage scenario of a communication service method based on voice recognition according to an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a communication service method based on voice recognition according to an embodiment of this application;
  • FIG. 3 is a schematic diagram of a sub-process of obtaining dialogue text data through voice recognition;
  • FIG. 4 is a schematic flowchart of a communication service method based on voice recognition according to another embodiment of this application;
  • FIG. 5 is a schematic diagram of a sub-process of obtaining type data of a call scene;
  • FIG. 6 is a schematic diagram of a sub-process of extracting text features;
  • FIG. 7 is a schematic diagram of a sub-process of extracting text features based on a bag-of-words model;
  • FIG. 8 is a schematic diagram of a sub-process of obtaining emotion data of the first caller;
  • FIG. 9 is a schematic diagram of a sub-process in which the emotion recognition model recognizes and acquires emotion data;
  • FIG. 10 is a schematic flowchart of a communication service method based on voice recognition according to still another embodiment of this application;
  • FIG. 11 is a schematic flowchart of a communication service method based on voice recognition according to yet another embodiment of this application;
  • FIG. 12 is a schematic structural diagram of a communication service device based on voice recognition provided by an embodiment of this application;
  • FIG. 13 is a schematic structural diagram of a communication service device based on voice recognition provided by another embodiment of this application;
  • FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of this application.
  • the embodiments of the present application provide a voice recognition-based communication service method, device, computer equipment, and computer-readable storage medium.
  • the communication service method can be applied to a terminal or a server, so as to intervene in the communication between the callers when needed.
  • In some embodiments, the first call terminal and the second call terminal conduct a call, and the communication service method based on voice recognition is applied to at least one of the first call terminal and the second call terminal.
  • In other embodiments, the first call terminal and the second call terminal conduct a call, the server provides support for the call between them, and the voice recognition-based communication service method is applied to the server.
  • FIG. 1 is a schematic diagram of an application scenario of a communication service method based on voice recognition provided by an embodiment of the present application.
  • the application scenario includes a server, a first call terminal, and a second call terminal.
  • the call terminal can be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable device, a smart speaker, and other electronic devices;
  • the server can be an independent server or a server cluster.
  • FIG. 2 is a schematic flowchart of a communication service method based on voice recognition provided by an embodiment of the present application.
  • the communication service method based on voice recognition includes the following steps S110 to S160.
  • Step S110 If the call between the first call terminal and the second call terminal is connected, obtain the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal.
  • Specifically, the first caller uses the first call terminal to make a call to the second caller, and the second caller uses the second call terminal to answer; the call between the first call terminal and the second call terminal is then connected.
  • When the call between the first call terminal and the second call terminal is connected and the first caller is talking with the second caller, the server provides support for the call between the two terminals.
  • The server collects the audio of the first caller, that is, the first call audio corresponding to the first call terminal, and sends it to the second call terminal so that the speaker of the second call terminal plays it for the second caller; the server likewise collects the audio of the second caller, that is, the second call audio corresponding to the second call terminal, and sends it to the first call terminal so that the speaker of the first call terminal plays it for the first caller.
  • Step S120 Perform voice recognition on the first call audio and the second call audio to obtain dialog text data.
  • the server converts the first call audio and the second call audio into text by means of voice recognition to obtain dialog text data.
  • step S120 performs voice recognition on the first call audio and the second call audio to obtain dialog text data, which specifically includes step S121 to step S123.
  • Step S121 Perform voice recognition on the first call audio to obtain a first text corresponding to the first caller.
  • When collecting the first call audio corresponding to the first call terminal, the server performs voice recognition on the collected first call audio and marks the recognized text as the first text.
  • Step S122 Perform voice recognition on the second call audio to obtain a second text corresponding to the second caller.
  • When collecting the second call audio corresponding to the second call terminal, the server performs voice recognition on the collected second call audio and marks the recognized text as the second text.
  • Step S123 Sort the first text and the second text according to a preset sorting rule to obtain dialogue text data.
  • Specifically, the first texts and the second texts are sorted according to the preset sorting rule to obtain the dialogue text data.
  • The dialogue text data then includes a plurality of first texts and second texts arranged alternately; one possible sorting rule is sketched below.
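A minimal sketch, in Python, of one possible sorting rule: each recognized utterance carries a start timestamp, and the two transcripts are interleaved by time. The Utterance type, its fields, and the speaker labels are assumptions for the example; the patent does not fix the sorting rule.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str       # "first_caller" or "second_caller" (assumed labels)
    start_time: float  # seconds since the call was connected (assumed field)
    text: str

def merge_dialogue(first_texts, second_texts):
    # Interleave the two recognized transcripts by start time so the
    # dialogue text data alternates between the two callers.
    return sorted(first_texts + second_texts, key=lambda u: u.start_time)

dialogue = merge_dialogue(
    [Utterance("first_caller", 0.0, "Hello, Mr. Wang, I am XX")],
    [Utterance("second_caller", 2.5, "Hello, who is this?")],
)
```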
  • Step S130 Recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene.
  • the scene recognition model stores or learns several scene recognition rules, and the scene recognition model recognizes the call scene corresponding to the dialogue text data based on the scene recognition rules.
  • step S130 recognizes the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene, including step S131.
  • Step S131 Based on the scene rule engine with built-in scene judgment rules, analyze the conversation text data to obtain type data of the call scene.
  • the scene rule engine is a rule engine with built-in scene judgment rules, such as the Drools rule engine.
  • Rule engines originated from rule-based expert systems, which are a branch of expert systems. Expert systems belong to the field of artificial intelligence: they imitate human reasoning, reason with tentative (heuristic) methods, and explain and justify their conclusions in human-understandable terms.
  • the rule engine is a core technical component designed to respond to and process complex business rules. By introducing the rule engine, it is possible to dynamically define and adjust scene judgment rules in a timely manner through flexible configuration.
  • The built-in scene judgment rules of the scene rule engine are rule sets based on practical experience, and this embodiment does not limit how the scene judgment rules are set. For example, if the dialogue text data includes "Hello, Mr. Wang, I am XX", the scene recognition model recognizes the type of the call scene corresponding to the dialogue text data as a stranger call based on a matching scene judgment rule (a simplified sketch follows).
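As a rough illustration only, the stand-in below expresses scene judgment rules as plain Python predicates; a production system would use a real rule engine such as Drools, and the second rule is an invented example.

```python
# Simplified stand-in for the scene rule engine; not a real rule engine.
SCENE_RULES = [
    # Mirrors the example above: a formal greeting suggests a stranger call.
    (lambda text: "Hello, Mr." in text and "I am" in text, "stranger call"),
    # Hypothetical extra rule, purely for illustration.
    (lambda text: "Mom" in text or "Dad" in text, "family call"),
]

def judge_scene(dialogue_text: str) -> str:
    for condition, scene_type in SCENE_RULES:
        if condition(dialogue_text):
            return scene_type
    return "unknown"

print(judge_scene("Hello, Mr. Wang, I am XX"))  # -> "stranger call"
```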
  • the construction of the scene rule engine includes: first, obtaining a number of scene judgment rules matching a preset rule modification template; then precompiling and testing the scene judgment rules and, after the tests pass, generating a script file from them; and finally storing the script file on the server and associating it with the rule calling interface of the scene rule engine, so that the engine can call the corresponding scene judgment rules.
  • the rule modification template is a visual rule modification template.
  • By visualizing the rule modification template, relevant personnel can edit the template directly to generate scene judgment rules; personnel who understand the call scene judgment rules can thus modify them through the template without knowing the implementation behind it.
  • This further lowers the threshold for using the rule engine and helps improve the accuracy of the scene rule engine's recognition of call scenes.
  • the scene recognition model may be constructed in the following manner: the scene recognition model is obtained by learning from a set of scene training samples through a machine learning algorithm.
  • step S130 recognizes the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, including step S132 and step S133.
  • Step S132 Extract text features in the dialogue text data.
  • Specifically, feature words are extracted from the dialogue text data and quantified to represent the text information, that is, the text features of the dialogue text data; this abstracts the dialogue text data into a mathematical model that describes and stands in for it.
  • In some embodiments, text features are extracted from the dialogue text data based on a bag-of-words (BOW) model.
  • step S132 extracts the text features in the dialogue text data, including step S1321 and step S1322.
  • Step S1321 filter out noisy characters in the dialogue text data according to a preset filtering rule.
  • the stop words in the dialogue text data are deleted or replaced with preset symbols.
  • Specifically, filler particles and other noise characters and invalid words can be designated as stop words according to the call scene, so as to build a stop word database that is saved in the form of a configuration file.
  • the server calls the stop word database when needed.
  • Each stop word in the stop word database is looked up in the dialogue text data; if it appears, it is deleted from the dialogue text data. Alternatively, each stop word found in the dialogue text data is replaced with a preset symbol, such as a space, to preserve the structure of the dialogue text data to some extent (a sketch follows).
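A minimal sketch of this filtering step, assuming the stop word database has already been loaded from its configuration file (the entries below are placeholders):

```python
STOP_WORDS = {"uh", "um", "you know"}  # placeholder entries; real ones come from the config file

def filter_noise(dialogue_text: str, replace_with_space: bool = True) -> str:
    # Either delete each stop word, or replace it with a space to
    # preserve the structure of the dialogue text to some extent.
    for word in STOP_WORDS:
        dialogue_text = dialogue_text.replace(word, " " if replace_with_space else "")
    return dialogue_text
```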
  • Step S1322 based on the bag-of-words model, extract text features from the dialogue text data with noise characters filtered out.
  • A bag-of-words (BOW) model is a representation of text that describes the occurrence of words within a document.
  • The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It involves two aspects: a vocabulary of known words, and a measure of the presence of those known words.
  • the bag-of-words model includes a dictionary, and the dictionary includes several words.
  • The bag-of-words model splits the noise-filtered dialogue text data into words and, as if putting all of the words into a bag, ignores word order, grammar, syntax, and other elements, treating the text as just a collection of words; each word's appearance in the dialogue text data is independent of whether other words appear.
  • The text features that the bag-of-words model extracts from the noise-filtered dialogue text data include bag-of-words feature vectors.
  • step S1322 is based on the bag-of-words model to extract text features from the dialogue text data with noise characters filtered out, including steps S1301-step S1303.
  • Step S1301 initialize the all-zero bag-of-words feature vector.
  • the elements in the bag-of-words feature vector correspond one-to-one with words in the dictionary of the bag-of-words model.
  • Step S1302 Count the number of occurrences of each word in the dictionary in the dialogue text data from which the noise character is filtered out.
  • Step S1303 Assign a value to the corresponding element in the bag of words feature vector according to the number of times the word appears in the dialogue text data.
  • For example, suppose the dictionary of the bag-of-words model contains the words {Xiao Ming, likes, watching, movies, also, playing, football}. If the dialogue text data with noise characters removed is "Xiao Ming likes watching movies", the bag-of-words feature vector is [1, 1, 1, 1, 0, 0, 0]; if it is "Xiao Ming likes watching movies and Xiao Ming also likes playing football", the bag-of-words feature vector is [2, 2, 1, 1, 1, 1, 1]. The sketch below reproduces this counting.
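A minimal sketch of steps S1301 to S1303, using the dictionary implied by the example above (the dictionary order is an assumption made to match the vectors):

```python
DICTIONARY = ["Xiao Ming", "likes", "watching", "movies", "also", "playing", "football"]

def bow_vector(text: str) -> list:
    vector = [0] * len(DICTIONARY)       # S1301: initialize an all-zero vector
    for i, word in enumerate(DICTIONARY):
        vector[i] = text.count(word)     # S1302/S1303: count occurrences and assign
    return vector

print(bow_vector("Xiao Ming likes watching movies"))
# [1, 1, 1, 1, 0, 0, 0]
print(bow_vector("Xiao Ming likes watching movies and Xiao Ming also likes playing football"))
# [2, 2, 1, 1, 1, 1, 1]
```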
  • Step S133 Based on the trained machine learning model, the type data of the call scene is identified according to the text features in the dialogue text data.
  • the text features in the dialogue text data are used as the input of the trained machine learning model, and the output of the machine learning model is used as the type data of the identified call scene.
  • the scene training sample set used to train the machine learning model includes several scene training samples.
  • the scene training sample includes historical dialogue text data and scene type data corresponding to the historical dialogue text data. Text features can be extracted from historical dialogue text data.
  • the scene type data is the annotation data of the historical dialogue text data.
  • During model training, the text features corresponding to the historical dialogue text data are used as input data and the scene type data as output data; a selected machine learning model learns from a scene training sample set containing a large number of scene training samples to obtain the trained machine learning model.
  • The trained machine learning model can be set as a model that recognizes only a single type of call scene; in that case, the type data of the call scene obtained by recognizing the dialogue text data with the pre-built scene recognition model reflects whether the first caller and the second caller are in that specific call scene.
  • The trained machine learning model can also be set as a model that recognizes multiple types of call scenes; in that case, the type data of the call scene obtained by recognizing the dialogue text data with the pre-built scene recognition model reflects the probability that the first caller and the second caller are in each of several specific call scenes. A training sketch follows.
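A minimal training sketch under these descriptions, assuming scikit-learn and toy data; the patent does not name a model family, and logistic regression is used here only because it yields both a single label (via predict) and per-scene probabilities (via predict_proba).

```python
from sklearn.linear_model import LogisticRegression

# Toy bag-of-words features of historical dialogue texts (input data)
X_train = [[1, 1, 1, 1, 0, 0, 0],
           [0, 0, 0, 0, 1, 1, 1]]
# Annotated scene type data (output data)
y_train = ["stranger call", "family call"]

scene_model = LogisticRegression().fit(X_train, y_train)
scene_type = scene_model.predict([[1, 1, 1, 1, 0, 0, 0]])[0]      # single scene label
scene_probs = scene_model.predict_proba([[1, 1, 1, 1, 0, 0, 0]])  # probabilities per scene
```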
  • Step S140 Recognize the first call audio and the second call audio based on the pre-built emotion recognition model, to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
  • Specifically, the server recognizes the first call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller, and recognizes the second call audio based on the same model to obtain the emotion data of the second caller.
  • a machine learning algorithm is used to obtain the emotion recognition model from a set of emotion training samples.
  • the emotion training sample set includes several emotion training samples.
  • the emotion training sample includes historical audio data and emotion type data corresponding to the historical audio data.
  • From the historical audio data, feature data such as volume features, speech rate features, smooth features, and pause features can be extracted.
  • The emotion type data is the annotation data of the historical audio data and is used during model training.
  • During training, the feature data corresponding to the historical audio data is used as input data and the emotion type data as output data; a selected machine learning model learns from the emotion training sample set, which includes several emotion training samples, to obtain the emotion recognition model.
  • In some embodiments, the first call audio is first processed to obtain a smooth feature reflecting the smoothness of the first caller's voice and a pause feature reflecting the duration of pauses. Specifically, the smooth feature is identified by detecting and evaluating the jitter frequency of the first caller's voice, and the pause feature is obtained by starting a timer when the voices of the first caller and the second caller stop.
  • the trained emotion recognition model can recognize the emotion data of the first caller based on smooth features, pause features, volume features, and/or speech rate features.
  • the emotion recognition model can recognize the second call audio to obtain the emotion data of the second caller.
  • For example, the emotion recognition model recognizes the emotion data of the first caller corresponding to the first call terminal as "excited"; or the emotion recognition model recognizes the emotion data of the first caller corresponding to the first call terminal as "nervous".
  • In some embodiments, the emotion recognition model recognizes the dialogue text data to obtain text features, and can also identify the emotion data of the first caller or the second caller based on those text features. For example, if the second text in the dialogue text data includes the sentence "You need to be calm and not excited" spoken by the second caller, the emotion recognition model can recognize the emotion of the first caller as "excited"; if the second text includes the sentence "you this **" spoken by the second caller, the model can identify the emotion of the second caller as "excited" or "angry".
  • In some embodiments, step S140, recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, specifically includes step S141 and step S142.
  • Step S141 Recognizing the first call audio and dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
  • Specifically, the volume feature, speech rate feature, smooth feature, and/or pause feature extracted from the first call audio and the text feature extracted from the dialogue text data are merged as the input of the emotion recognition model, and the emotion recognition model recognizes them to obtain the emotion data of the first caller; this further improves the accuracy of model recognition.
  • Step S142 Recognizing the second call audio and dialogue text data based on the pre-built emotion recognition model to obtain the second caller's emotion data corresponding to the second call terminal.
  • Specifically, the volume feature, speech rate feature, smooth feature, and/or pause feature extracted from the second call audio and the text feature extracted from the dialogue text data are merged as the input of the emotion recognition model, and the emotion recognition model recognizes them to obtain the emotion data of the second caller; this further improves the accuracy of model recognition.
  • In some embodiments, step S141, recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, specifically includes steps S1411 to S1413.
  • Step S1411 extract at least one of a volume feature, a speech rate feature, a smooth feature, and a pause feature from the first call audio.
  • The volume feature reflects the amplitude of the first call audio; the speech rate feature is obtained by calculating the rate of change of the energy envelope of the first call audio in the time domain; the smooth feature is obtained by detecting and evaluating the jitter frequency of the first caller's voice; and the pause feature is obtained by starting a timer when the voices of the first caller and the second caller stop. A sketch of two of these features follows.
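As one possible reading of these feature definitions, the sketch below computes a volume feature (RMS amplitude) and a pause feature with NumPy; the frame length and silence threshold are assumptions, and counting low-energy frames stands in for the literal timer described above.

```python
import numpy as np

def volume_feature(samples: np.ndarray) -> float:
    # Volume as the root-mean-square amplitude of the call audio.
    return float(np.sqrt(np.mean(samples ** 2)))

def pause_feature(samples: np.ndarray, sample_rate: int,
                  frame_ms: int = 20, threshold: float = 0.01) -> float:
    # Approximate total pause duration (seconds) by counting low-energy frames.
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    silent = np.sqrt(np.mean(frames ** 2, axis=1)) < threshold
    return float(silent.sum() * frame_ms / 1000.0)
```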
  • Step S1412 extract text features from the dialogue text data.
  • the text features of the dialog text data extracted in step S132 can be reused.
  • Step S1413 Based on the pre-built emotion recognition model, process the text feature and at least one of the volume feature, speech rate feature, smooth feature, and pause feature, to obtain the emotion data of the first caller corresponding to the first call terminal. This embodiment does not limit the specific form of the pre-built emotion recognition model.
  • Specifically, the text feature and the volume feature, speech rate feature, smooth feature, and pause feature are fused, for example by splicing them together, and used as the input of the emotion recognition model; the model then recognizes the emotion data of the first caller, further improving the accuracy of model recognition (see the sketch below).
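The splicing mentioned above amounts to simple concatenation; a one-line sketch, where the ordering of the audio features is an assumption:

```python
import numpy as np

def fuse_features(text_features: np.ndarray, audio_features: np.ndarray) -> np.ndarray:
    # Splice text and audio features into a single input vector
    # for the emotion recognition model.
    return np.concatenate([text_features, audio_features])

fused = fuse_features(
    np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]),  # text (bag-of-words) features
    np.array([0.32, 4.1, 0.8, 1.5]),  # volume, speech rate, smooth, pause (assumed order)
)
```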
  • the emotion training sample set includes several emotion training samples.
  • the emotion training sample includes historical audio data, corresponding dialogue text data and corresponding emotion type data.
  • From the historical audio data, volume features, speech rate features, smooth features, pause features, and so on can be extracted, and text features can be obtained from the corresponding dialogue text data.
  • The emotion type data is the annotation data of the historical audio data and is used during model training.
  • During training, the volume feature, speech rate feature, smooth feature, pause feature, and text feature corresponding to the historical audio data are used as input data and the emotion type data as output data; a selected machine learning model learns from the emotion training sample set, which includes several emotion training samples, to obtain the emotion recognition model.
  • Step S150 Generate and send first prompt information for prompting the first caller to adjust emotions to the first call terminal according to the type data of the call scene and the emotion data of the first caller.
  • For example, the first prompt information generated and sent to the first call terminal includes prompts such as "your emotion is too excited".
  • the first prompt information may be provided to the first caller using the first call terminal in a manner of display or sound.
  • In some embodiments, step S150, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, includes step S151:
  • Step S151 Based on the prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and send the first prompt information to the first call terminal to prompt the first caller to adjust emotions.
  • the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine.
  • For example, the prompt rule engine includes a prompt rule: if the type of the call scene is a father-son call and the emotion data of the first caller is "excited", first prompt information indicating the excited emotion is generated (a simplified sketch follows).
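A toy stand-in for such a prompt rule, expressed as a lookup table rather than a real rule engine; the rule key and message follow the example above.

```python
# Simplified stand-in for the prompt rule engine; not a real rule engine.
FIRST_PROMPT_RULES = {
    ("father and son", "excited"): "Your emotion is too excited",
}

def first_prompt(scene_type: str, first_caller_emotion: str):
    # Returns the first prompt information, or None if no rule matches.
    return FIRST_PROMPT_RULES.get((scene_type, first_caller_emotion))
```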
  • In other embodiments, step S150, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, includes step S152:
  • Step S152 Based on the pre-trained first prompt model, generate and send to the first call terminal according to the type data of the call scene, the emotional data of the first caller, and the dialog text data for prompting The first prompt message for adjusting the emotion of the first caller.
  • the first prompt model may be constructed in the following manner: a machine learning algorithm is used to obtain the first prompt model from the first prompt training sample set.
  • the first prompt training sample set includes a plurality of first prompt training samples.
  • Each first prompt training sample includes type data of the historical call scene, historical emotion data corresponding to the first caller, text features corresponding to the historical dialogue text data, and prompt information corresponding to the training sample.
  • The prompt information is the annotation data of the training sample. During model training, the type data of the historical call scene, the historical emotion data corresponding to the first caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data; a selected machine learning model learns from the first prompt training sample set, which includes the first prompt training samples, to obtain the first prompt model.
  • the first prompt model can learn the verbal rules in the call based on the historical dialogue text data, and can provide prompts including the verbal information when generating and prompting information.
  • For example, the first prompt information generated and sent to the first call terminal includes prompts such as "You are excited, try talking about the weather" (see the sketch below).
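A minimal sketch of such a first prompt model, treating prompt generation as classification over a set of candidate prompt messages; the model family, the encoding of scene and emotion as integer ids, and the toy data are all assumptions, since the patent leaves the model open.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [scene-type id, first-caller emotion id, text features...] (toy encoding)
X_train = [[0, 1, 1, 0],
           [1, 0, 0, 1]]
# Annotated prompt information used as output data
y_train = ["You are excited, try talking about the weather",
           "Slow down and let the other side finish speaking"]

prompt_model = DecisionTreeClassifier().fit(X_train, y_train)
first_prompt_info = prompt_model.predict([[0, 1, 1, 0]])[0]
```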
  • Step S160 Generate and send to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • For example, if the type of the call scene is a call between mother and child and the emotion data of the second caller is "tired", a second prompt message including "your mother has been exhausted recently" is generated and sent to the first call terminal; or, if the type of the call scene is a conversation between lovers and the emotion data of the second caller is "acting like a baby", a second prompt message including "your girlfriend is acting like a baby" is generated and sent to the first call terminal; or, if the type of the call scene is a call between friends and the emotion data of the second caller is "angry", a second prompt message including "your friend is angry" is generated and sent to the first call terminal.
  • the second prompt information may be provided to the first caller using the first call terminal in a manner of display or sound.
  • In some embodiments, step S160, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller, includes step S161:
  • Step S161 Based on the prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine.
  • For example, the prompt rule engine includes a prompt rule: if the type of the call scene is a conversation between lovers and the emotion data of the second caller is "acting like a baby", a second prompt message including "your girlfriend is acting like a baby" is generated.
  • In other embodiments, step S160, generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller, includes step S162:
  • Step S162 Based on the pre-trained second prompt model, generate and send to the first call terminal according to the type data of the call scene, the emotional data of the second caller, and the dialog text data for prompting The first caller adjusts the dialogue strategy to deal with the second prompt message of the emotion of the second caller.
  • the second prompt model may be constructed in the following manner: the second prompt model is obtained by learning from the second prompt training sample set through a machine learning algorithm.
  • the second prompt training sample set includes a plurality of second prompt training samples.
  • Each second prompt training sample includes type data of the historical call scene, historical emotion data corresponding to the second caller, text features corresponding to the historical dialogue text data, and prompt information corresponding to the training sample.
  • The prompt information is the annotation data of the training sample. During model training, the type data of the historical call scene, the historical emotion data corresponding to the second caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data; a selected machine learning model learns from the second prompt training sample set, which includes the second prompt training samples, to obtain the second prompt model.
  • the second prompt model can learn the verbal rules in the call based on the historical dialogue text data, and can provide prompts including verbal information when generating and prompting information.
  • For example, if the type of the call scene is a call between mother and child and the emotion data of the second caller is "tired", the second prompt message generated and sent to the first call terminal includes "Your mother has been tired recently; ask about her daily life"; or, if the type of the call scene is a conversation between lovers and the emotion data of the second caller is "acting like a baby", the second prompt message includes "Your girlfriend is acting like a baby; tenderly call her baby"; or, if the type of the call scene is a call between friends and the emotion data of the second caller is "angry", the second prompt message includes "Your friend is angry; try talking about the weather".
  • In other embodiments, the first prompt model in step S152 and the second prompt model in step S162 can be integrated into one prompt model. Specifically, an identifier in each prompt training sample can indicate the intended prompt target; the prompt model running on the server can then generate the corresponding prompt information, predict the prompt target for that information, and send the prompt information to that target, for example to the first call terminal or to the second call terminal.
  • In some embodiments, when the first prompt information for prompting the first caller to adjust emotions is sent to the first call terminal in step S150, the sending of the first call audio corresponding to the first call terminal to the second call terminal is suspended, so as to shield the first prompt information from the second caller.
  • Similarly, when the second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions is sent to the first call terminal in step S160, the sending of the first call audio corresponding to the first call terminal to the second call terminal is suspended, so as to shield the second prompt information from the second caller.
  • Specifically, when the server sends the corresponding prompt information to the first call terminal, the first call terminal prompts the first caller by means of a voice prompt. At this time, the server can pause collecting the audio picked up by the microphone of the first call terminal, that is, the first call audio, for example by switching the call mode of the first call terminal to mute; it thereby stops sending the first call audio containing the voice prompt to the second call terminal, so neither the first prompt information nor the second prompt information is heard by the second caller. A sketch of this bridging logic follows.
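A hypothetical sketch of this server-side bridging logic: forwarding of the first terminal's audio is gated by a mute flag that is set while a voice prompt plays on the first terminal. All names here are invented for illustration.

```python
class CallBridge:
    def __init__(self):
        self.muted = False  # set while a prompt is being played on the first terminal

    async def forward_first_audio(self, chunk: bytes, send_to_second):
        if not self.muted:           # pause forwarding while prompting
            await send_to_second(chunk)

    async def play_prompt(self, prompt: str, play_on_first):
        self.muted = True            # shield the prompt from the second caller
        try:
            await play_on_first(prompt)
        finally:
            self.muted = False       # resume normal forwarding
```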
  • In the embodiments above, the corresponding audio is obtained during a call between the first call terminal and the second call terminal; the dialogue text is then obtained through voice recognition, the call scene is recognized from the dialogue text, and the callers' emotions are recognized from the acquired audio; corresponding prompts are then given to the caller according to the call scene and the callers' emotions, so that intervention is injected in a timely and accurate manner during the call to guide the caller toward a better conversation.
  • FIG. 12 is a schematic structural diagram of a voice recognition-based communication service device provided by an embodiment of the present application.
  • the voice recognition-based communication service device may be configured in a server for performing the aforementioned voice recognition-based communication service method.
  • the communication service device based on voice recognition includes: an audio acquisition module 110, a voice recognition module 120, a scene recognition module 130, an emotion recognition module 140, a first prompt module 150, and a second prompt module 160.
  • the audio obtaining module 110 is configured to obtain the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal if the call between the first call terminal and the second call terminal is connected .
  • the voice recognition module 120 is configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data.
  • the speech recognition module 120 includes:
  • the first voice sub-module 121 is configured to perform voice recognition on the first call audio to obtain the first text corresponding to the first caller;
  • the second voice submodule 122 is configured to perform voice recognition on the second call audio to obtain the second text corresponding to the second caller;
  • the text sorting sub-module 123 is used to sort the first text and the second text according to a preset sorting rule to obtain dialogue text data.
  • the scene recognition module 130 is configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene.
  • the scene recognition module 130 includes:
  • the scene rule sub-module 131 is used to analyze the dialogue text data, based on the scene rule engine with built-in scene judgment rules, to obtain the type data of the call scene.
  • the scene recognition module 130 includes:
  • the feature extraction sub-module 132 is used to extract text features in the dialogue text data
  • the scene recognition sub-module 133 is used to identify the type data of the call scene according to the text features in the conversation text data based on the trained machine learning model.
  • the emotion recognition module 140 is configured to recognize the first call audio and the second call audio based on a pre-built emotion recognition model, to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
  • the emotion recognition module 140 includes:
  • the first emotion recognition sub-module 141 is configured to recognize the first call audio and dialog text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
  • the first emotion recognition sub-module 141 includes:
  • An audio feature extraction sub-module for extracting at least one of a volume feature, a speech rate feature, a smooth feature, and a pause feature from the first call audio;
  • the emotion data acquisition sub-module is used to process the text feature and at least one of the volume feature, speech rate feature, smooth feature, and pause feature based on the pre-built emotion recognition model to obtain the first Emotional data of the first caller corresponding to the call terminal.
  • the second emotion recognition sub-module 142 is configured to recognize the second call audio and conversation text data based on a pre-built emotion recognition model to obtain the second caller's emotion data corresponding to the second call terminal.
  • the first prompt module 150 is configured to generate and send to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions.
  • the first prompting module 150 includes:
  • the first prompt rule sub-module 151 is used to analyze, based on the prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and to send the first prompt information to the first call terminal to prompt the first caller to adjust emotions.
  • the first prompting module 150 includes:
  • the first prompt generation sub-module 152 is configured to generate and send to the first call terminal, based on the pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust emotions.
  • the second prompting module 160 is configured to generate and send to the first call terminal, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the second prompting module 160 includes:
  • the second prompt rule sub-module 161 is configured to analyze, based on the prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and to send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the second prompting module 160 includes:
  • the second prompt generation sub-module 162 is configured to generate and send to the first call terminal, based on the pre-trained second prompt model and according to the type data of the call scene, the emotion data of the second caller, and the dialogue text data, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the emotions of the second caller.
  • the method and device of this application can be used in many general or special computing system environments or configurations.
  • the above-mentioned method and apparatus may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 14.
  • FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer equipment can be a server or a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium can store an operating system and a computer program.
  • the computer program includes program instructions, and when the program instructions are executed, the processor can execute any communication service method based on voice recognition.
  • the processor is used to provide computing and control capabilities and support the operation of the entire computer equipment.
  • the internal memory provides an environment for the operation of the computer program in the non-volatile storage medium.
  • the processor can execute any communication service method based on voice recognition.
  • the network interface is used for network communication, such as sending assigned tasks.
  • The structure of the computer device shown is only a block diagram of the parts related to the solution of this application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or use a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU); the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the processor is used to run a computer program stored in a memory to implement the following steps:
  • According to the type data of the call scene and the emotion data of the second caller, second prompt information is generated and sent to the first call terminal for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions.
  • the processor when the processor implements voice recognition on the first call audio and the second call audio to obtain dialog text data, it is specifically implemented: perform voice recognition on the first call audio to obtain the first call The first text corresponding to the person; perform voice recognition on the second call audio to obtain the second text corresponding to the second caller; sort the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  • the processor realizes the recognition of the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, it is specifically realized: the scene rule engine based on the built-in scene judgment rule is used for the dialogue The text data is analyzed to obtain the type data of the call scene.
  • the processor realizes the recognition of the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene, it is specifically implemented: extracting text features in the dialogue text data; based on the trained The machine learning model recognizes the type data of the call scene according to the text features in the dialog text data.
  • When the processor implements recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, it specifically implements: recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and recognizing the second call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
  • When the processor implements recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the specific implementation is: extract at least one of the volume feature, speech rate feature, smooth feature, and pause feature from the first call audio; extract the text feature from the dialogue text data; and, based on the pre-built emotion recognition model, process the text feature and at least one of the volume feature, speech rate feature, smooth feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  • When the processor implements generating and sending to the first call terminal, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, the specific implementation is: based on the prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and send the first prompt information to the first call terminal to prompt the first caller to adjust emotions; or, based on a pre-trained first prompt model, generate and send to the first call terminal, according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust emotions.
  • When the processor implements sending the first prompt information or the second prompt information to the first call terminal, it also implements: suspending the sending of the first call audio corresponding to the first call terminal to the second call terminal, so as to shield the first prompt information or the second prompt information from the second caller.
  • An embodiment of this application further provides a computer-readable storage medium that stores a computer program; the computer program includes program instructions, and a processor executes the program instructions to implement any voice recognition-based communication service method provided in the embodiments of this application.
  • the computer-readable storage medium may be the internal storage unit of the computer device described in the foregoing embodiment, such as the hard disk or memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Environmental & Geological Engineering (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

Provided are a voice recognition-based communication service method, apparatus, computer device, and storage medium, the method comprising: obtaining first audio of a first terminal and second audio of a second terminal, and recognizing them to obtain a call scene; recognizing the first audio to obtain the emotion of a first caller, and recognizing the second audio to obtain the emotion of a second caller; sending prompt information to the first terminal according to the call scene and the emotion of the first caller; and sending prompt information to the first terminal according to the call scene and the emotion of the second caller.

Description

Voice recognition-based communication service method, apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 5, 2019, with application number 201910605732.6 and the invention title "Voice recognition-based communication service method, apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of data analysis, and in particular to a voice recognition-based communication service method, apparatus, computer device, and storage medium.
Background
People can make calls through existing telecom operators or other social platforms, but the services these provide are fairly limited. For example, the communication between callers sometimes needs some intervention to better achieve its purpose, but existing communication service platforms cannot inject such intervention in a timely and accurate manner while callers are on a call, so as to guide them toward a better conversation.
Summary of the Invention
The embodiments of this application provide a voice recognition-based communication service method, apparatus, computer device, and storage medium, which can inject timely and accurate intervention while callers are on a call, so as to guide the callers toward a better conversation.
In a first aspect, this application provides a voice recognition-based communication service method, the method including:
if a call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, and sending it to the first call terminal;
generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and sending it to the first call terminal.
In a second aspect, this application provides a voice recognition-based communication service apparatus, the apparatus including:
an audio acquisition module, configured to acquire first call audio corresponding to a first call terminal and second call audio corresponding to a second call terminal if a call between the first call terminal and the second call terminal is connected;
a voice recognition module, configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data;
a scene recognition module, configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
an emotion recognition module, configured to recognize at least one of the first call audio, the second call audio, and the dialogue text data based on a pre-built emotion recognition model, to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
a first prompt module, configured to generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, and send it to the first call terminal;
a second prompt module, configured to generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and send it to the first call terminal.
In a third aspect, this application provides a computer device, the computer device including a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, implement the above voice recognition-based communication service method.
In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above voice recognition-based communication service method.
This application discloses a voice recognition-based communication service method, apparatus, device, and storage medium. Corresponding audio is acquired during a call between a first call terminal and a second call terminal; dialogue text is then obtained through voice recognition, the call scene is recognized from the dialogue text, and the callers' emotions are recognized from the acquired audio; corresponding prompts are then given to a caller according to the call scene and the emotions, thereby injecting timely and accurate intervention while the callers are on a call, so as to guide them toward a better conversation.
Description of the Drawings
In order to explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative work.
FIG. 1 is a schematic diagram of a usage scenario of a voice recognition-based communication service method according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a voice recognition-based communication service method according to an embodiment of this application;
FIG. 3 is a schematic diagram of the sub-process of obtaining dialogue text data through voice recognition;
FIG. 4 is a schematic flowchart of a voice recognition-based communication service method according to another embodiment of this application;
FIG. 5 is a schematic diagram of the sub-process of obtaining the type data of the call scene;
FIG. 6 is a schematic diagram of the sub-process of extracting text features;
FIG. 7 is a schematic diagram of the sub-process of extracting text features based on a bag-of-words model;
FIG. 8 is a schematic diagram of the sub-process of obtaining the emotion data of the first caller;
FIG. 9 is a schematic diagram of the sub-process in which the emotion recognition model obtains emotion data;
FIG. 10 is a schematic flowchart of a voice recognition-based communication service method according to still another embodiment of this application;
FIG. 11 is a schematic flowchart of a voice recognition-based communication service method according to yet another embodiment of this application;
FIG. 12 is a schematic structural diagram of a voice recognition-based communication service apparatus provided by an embodiment of this application;
FIG. 13 is a schematic structural diagram of a voice recognition-based communication service apparatus provided by another embodiment of this application;
FIG. 14 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The flowcharts shown in the drawings are merely illustrations and need not include all contents and operations/steps, nor be executed in the described order. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual execution order may change according to actual conditions. In addition, although the functional modules are divided in the schematic diagrams of the apparatus, in some cases they may be divided differently from the schematic diagrams.
The embodiments of this application provide a voice recognition-based communication service method, apparatus, computer device, and computer-readable storage medium. The communication service method can be applied to a terminal or a server to intervene in the communication between callers when needed.
In some embodiments, a first call terminal and a second call terminal conduct a call, and the voice recognition-based communication service method is applied to at least one of the first call terminal and the second call terminal. In other embodiments, a first call terminal and a second call terminal conduct a call, a server provides support for the call between them, and the voice recognition-based communication service method can be applied to that server. Please refer to FIG. 1, which is a schematic diagram of an application scenario of the voice recognition-based communication service method provided by an embodiment of this application. The application scenario includes a server, a first call terminal, and a second call terminal.
The call terminals may be electronic devices such as mobile phones, tablet computers, notebook computers, desktop computers, personal digital assistants, wearable devices, and smart speakers; the server may be an independent server or a server cluster.
For ease of understanding, however, the following embodiments describe in detail the voice recognition-based communication service method as applied to a server.
Some embodiments of this application are described in detail below with reference to the drawings. Where there is no conflict, the following embodiments and the features in them can be combined with each other.
Please refer to FIG. 2, which is a schematic flowchart of a voice recognition-based communication service method provided by an embodiment of this application.
As shown in FIG. 2, the voice recognition-based communication service method includes the following steps S110 to S160.
Step S110: if a call between a first call terminal and a second call terminal is connected, acquire first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal.
In some implementations, a first caller uses the first call terminal to dial a second caller, and the second caller uses the second call terminal to answer the call; the call between the first call terminal and the second call terminal is then connected.
When the call between the first call terminal and the second call terminal is connected and the first caller is talking with the second caller, the server provides support for the call between the two terminals. Exemplarily, the server collects the first caller's audio, i.e. the first call audio corresponding to the first call terminal, and sends it to the second call terminal so that the speaker of the second call terminal plays the audio to the second caller; the server also collects the second caller's audio, i.e. the second call audio corresponding to the second call terminal, and sends it to the first call terminal so that the speaker of the first call terminal plays the audio to the first caller. Therefore, when the server detects that the call between the first call terminal and the second call terminal is connected, it can acquire the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal.
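To make the relay role above concrete, the following is a minimal sketch; it is not part of the original disclosure, and the queue-based interface and all names are illustrative assumptions standing in for real media streams:

```python
import queue

def relay_once(first_mic: queue.Queue, second_mic: queue.Queue,
               first_speaker: queue.Queue, second_speaker: queue.Queue,
               recorder: list) -> None:
    """One pass of the server-side relay: forward each side's pending
    audio chunks to the other side, keeping a copy for recognition."""
    while not first_mic.empty():
        chunk = first_mic.get()            # first call audio
        recorder.append(("first", chunk))  # retained for step S120
        second_speaker.put(chunk)          # played to the second caller
    while not second_mic.empty():
        chunk = second_mic.get()           # second call audio
        recorder.append(("second", chunk))
        first_speaker.put(chunk)
```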
Step S120: perform voice recognition on the first call audio and the second call audio to obtain dialogue text data.
Specifically, the server converts the first call audio and the second call audio into text by means of voice recognition to obtain the dialogue text data.
In some implementations, as shown in FIG. 3, step S120 of performing voice recognition on the first call audio and the second call audio to obtain dialogue text data specifically includes steps S121 to S123.
Step S121: perform voice recognition on the first call audio to obtain first text corresponding to the first caller.
Exemplarily, while collecting the first call audio corresponding to the first call terminal, the server performs voice recognition on the collected audio and marks the recognized text as first text.
Step S122: perform voice recognition on the second call audio to obtain second text corresponding to the second caller.
Exemplarily, while collecting the second call audio corresponding to the second call terminal, the server performs voice recognition on the collected audio and marks the recognized text as second text.
Step S123: sort the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
Exemplarily, the first text and the second text are sorted by recording time to obtain the dialogue text data.
Exemplarily, the dialogue text data includes multiple first texts and second texts arranged in alternation.
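The interleaving by recording time can be sketched as follows; this is a minimal illustration, and the `Utterance` structure and its timestamp field are assumptions, since the original only specifies sorting by recording time:

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str        # "first" or "second"
    record_time: float  # recording time in seconds (assumed available)
    text: str

def build_dialogue_text(first_texts, second_texts):
    """Sort the first and second texts by recording time (step S123),
    yielding dialogue text data with the two sides interleaved."""
    return sorted(first_texts + second_texts, key=lambda u: u.record_time)

dialogue = build_dialogue_text(
    [Utterance("first", 0.0, "Hello, Mr. Wang, I am so-and-so.")],
    [Utterance("second", 2.3, "Hello, what is this about?")],
)
```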
Step S130: recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene.
In some implementations, the scene recognition model stores or has learned several scene recognition rules, based on which it recognizes the call scene corresponding to the dialogue text data.
In some implementations, as shown in FIG. 4, step S130 of recognizing the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene includes step S131.
Step S131: based on a scene rule engine with built-in scene judgment rules, analyze the dialogue text data to obtain the type data of the call scene.
Exemplarily, the scene rule engine is a rule engine with built-in scene judgment rules, such as the Drools rule engine. Rule engines originated from rule-based expert systems, which are a branch of expert systems. Expert systems belong to the field of artificial intelligence; they imitate human reasoning, reason using heuristic methods, and explain and justify their conclusions in terms humans can understand. A rule engine is a core technical component designed to respond to and process complex business rules; by introducing a rule engine, scene judgment rules can be defined and adjusted dynamically and promptly through flexible configuration.
Exemplarily, the scene judgment rules built into the scene rule engine are rules set based on human practical experience, and this embodiment does not limit how the preset scene judgment rules are configured. For example, if the dialogue text data includes "Hello, Mr. Wang, I am so-and-so", the scene recognition model recognizes, based on a certain scene judgment rule, that the type of the call scene corresponding to the dialogue text data is a call between strangers.
The construction of the scene rule engine includes: first, obtaining several scene judgment rules matching a preset rule modification template; then precompiling and testing the scene judgment rules, and generating a script file from them after the tests pass; and finally storing the script file on the server and associating it with the rule invocation interface of the scene rule engine, so that the scene rule engine can invoke the corresponding scene judgment rules.
In some implementations, the rule modification template is a visual rule modification template. Visualizing the template makes it easier for relevant personnel to edit it directly to generate scene judgment rules; personnel who understand how call scenes are judged can thus modify the rules through the template without understanding the implementation behind it, further lowering the threshold for using the rule engine and helping improve the accuracy of the scene rule engine's recognition of call scenes.
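As a rough Python stand-in for such a rule engine (the original uses Drools; the rule format, trigger phrase, and scene label below are illustrative assumptions only):

```python
# Each rule pairs a predicate over the dialogue text with a scene type.
SCENE_RULES = [
    (lambda text: "Hello, Mr." in text and "I am" in text, "stranger call"),
]

def judge_scene(dialogue_text: str, default: str = "unknown") -> str:
    """Return the scene type of the first matching scene judgment rule."""
    for predicate, scene_type in SCENE_RULES:
        if predicate(dialogue_text):
            return scene_type
    return default

assert judge_scene("Hello, Mr. Wang, I am so-and-so.") == "stranger call"
```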
In other implementations, the scene recognition model can be constructed as follows: the scene recognition model is obtained by learning from a scene training sample set through a machine learning algorithm.
As shown in FIG. 5, step S130 of recognizing the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene then includes steps S132 and S133.
Step S132: extract text features from the dialogue text data.
When recognizing the call scene corresponding to the dialogue text data, features must be extracted from the dialogue text data so as to keep only the information valuable for recognition; using all the words would cause the curse of dimensionality.
Exemplarily, feature words are extracted from the dialogue text data and quantified to represent the text information, i.e. the text features of the dialogue text data, thereby scientifically abstracting the dialogue text data and establishing a mathematical model of it to describe and stand in for the dialogue text data.
Exemplarily, text features are extracted from the dialogue text data based on a bag-of-words (BOW) model.
In some implementations, as shown in FIG. 6, step S132 of extracting the text features from the dialogue text data includes steps S1321 and S1322.
Step S1321: filter out noise characters from the dialogue text data according to a preset filtering rule.
Exemplarily, according to a preset stop-word database including several stop words, the stop words in the dialogue text data are deleted or replaced with a preset symbol.
Specifically, certain special words, such as the noise characters "的" and "得" and other invalid words, can be designated as stop words according to the call scene, so as to build a stop-word database that is saved as a configuration file. The server loads the stop-word database when needed.
Specifically, each stop word in the stop-word database is looked up to see whether it appears in the dialogue text data, and if so it is deleted; alternatively, each occurrence of a stop word in the dialogue text data is replaced with a preset symbol, such as a space, so as to preserve the structure of the dialogue text data to a certain extent.
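A minimal sketch of this filtering step (the stop words shown mirror the examples above; loading them from a configuration file is omitted):

```python
STOP_WORDS = {"的", "得"}  # in practice loaded from a configuration file

def filter_noise(dialogue_text: str, replacement: str = " ") -> str:
    """Replace each stop word with a preset symbol (a space here), which
    preserves the structure of the dialogue text to a certain extent;
    pass replacement="" to delete the stop words instead."""
    for stop_word in STOP_WORDS:
        dialogue_text = dialogue_text.replace(stop_word, replacement)
    return dialogue_text
```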
Step S1322: based on the bag-of-words model, extract text features from the dialogue text data with noise characters filtered out.
A bag of words (BOW) is a representation of text that describes the occurrence of word elements in a document. The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. It involves two aspects: a collection of known words, and testing for the presence of those known words.
Specifically, the bag-of-words model includes a dictionary containing several words. The model divides the noise-filtered dialogue text data into individual words; imagine putting all the words into a bag, ignoring word order, grammar, syntax, and other elements, and treating the text merely as a collection of words. The occurrence of each word in the dialogue text data is independent and does not depend on whether other words appear. The text features the bag-of-words model extracts from the noise-filtered dialogue text data include a bag-of-words feature vector.
Exemplarily, as shown in FIG. 7, step S1322 of extracting text features from the noise-filtered dialogue text data based on the bag-of-words model includes steps S1301 to S1303.
Step S1301: initialize an all-zero bag-of-words feature vector.
The elements of the bag-of-words feature vector correspond one-to-one with the words in the dictionary of the bag-of-words model.
Exemplarily, given the bag-of-words dictionary {1: "小明", 2: "喜欢", 3: "看", 4: "电影", 5: "也", 6: "踢", 7: "足球"}, the all-zero bag-of-words feature vector is initialized as [0, 0, 0, 0, 0, 0, 0].
Step S1302: count the number of times each word in the dictionary occurs in the dialogue text data from which the noise characters have been filtered out.
Step S1303: assign values to the corresponding elements of the bag-of-words feature vector according to the number of times each word occurs in the dialogue text data.
Exemplarily, if the noise-filtered dialogue text data is "小明喜欢看电影" (Xiao Ming likes watching movies), the bag-of-words feature vector is [1, 1, 1, 1, 0, 0, 0]. If the noise-filtered dialogue text data is "小明喜欢看电影小明也喜欢踢足球" (Xiao Ming likes watching movies and Xiao Ming also likes playing football), the bag-of-words feature vector is [2, 2, 1, 1, 1, 1, 1].
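Steps S1301 to S1303 and the example above can be reproduced in a few lines; the tokenization into dictionary words is assumed to have been done already:

```python
DICTIONARY = ["小明", "喜欢", "看", "电影", "也", "踢", "足球"]

def bag_of_words(tokens):
    """Initialize an all-zero vector (S1301), count dictionary words
    (S1302), and assign the counts to the vector elements (S1303)."""
    vector = [0] * len(DICTIONARY)
    for token in tokens:
        if token in DICTIONARY:
            vector[DICTIONARY.index(token)] += 1
    return vector

assert bag_of_words(["小明", "喜欢", "看", "电影"]) == [1, 1, 1, 1, 0, 0, 0]
assert bag_of_words(["小明", "喜欢", "看", "电影",
                     "小明", "也", "喜欢", "踢", "足球"]) == [2, 2, 1, 1, 1, 1, 1]
```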
Step S133: based on a trained machine learning model, identify the type data of the call scene according to the text features of the dialogue text data.
Specifically, the text features of the dialogue text data are used as the input of the trained machine learning model, and the model's output is the type data of the recognized call scene.
In some implementations, the scene training sample set used to train the machine learning model includes several scene training samples. Each scene training sample includes two kinds of information: historical dialogue text data, and the scene type data corresponding to that historical dialogue text data. Text features can be extracted from the historical dialogue text data, and the scene type data is the annotation of the historical dialogue text data. During model training, the text features corresponding to the historical dialogue text data are used as input data and the scene type data as output data, and a selected machine learning model learns from a scene training sample set containing a large number of scene training samples to obtain the trained machine learning model.
In some implementations, the trained machine learning model may be configured to recognize only a single call scene type, in which case the type data obtained by recognizing the dialogue text data with the pre-built scene recognition model reflects whether the conversation between the first and second callers belongs to a particular call scene. In other implementations, the trained model may be configured to recognize multiple scene types at once, in which case the type data reflects the probabilities that the conversation between the first and second callers belongs to each of several particular call scenes. For example, in one embodiment, the type data obtained by recognizing the dialogue text data with the pre-built scene recognition model gives probabilities of 40% and 43% for the scene types "friend" and "borrowing money" respectively; both exceed the preset threshold of 30%, so the call scene types corresponding to the dialogue text data are "friend" and "borrowing money".
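The multi-scene case reduces to thresholding the model's per-scene probabilities, roughly as follows; the probability values are taken from the example above, while the dictionary interface is an assumption:

```python
def scene_types(probabilities: dict, threshold: float = 0.30) -> list:
    """Return every scene type whose probability exceeds the threshold."""
    return [scene for scene, p in probabilities.items() if p > threshold]

# the example above: both "friend" and "borrowing money" exceed 30%
assert scene_types({"friend": 0.40, "borrowing money": 0.43}) == [
    "friend", "borrowing money"]
```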
Step S140: recognize the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
In some implementations, the server recognizes the first call audio based on the pre-built emotion recognition model to obtain the first caller's emotion data, and recognizes the second call audio based on the pre-built emotion recognition model to obtain the second caller's emotion data.
Exemplarily, the emotion recognition model is obtained by learning from an emotion training sample set through a machine learning algorithm.
The emotion training sample set includes several emotion training samples. Each emotion training sample includes two kinds of information: historical audio data, and the emotion type data corresponding to that historical audio data. Feature data, such as volume features, speech-rate features, fluency features, and pause features, can be extracted from the historical audio data; the emotion type data is the annotation of the historical audio data. During model training, the feature data corresponding to the historical audio data are used as input data and the emotion type data as output data, and a selected machine learning model learns from an emotion training sample set containing several emotion training samples to obtain the emotion recognition model.
In some implementations, the first call audio is first processed to obtain a fluency feature reflecting the smoothness of the first caller's speech, and a pause feature reflecting the duration of pauses. Specifically, the fluency feature is obtained by detecting and evaluating the jitter frequency of the first caller's voice, and the pause feature is obtained by starting a timer when the first and second callers' voices stop. The trained emotion recognition model can recognize the first caller's emotion data based on the fluency feature, pause feature, volume feature, and/or speech-rate feature. Correspondingly, the emotion recognition model can recognize the second call audio to obtain the second caller's emotion data.
Exemplarily, when the volume of the first call audio is higher than a preset threshold, the emotion recognition model identifies the emotion data of the first caller corresponding to the first call terminal as "excited"; when the jitter frequency of the first caller's voice is higher than a preset frequency threshold, the model identifies the first caller's emotion data as "nervous".
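These threshold judgments can be sketched as follows; the concrete threshold values and the default label are illustrative assumptions, since the original only states that the thresholds are preset:

```python
def threshold_emotion(volume: float, jitter_frequency: float,
                      volume_threshold: float = 0.8,
                      jitter_threshold: float = 6.0) -> str:
    """Mirror the two example judgments above: high volume reads as
    'excited', high voice jitter frequency reads as 'nervous'."""
    if volume > volume_threshold:
        return "excited"
    if jitter_frequency > jitter_threshold:
        return "nervous"
    return "calm"  # assumed default when neither threshold is crossed
```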
In some implementations, the emotion recognition model recognizes the dialogue text data to obtain text features, and can also identify the emotion data of the first or second caller from those text features. For example, if the second text in the dialogue text data includes the sentence "You need to calm down, don't get worked up" from the second caller, the emotion recognition model can identify the first caller's emotion as "excited"; if the second text includes the sentence "You ***" from the second caller, the model can identify the second caller's emotion as "excited" or "angry".
In some implementations, as shown in FIG. 8, step S140 of recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal specifically includes steps S141 and S142.
Step S141: recognize the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
Specifically, the volume feature, speech-rate feature, fluency feature, and/or pause feature extracted from the first call audio are fused with the text features extracted from the dialogue text data and used as the input of the emotion recognition model, which then identifies the first caller's emotion data; this further improves the recognition accuracy of the model.
Step S142: recognize the second call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
Specifically, the volume feature, speech-rate feature, fluency feature, and/or pause feature extracted from the second call audio are fused with the text features extracted from the dialogue text data and used as the input of the emotion recognition model, which then identifies the second caller's emotion data; this further improves the recognition accuracy of the model.
Exemplarily, as shown in FIG. 9, step S141 of recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal specifically includes steps S1411 to S1413.
Step S1411: extract at least one of a volume feature, a speech-rate feature, a fluency feature, and a pause feature from the first call audio.
Specifically, the volume feature reflects the amplitude of the first call audio; the speech-rate feature is obtained by calculating the rate of change of the energy envelope of the first call audio in the time domain; the fluency feature is obtained by detecting and evaluating the jitter frequency of the first caller's voice; and the pause feature is obtained by starting a timer when the first and second callers' voices stop.
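A rough sketch of extracting these audio features (the frame size, silence threshold, and exact formulas are illustrative assumptions; the original only names the quantities involved):

```python
import numpy as np

def extract_audio_features(samples: np.ndarray, frame_size: int = 400) -> dict:
    """Volume from amplitude, speech rate from the change rate of the
    time-domain energy envelope, and pause from the share of silent frames."""
    usable = len(samples) // frame_size * frame_size
    frames = samples[:usable].reshape(-1, frame_size)
    energy_envelope = (frames ** 2).mean(axis=1)
    return {
        "volume": float(np.abs(samples).mean()),
        "speech_rate": float(np.abs(np.diff(energy_envelope)).mean()),
        "pause": float((energy_envelope < 1e-4).mean()),  # assumed threshold
    }
```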
Step S1412: extract text features from the dialogue text data.
Specifically, the text features of the dialogue text data extracted in step S132 can be reused.
Step S1413: based on the pre-built emotion recognition model, process the text features together with at least one of the volume feature, speech-rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
Specifically, the text features and the volume, speech-rate, fluency, and pause features are fused, for example by concatenation, and used as the input of the emotion recognition model, which identifies the first caller's emotion data; this further improves the recognition accuracy of the model.
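The fusion by concatenation amounts to, for example (a sketch; the specific feature values shown are placeholders):

```python
import numpy as np

def fuse_features(audio_features: np.ndarray,
                  text_features: np.ndarray) -> np.ndarray:
    """Splice the audio features and text features into one input vector
    for the emotion recognition model."""
    return np.concatenate([audio_features, text_features])

model_input = fuse_features(
    np.array([0.7, 3.2, 0.1]),        # volume, speech rate, pause
    np.array([2, 2, 1, 1, 1, 1, 1]),  # bag-of-words feature vector
)
```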
Here the emotion training sample set includes several emotion training samples, each including historical audio data, corresponding dialogue text data, and corresponding emotion type data. Volume, speech-rate, fluency, and pause features can be extracted from the historical audio data, and text features from the dialogue text data; the emotion type data is the annotation of the historical audio data. During model training, the volume, speech-rate, fluency, and pause features corresponding to the historical audio data, together with the text features, are used as input data and the emotion type data as output data, and a selected machine learning model learns from an emotion training sample set containing several emotion training samples to obtain the emotion recognition model.
Step S150: generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust emotions, and send it to the first call terminal.
Exemplarily, if the type of the call scene is a father-son call and the first caller's emotion data is "very agitated", the first prompt information generated and sent to the first call terminal includes "Your emotions are running too high", etc.
Exemplarily, the first prompt information may be provided to the first caller using the first call terminal by display or by sound.
In some embodiments, as shown in FIG. 10, step S150 of generating first prompt information for prompting the first caller to adjust emotions according to the type data of the call scene and the emotion data of the first caller, and sending it to the first call terminal, includes step S151:
Step S151: based on a prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and send the first prompt information to the first call terminal to prompt the first caller to adjust emotions.
Exemplarily, the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine. For example, the prompt rule engine includes a prompt rule: if the type of the call scene is father-son and the first caller's emotion data is "very agitated", generate first prompt information including "Your emotions are running too high".
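A minimal stand-in for such a prompt rule (the lookup-table structure is an assumption; the wording mirrors the example above):

```python
FIRST_PROMPT_RULES = {
    ("father-son call", "very agitated"): "Your emotions are running too high",
}

def first_prompt(scene_type: str, emotion: str):
    """Map (call scene type, first caller's emotion) to the first prompt
    information, or None when no rule matches."""
    return FIRST_PROMPT_RULES.get((scene_type, emotion))
```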
In other embodiments, as shown in FIG. 11, step S150 of generating first prompt information for prompting the first caller to adjust emotions according to the type data of the call scene and the emotion data of the first caller, and sending it to the first call terminal, includes step S152:
Step S152: based on a pre-trained first prompt model, generate, according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust emotions, and send it to the first call terminal.
In some implementations, the first prompt model can be constructed as follows: the first prompt model is obtained by learning from a first prompt training sample set through a machine learning algorithm.
The first prompt training sample set includes several first prompt training samples. Each first prompt training sample includes type data of a historical call scene, historical emotion data corresponding to the first caller, text features corresponding to historical dialogue text data, and the prompt information corresponding to the training sample. The prompt information is the annotation of the training sample; during model training, the type data of the historical call scene, the historical emotion data corresponding to the first caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data, and a selected machine learning model learns from a first prompt training sample set containing first prompt training samples to obtain the first prompt model.
The first prompt model can thus learn conversational phrasing rules from the historical dialogue text data and include suggested phrasing in the prompt information it generates.
Exemplarily, if the type of the call scene is a father-son call and the first caller's emotion data is "very agitated", the first prompt information generated and sent to the first call terminal includes "Your emotions are running too high; try talking about the weather", etc.
Step S160: generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and send it to the first call terminal.
Exemplarily, if the type of the call scene is a mother-child call and the second caller's emotion data is "exhausted", the second prompt information generated and sent to the first call terminal includes "Your mother has been rather tired lately"; if the type of the call scene is a call between lovers and the second caller's emotion data is "acting cute", the second prompt information includes "Your girlfriend is acting cute"; if the type of the call scene is a call between friends and the second caller's emotion data is "angry", the second prompt information includes "Your friend is angry", etc.
Exemplarily, the second prompt information may be provided to the first caller using the first call terminal by display or by sound.
In some embodiments, as shown in FIG. 10, step S160 of generating second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions according to the type data of the call scene and the emotion data of the second caller, and sending it to the first call terminal, includes step S161:
Step S161: based on a prompt rule engine with built-in prompt rules, analyze the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy to deal with the second caller's emotions.
Exemplarily, the prompt rule engine is a rule engine with built-in prompt rules, such as the Drools rule engine. For example, the prompt rule engine includes a prompt rule: if the type of the call scene is a couple and the second caller's emotion data is "acting cute", generate second prompt information including "Your girlfriend is acting cute".
In other embodiments, as shown in FIG. 11, step S160 of generating second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions according to the type data of the call scene and the emotion data of the second caller, and sending it to the first call terminal, includes step S162:
Step S162: based on a pre-trained second prompt model, generate, according to the type data of the call scene, the emotion data of the second caller, and the dialogue text data, second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions, and send it to the first call terminal.
In some implementations, the second prompt model can be constructed as follows: the second prompt model is obtained by learning from a second prompt training sample set through a machine learning algorithm.
The second prompt training sample set includes several second prompt training samples. Each second prompt training sample includes type data of a historical call scene, historical emotion data corresponding to the second caller, text features corresponding to historical dialogue text data, and the prompt information corresponding to the training sample. The prompt information is the annotation of the training sample; during model training, the type data of the historical call scene, the historical emotion data corresponding to the second caller, and the text features corresponding to the historical dialogue text data are used as input data and the prompt information as output data, and a selected machine learning model learns from a second prompt training sample set containing second prompt training samples to obtain the second prompt model.
The second prompt model can thus learn conversational phrasing rules from the historical dialogue text data and include suggested phrasing in the prompt information it generates.
Exemplarily, if the type of the call scene is a mother-child call and the second caller's emotion data is "exhausted", the second prompt information generated and sent to the first call terminal includes "Your mother has been rather tired lately; ask after her life"; if the type of the call scene is a call between lovers and the second caller's emotion data is "acting cute", the second prompt information includes "Your girlfriend is acting cute; gently call her sweetheart"; if the type of the call scene is a call between friends and the second caller's emotion data is "angry", the second prompt information includes "Your friend is angry; try talking about the weather", etc.
It should be understood that the terms "first" and "second" in the specification and drawings of this application are used to distinguish different objects, or different treatments of the same object, rather than to describe a particular order of objects; they should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
Exemplarily, corresponding prompt information for prompting the second caller to adjust emotions may also be generated according to the type data of the call scene and the emotion data of the second caller and sent to the second call terminal; likewise, corresponding prompt information for prompting the second caller to adjust the dialogue strategy to deal with the first caller's emotions may be generated according to the type data of the call scene and the emotion data of the first caller and sent to the second call terminal.
In some implementations, the first prompt model in step S152 and the second prompt model in step S162 may be combined into one prompt model. Specifically, an identifier indicating the prompt target can be placed in the prompt training samples, so that a prompt model running on the server, for example, can generate the corresponding prompt information, predict the prompt target for that information, and send the prompt information to that target, e.g. to the first call terminal or the second call terminal.
In some embodiments, when the first prompt information for prompting the first caller to adjust emotions is sent to the first call terminal in step S150, the sending of the first call audio corresponding to the first call terminal to the second call terminal is paused, so as to shield the first prompt information from the second caller.
In some embodiments, when the second prompt information for prompting the first caller to adjust the dialogue strategy to deal with the second caller's emotions is sent to the first call terminal in step S160, the sending of the first call audio corresponding to the first call terminal to the second call terminal is paused, so as to shield the second prompt information from the second caller.
Specifically, when the server sends the corresponding prompt information to the first call terminal, the first call terminal prompts the first caller by means of a voice prompt; at this point the server can pause collecting the audio acquired by the first call terminal's microphone, i.e. the first call audio, for example by setting the call mode of the first call terminal to mute, thereby stopping the first call audio containing the voice prompt from being sent to the second call terminal, so that the first prompt information and the second prompt information will not be heard by the second caller.
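The shielding logic can be sketched as follows; the terminal and relay objects and their method names are assumptions, since the original only specifies pausing the forwarding while the prompt is played:

```python
def deliver_prompt(prompt_text: str, first_terminal, relay) -> None:
    """Play a voice prompt to the first caller while the forwarding of
    the first call audio to the second call terminal is paused, so the
    second caller cannot hear the prompt."""
    relay.pause_first_to_second()         # e.g. set the call mode to mute
    try:
        first_terminal.play(prompt_text)  # voice prompt to the first caller
    finally:
        relay.resume_first_to_second()    # restore normal forwarding
```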
The voice recognition-based communication service method provided by the above embodiments acquires the corresponding audio during a call between the first call terminal and the second call terminal, obtains the dialogue text through voice recognition, recognizes the call scene from the dialogue text, and recognizes the callers' emotions from the acquired audio; it then gives the callers corresponding prompts according to the call scene and the emotions, thereby injecting timely and accurate intervention while the callers are on a call, so as to guide them toward a better conversation.
Please refer to FIG. 12, which is a schematic structural diagram of a voice recognition-based communication service apparatus provided by an embodiment of the present application. The apparatus may be configured in a server and used to perform the aforementioned voice recognition-based communication service method.
As shown in FIG. 12, the voice recognition-based communication service apparatus includes: an audio acquisition module 110, a voice recognition module 120, a scene recognition module 130, an emotion recognition module 140, a first prompt module 150, and a second prompt module 160.
The audio acquisition module 110 is configured to acquire, if the call between the first call terminal and the second call terminal is connected, the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal.
The voice recognition module 120 is configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data.
Specifically, as shown in FIG. 13, the voice recognition module 120 includes:
a first voice sub-module 121, configured to perform voice recognition on the first call audio to obtain the first text corresponding to the first caller;
a second voice sub-module 122, configured to perform voice recognition on the second call audio to obtain the second text corresponding to the second caller; and
a text sorting sub-module 123, configured to sort the first text and the second text according to a preset sorting rule to obtain the dialogue text data; one possible sorting rule is sketched below.
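The application does not fix the preset sorting rule; one plausible choice, given here as an assumption, is to interleave the two callers' recognized utterances by their start timestamps.

```python
# Each utterance is a (start_seconds, text) pair produced by the
# per-channel voice recognition for that caller.
def merge_dialogue(first_texts, second_texts):
    tagged = [(t, "caller 1", s) for t, s in first_texts] + \
             [(t, "caller 2", s) for t, s in second_texts]
    # Sorting by start time yields the dialogue text data in speaking order.
    return "\n".join(f"{who}: {text}" for t, who, text in sorted(tagged))

# Example:
# merge_dialogue([(0.0, "Hello, how can I help?")], [(1.2, "My order is late.")])
# -> "caller 1: Hello, how can I help?\ncaller 2: My order is late."
```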
The scene recognition module 130 is configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene.
In some embodiments, as shown in FIG. 13, the scene recognition module 130 includes:
a scene rule sub-module 131, configured to analyze the dialogue text data, based on a scene rule engine with built-in scene judgment rules, to obtain the type data of the call scene; an illustrative sketch follows.
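A sketch of this rule-engine branch, reduced for illustration to keyword matching over the dialogue text; the keywords and scene names below are invented, not rules from this application.

```python
# Built-in scene judgment rules: any matching keyword assigns the scene type.
SCENE_RULES = [
    ({"refund", "broken", "complain"}, "complaint"),
    ({"bill", "charge", "amount"}, "billing_inquiry"),
]

def scene_type(dialogue_text):
    words = set(dialogue_text.lower().split())
    for keywords, scene in SCENE_RULES:
        if words & keywords:      # a rule keyword appears in the dialogue
            return scene
    return "general"              # fallback when no rule fires
```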
In other embodiments, as shown in FIG. 13, the scene recognition module 130 includes:
a feature extraction sub-module 132, configured to extract text features from the dialogue text data; and
a scene recognition sub-module 133, configured to recognize the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model, as sketched below.
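A minimal sketch of this machine-learning branch, with TF-IDF text features and a linear classifier standing in for the trained scene recognition model; the training dialogues and scene labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

dialogues = [
    "my order has not arrived and no one is answering me",
    "I would like to check the amount of this month's bill",
]
scene_labels = ["complaint", "billing_inquiry"]

# Text features are extracted by the vectorizer; the classifier maps them
# to the type data of the call scene.
scene_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
scene_model.fit(dialogues, scene_labels)

print(scene_model.predict(["the bill this month looks wrong"]))
```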
The emotion recognition module 140 is configured to recognize the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal.
Specifically, as shown in FIG. 13, the emotion recognition module 140 includes:
a first emotion recognition sub-module 141, configured to recognize the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal.
Exemplarily, the first emotion recognition sub-module 141 includes:
an audio feature extraction sub-module, configured to extract at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
a text feature extraction sub-module, configured to extract text features from the dialogue text data; and
an emotion data acquisition sub-module, configured to process, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal. A sketch of the acoustic side of this feature extraction follows.
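A sketch under stated assumptions: frame energy stands in for the volume feature and the silent/non-silent ratio for the pause feature. librosa is used for illustration; the top_db threshold and 16 kHz rate are invented, and the speech rate and fluency features (which need word timings from the recognizer) are omitted.

```python
import numpy as np
import librosa

def acoustic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    volume = float(np.mean(librosa.feature.rms(y=y)))    # average frame energy
    voiced = librosa.effects.split(y, top_db=30)         # non-silent intervals
    speech_seconds = sum(end - start for start, end in voiced) / sr
    pause_ratio = 1.0 - speech_seconds / (len(y) / sr)   # fraction of silence
    # These values would be concatenated with text features before being
    # fed to the emotion recognition model.
    return np.array([volume, pause_ratio])
```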
The second emotion recognition sub-module 142 is configured to recognize the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
The first prompt module 150 is configured to generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and to send it to the first call terminal.
In some embodiments, as shown in FIG. 13, the first prompt module 150 includes:
a first prompt rule sub-module 151, configured to analyze, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and to send the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; a sketch of such a rule engine follows.
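A sketch of the prompt rule engine, with built-in rules mapping a (scene type, emotion) pair to canned first prompt information; the rule table entries are invented examples, not rules from this application.

```python
PROMPT_RULES = {
    ("complaint", "angry"): "Take a breath and lower your speaking volume.",
    ("sales_call", "impatient"): "Slow down and let the customer finish speaking.",
}

def first_prompt_info(scene_type, emotion):
    # Returns None when no built-in rule matches, i.e. no prompt is needed.
    return PROMPT_RULES.get((scene_type, emotion))
```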
In other embodiments, as shown in FIG. 13, the first prompt module 150 includes:
a first prompt generation sub-module 152, configured to generate, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, first prompt information for prompting the first caller to adjust his or her emotions, and to send it to the first call terminal.
The second prompt module 160 is configured to generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and to send it to the first call terminal.
In some embodiments, as shown in FIG. 13, the second prompt module 160 includes:
a second prompt rule sub-module 161, configured to analyze, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the second caller to obtain the corresponding second prompt information, and to send the second prompt information to the first call terminal to prompt the first caller to adjust the dialogue strategy in response to the second caller's emotions.
In other embodiments, as shown in FIG. 13, the second prompt module 160 includes:
a second prompt generation sub-module 162, configured to generate, based on a pre-trained second prompt model and according to the type data of the call scene, the emotion data of the second caller, and the dialogue text data, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and to send it to the first call terminal.
It should be noted that, as those skilled in the art can clearly understand, for convenience and brevity of description, the specific working processes of the apparatus, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The method and apparatus of the present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
Exemplarily, the above method and apparatus may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 14.
Please refer to FIG. 14, which is a schematic structural diagram of a computer device provided by an embodiment of the present application. The computer device may be a server or a terminal.
Referring to FIG. 14, the computer device includes a processor, a memory, and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium can store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to perform any of the voice recognition-based communication service methods.
The processor is configured to provide computing and control capabilities and to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor, the processor performs any of the voice recognition-based communication service methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the illustrated structure is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In one embodiment, the processor is configured to run the computer program stored in the memory to implement the following steps:
if the call between the first call terminal and the second call terminal is connected, acquiring the first call audio corresponding to the first call terminal and the second call audio corresponding to the second call terminal;
performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
recognizing the dialogue text data based on a pre-built scene recognition model to obtain the type data of the call scene;
recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal;
generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal;
generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
Specifically, when performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data, the processor specifically implements: performing voice recognition on the first call audio to obtain the first text corresponding to the first caller; performing voice recognition on the second call audio to obtain the second text corresponding to the second caller; and sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
Specifically, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor specifically implements: analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene.
Alternatively, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor specifically implements: extracting text features from the dialogue text data; and recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
Specifically, when recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, the processor specifically implements: recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
Specifically, when recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the processor specifically implements: extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio; extracting text features from the dialogue text data; and processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
Specifically, when generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal, the processor specifically implements: analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or alternatively: generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
Specifically, when sending the first prompt information or the second prompt information to the first call terminal, the processor further implements: suspending the sending of the first call audio corresponding to the first call terminal to the second call terminal, so as to shield the first prompt information or the second prompt information from the second caller.
From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. Such a computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application or in certain parts thereof, for example:
a computer-readable storage medium storing a computer program, the computer program including program instructions, where a processor executes the program instructions to implement any of the voice recognition-based communication service methods provided by the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as the hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Within the technical scope disclosed in the present application, any person skilled in the art can readily conceive of various equivalent modifications or replacements, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voice recognition-based communication service method, comprising:
    if the call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal; and
    generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
  2. The communication service method according to claim 1, wherein performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data comprises:
    performing voice recognition on the first call audio to obtain a first text corresponding to the first caller;
    performing voice recognition on the second call audio to obtain a second text corresponding to the second caller; and
    sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  3. The communication service method according to claim 1, wherein recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene comprises:
    analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene; or
    recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene by:
    extracting text features from the dialogue text data; and
    recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
  4. The communication service method according to claim 1, wherein recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal comprises:
    recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and
    recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
  5. The communication service method according to claim 4, wherein recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal comprises:
    extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
    extracting text features from the dialogue text data; and
    processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  6. The communication service method according to claim 1, wherein generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal comprises:
    analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or
    generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
  7. The communication service method according to claim 1, wherein, when the first prompt information or the second prompt information is sent to the first call terminal, the sending of the first call audio corresponding to the first call terminal to the second call terminal is suspended so as to shield the first prompt information or the second prompt information from the second caller.
  8. A voice recognition-based communication service apparatus, comprising:
    an audio acquisition module, configured to acquire, if the call between a first call terminal and a second call terminal is connected, first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    a voice recognition module, configured to perform voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    a scene recognition module, configured to recognize the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    an emotion recognition module, configured to recognize the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    a first prompt module, configured to generate, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and to send it to the first call terminal; and
    a second prompt module, configured to generate, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and to send it to the first call terminal.
  9. A computer device, comprising a memory and a processor;
    the memory being configured to store a computer program; and
    the processor being configured to execute the computer program and, when executing the computer program, to implement the following steps:
    if the call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal; and
    generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
  10. The computer device according to claim 9, wherein, when performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data, the processor is configured to implement the following steps:
    performing voice recognition on the first call audio to obtain a first text corresponding to the first caller;
    performing voice recognition on the second call audio to obtain a second text corresponding to the second caller; and
    sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  11. The computer device according to claim 9, wherein, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor is configured to implement the following steps:
    analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene; or
    recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene by:
    extracting text features from the dialogue text data; and
    recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
  12. The computer device according to claim 9, wherein, when recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, the processor is configured to implement the following steps:
    recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and
    recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal.
  13. The computer device according to claim 12, wherein, when recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the processor is configured to implement the following steps:
    extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
    extracting text features from the dialogue text data; and
    processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  14. The computer device according to claim 9, wherein, when generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal, the processor is configured to implement the following steps:
    analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or
    generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
  15. The computer device according to claim 9, wherein, when sending the first prompt information or the second prompt information to the first call terminal, the processor is configured to implement the following step: suspending the sending of the first call audio corresponding to the first call terminal to the second call terminal so as to shield the first prompt information or the second prompt information from the second caller.
  16. A computer-readable storage medium storing a computer program, wherein, if the computer program is executed by a processor, the following steps are implemented:
    if the call between a first call terminal and a second call terminal is connected, acquiring first call audio corresponding to the first call terminal and second call audio corresponding to the second call terminal;
    performing voice recognition on the first call audio and the second call audio to obtain dialogue text data;
    recognizing the dialogue text data based on a pre-built scene recognition model to obtain type data of the call scene;
    recognizing the first call audio and the second call audio based on a pre-built emotion recognition model to obtain emotion data of a first caller corresponding to the first call terminal and emotion data of a second caller corresponding to the second call terminal;
    generating, according to the type data of the call scene and the emotion data of the first caller, first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal; and
    generating, according to the type data of the call scene and the emotion data of the second caller, second prompt information for prompting the first caller to adjust the dialogue strategy in response to the second caller's emotions, and sending it to the first call terminal.
  17. The storage medium according to claim 16, wherein, when performing voice recognition on the first call audio and the second call audio to obtain the dialogue text data, the processor is configured to implement the following steps:
    performing voice recognition on the first call audio to obtain a first text corresponding to the first caller;
    performing voice recognition on the second call audio to obtain a second text corresponding to the second caller; and
    sorting the first text and the second text according to a preset sorting rule to obtain the dialogue text data.
  18. The storage medium according to claim 16, wherein, when recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene, the processor is configured to implement the following steps:
    analyzing the dialogue text data based on a scene rule engine with built-in scene judgment rules to obtain the type data of the call scene; or
    recognizing the dialogue text data based on the pre-built scene recognition model to obtain the type data of the call scene by:
    extracting text features from the dialogue text data; and
    recognizing the type data of the call scene from the text features of the dialogue text data based on a trained machine learning model.
  19. The storage medium according to claim 16, wherein, when recognizing the first call audio and the second call audio based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal and the emotion data of the second caller corresponding to the second call terminal, the processor is configured to implement the following steps:
    recognizing the first call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal; and
    recognizing the second call audio and the dialogue text data based on a pre-built emotion recognition model to obtain the emotion data of the second caller corresponding to the second call terminal;
    wherein, when recognizing the first call audio and the dialogue text data based on the pre-built emotion recognition model to obtain the emotion data of the first caller corresponding to the first call terminal, the processor is configured to implement the following steps:
    extracting at least one of a volume feature, a speech rate feature, a fluency feature, and a pause feature from the first call audio;
    extracting text features from the dialogue text data; and
    processing, based on the pre-built emotion recognition model, the text features together with at least one of the volume feature, speech rate feature, fluency feature, and pause feature to obtain the emotion data of the first caller corresponding to the first call terminal.
  20. The storage medium according to claim 16, wherein, when generating, according to the type data of the call scene and the emotion data of the first caller, the first prompt information for prompting the first caller to adjust his or her emotions and sending it to the first call terminal, the processor is configured to implement the following steps:
    analyzing, based on a prompt rule engine with built-in prompt rules, the type data of the call scene and the emotion data of the first caller to obtain the corresponding first prompt information, and sending the first prompt information to the first call terminal to prompt the first caller to adjust his or her emotions; or
    generating, based on a pre-trained first prompt model and according to the type data of the call scene, the emotion data of the first caller, and the dialogue text data, the first prompt information for prompting the first caller to adjust his or her emotions, and sending it to the first call terminal.
PCT/CN2019/122167 2019-06-17 2019-11-29 Voice recognition-based communication service method, apparatus, computer device, and storage medium WO2020253128A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201910523567 2019-06-17
CN201910523567.X 2019-06-17
CN201910605732.6 2019-07-05
CN201910605732.6A CN110444229A (en) 2019-06-17 2019-07-05 Communication service method, device, computer equipment and storage medium based on speech recognition

Publications (1)

Publication Number Publication Date
WO2020253128A1 true WO2020253128A1 (en) 2020-12-24

Family

ID=68429455

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122167 WO2020253128A1 (en) 2019-06-17 2019-11-29 Voice recognition-based communication service method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110444229A (en)
WO (1) WO2020253128A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110444229A (en) * 2019-06-17 2019-11-12 深圳壹账通智能科技有限公司 Communication service method, device, computer equipment and storage medium based on speech recognition
CN111309715B (en) * 2020-01-15 2023-04-18 腾讯科技(深圳)有限公司 Call scene identification method and device
CN113316041B (en) * 2020-02-27 2023-08-01 阿里巴巴集团控股有限公司 Remote health detection system, method, device and equipment
CN111580773B (en) * 2020-04-15 2023-11-14 北京小米松果电子有限公司 Information processing method, device and storage medium
CN112995422A (en) * 2021-02-07 2021-06-18 成都薯片科技有限公司 Call control method and device, electronic equipment and storage medium
CN113037610B (en) * 2021-02-25 2022-08-19 腾讯科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium
CN115204127B (en) * 2022-09-19 2023-01-06 深圳市北科瑞声科技股份有限公司 Form filling method, device, equipment and medium based on remote flow adjustment
CN116682414B (en) * 2023-06-06 2024-01-30 安徽迪科数金科技有限公司 Dialect voice recognition system based on big data
CN116631451B (en) * 2023-06-25 2024-02-06 安徽迪科数金科技有限公司 Voice emotion recognition system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270922A1 (en) * 2015-11-18 2017-09-21 Shenzhen Skyworth-Rgb Electronic Co., Ltd. Smart home control method based on emotion recognition and the system thereof
CN108536802A (en) * 2018-03-30 2018-09-14 百度在线网络技术(北京)有限公司 Exchange method based on children's mood and device
CN108962219A (en) * 2018-06-29 2018-12-07 百度在线网络技术(北京)有限公司 Method and apparatus for handling text
CN109587360A (en) * 2018-11-12 2019-04-05 平安科技(深圳)有限公司 Electronic device should talk with art recommended method and computer readable storage medium
CN110444229A (en) * 2019-06-17 2019-11-12 深圳壹账通智能科技有限公司 Communication service method, device, computer equipment and storage medium based on speech recognition

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105991849B (en) * 2015-02-13 2019-03-01 华为技术有限公司 One kind is attended a banquet method of servicing, apparatus and system
US10158758B2 (en) * 2016-11-02 2018-12-18 International Business Machines Corporation System and method for monitoring and visualizing emotions in call center dialogs at call centers
CN107423364B (en) * 2017-06-22 2024-01-26 百度在线网络技术(北京)有限公司 Method, device and storage medium for answering operation broadcasting based on artificial intelligence
CN108922564B (en) * 2018-06-29 2021-05-07 北京百度网讯科技有限公司 Emotion recognition method and device, computer equipment and storage medium
CN108962255B (en) * 2018-06-29 2020-12-08 北京百度网讯科技有限公司 Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN109767791B (en) * 2019-03-21 2021-03-30 中国—东盟信息港股份有限公司 Voice emotion recognition and application system for call center calls


Also Published As

Publication number Publication date
CN110444229A (en) 2019-11-12


Legal Events

121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19933410; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: PCT application non-entry in European phase (Ref document number: 19933410; Country of ref document: EP; Kind code of ref document: A1)

32PN Ep: public notification in the EP bulletin as the address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.08.2022))

122 Ep: PCT application non-entry in European phase (Ref document number: 19933410; Country of ref document: EP; Kind code of ref document: A1)