CN112633172B - Communication optimization method, device, equipment and medium

Info

Publication number
CN112633172B
Authority
CN
China
Prior art keywords
target, emotion, voice, feature, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011545611.6A
Other languages
Chinese (zh)
Other versions
CN112633172A (en)
Inventor
彭钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202011545611.6A
Publication of CN112633172A (application)
Application granted
Publication of CN112633172B (grant)
Legal status: Active

Classifications

    • G06V 40/174: Facial expression recognition
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06V 20/46: Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
    • G10L 25/48: Speech or voice analysis techniques specially adapted for a particular use
    • G10L 25/63: Speech or voice analysis for estimating an emotional state
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

The invention relates to the field of artificial intelligence and provides a communication optimization method, device, equipment and medium. The method jointly judges the emotion recognition result for video and the emotion recognition result for voice, effectively improving the accuracy of emotion judgment. When an abnormal customer emotion is detected, the real-time voice input by the customer is softened, so that customer service personnel are shielded from the customer's emotion, communication between the two parties is smoother, the customer complaint rate is reduced, and both parties receive a better interaction experience. The invention also relates to blockchain technology: the emotion recognition model can be stored in a blockchain node.

Description

Communication optimization method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for optimizing communications.
Background
At present, many fields involve frequent audio and video communication between customers and customer service for consultation or business handling. During such calls, if either party is in a bad mood, the call is likely to become unpleasant, the customer may be left with a bad impression of the enterprise the customer service represents, and the customer service agent may be held responsible accordingly.
To address this, a common approach is to impose a code of conduct on customer service, so that agents, out of professionalism and performance considerations, remain calm and restrained when facing customers with unstable moods. Relying only on the agent's subjective self-restraint, however, carries a certain risk.
In existing schemes that use intelligent recognition to judge emotion and thereby provide the customer service agent with a basis for assessing the customer's mood, the judgment is usually made only from the currently captured user picture. The result is not accurate enough, which affects the agent's subsequent response.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a communication optimization method, device, equipment and medium that jointly judge the emotion recognition result for video and the emotion recognition result for voice, effectively improving the accuracy of emotion judgment, and that, when an abnormal customer emotion is detected, soften the real-time voice input by the customer, so that customer service personnel are shielded from the customer's emotion, communication between the two parties is smoother, the customer complaint rate is reduced, and both parties receive a better interaction experience.
A communication optimization method, the communication optimization method comprising:
responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
carrying out feature interception on the target video to obtain a picture to be detected;
inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training;
carrying out emotion recognition on the target voice to obtain a second emotion type;
when the first emotion type and/or the second emotion type is abnormal, acquiring input voice of the target user in real time by using the target acquisition equipment, and performing optimization processing on the input voice to obtain real-time voice;
and outputting the real-time voice.
According to a preferred embodiment of the present invention, the acquiring a target acquisition device according to the communication optimization instruction includes:
analyzing the method body of the communication optimization instruction to obtain information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identifier, and determining acquisition equipment of the user equipment as the target acquisition equipment.
According to a preferred embodiment of the present invention, the performing feature interception on the target video to obtain the picture to be detected includes:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame of picture to obtain the picture to be detected.
According to a preferred embodiment of the present invention, said determining a first emotion type from the output of said emotion recognition model comprises:
obtaining the predicted emotion of each picture to be detected and the corresponding predicted probability from the output of the emotion recognition model;
obtaining the maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target prediction probability as the first emotion type.
According to a preferred embodiment of the invention, the method further comprises:
acquiring a sample video, and splitting the sample video by a preset time length to obtain at least one sub-video;
performing feature interception on the at least one sub-video to obtain a training sample;
extracting features of the training samples by using a preset residual network to obtain initial features;
inputting the initial features to the full connection layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector with a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector with the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature with a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame feature with a softmax function, and outputting a prediction result and a loss value;
and stopping training when convergence of the loss value is detected, to obtain the emotion recognition model.
According to a preferred embodiment of the present invention, the feature vector is converted based on the first attention weight using the following formula to obtain the initial global frame feature:

$$f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i}$$

where $f'_v$ is the initial global frame feature, $\alpha_i$ is the first attention weight, $f_i$ is the feature vector, $i$ is the index of the frame to which the feature vector belongs, and $n$ is the maximum number of frames;

and the concatenated feature is converted based on the second attention weight using the following formula to obtain the target global frame feature:

$$f_v = \frac{\sum_{i=1}^{n} \beta_i \alpha_i \, [f_i : f'_v]}{\sum_{i=1}^{n} \beta_i \alpha_i}$$

where $f_v$ is the target global frame feature, $\beta_i$ is the second attention weight, and $[f_i : f'_v]$ is the concatenated feature.
According to a preferred embodiment of the present invention, the performing optimization processing on the input voice to obtain real-time voice includes:
noise reduction processing is carried out on the input voice to obtain a first voice;
identifying target sound waves in the first voice, and deleting the target sound waves from the first voice to obtain a second voice;
and carrying out fade-in and fade-out processing on the second voice to obtain the real-time voice.
A communication optimizing apparatus, the communication optimizing apparatus comprising:
the acquisition unit is used for responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
the interception unit is used for performing feature interception on the target video to obtain a picture to be detected;
the recognition unit is used for inputting the picture to be detected into a pre-trained emotion recognition model and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training;
the recognition unit is further used for carrying out emotion recognition on the target voice to obtain a second emotion type;
the optimization unit is used for acquiring the input voice of the target user in real time by using the target acquisition equipment when the first emotion type and/or the second emotion type is abnormal, and performing optimization processing on the input voice to obtain real-time voice;
and the output unit is used for outputting the real-time voice.
An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing the instructions stored in the memory to implement the communication optimization method.
A computer-readable storage medium having stored therein at least one instruction for execution by a processor in an electronic device to implement the communications optimization method.
According to the above technical scheme, the target acquisition equipment can be obtained according to the communication optimization instruction, and the target acquisition equipment is started to acquire the target voice and target video of the target user. Feature interception is performed on the target video to obtain pictures to be detected, the pictures to be detected are input into a pre-trained emotion recognition model, and a first emotion type is determined from the output of the emotion recognition model, wherein the emotion recognition model is obtained by training based on a frame attention mechanism and a residual network. Because the frame attention mechanism integrates time-related sequence features, the features of a video segment can be classified effectively, and performing emotion recognition on video-level features is better at capturing representative facial emotion states. Emotion recognition is also performed on the target voice to obtain a second emotion type. When the first emotion type and/or the second emotion type is abnormal, the target acquisition equipment acquires the input voice of the target user in real time, the input voice is optimized to obtain real-time voice, and the real-time voice is output. By jointly judging the emotion recognition result for video and the emotion recognition result for voice, the accuracy of emotion judgment is effectively improved; when an abnormal customer emotion is detected, the real-time voice input by the customer is softened, so that customer service personnel are shielded from the customer's emotion, communication between the two parties is smoother, the customer complaint rate is reduced, and both parties receive a better interaction experience.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the communication optimization method of the present invention.
FIG. 2 is a functional block diagram of a preferred embodiment of the communication optimizing apparatus of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing the communication optimization method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of the communication optimization method of the present invention. The order of the steps in the flowchart may be changed and some steps may be omitted according to various needs.
The communication optimization method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive Internet protocol television (IPTV), a smart wearable device, and the like.
The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network in which the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
S10, responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user.
In at least one embodiment of the present invention, the communication optimization instruction may be triggered by the customer service agent currently on the call, or may be triggered automatically when the start of an audio/video call is detected; the present invention is not limited in this respect.
In at least one embodiment of the present invention, the acquiring the target acquisition device according to the communication optimization instruction includes:
analyzing the method body of the communication optimization instruction to obtain information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
Constructing a regular expression according to the preset label;
traversing the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identifier, and determining acquisition equipment of the user equipment as the target acquisition equipment.
For example: when a bank customer service agent and a customer interact by video, each party holds a terminal device for the conference; by analyzing the communication optimization instruction, the customer's terminal device is determined to be the user equipment, and the acquisition device of that user equipment is determined to be the target acquisition equipment.
The information carried by the communication optimization instruction may include, but is not limited to: the device identifier, the user name that triggered the communication optimization instruction, and the like.
The communication optimization instruction is essentially code; according to coding convention, the content between { } in the instruction is called the method body.
The preset label can be configured in a user-defined manner and has a one-to-one correspondence with the device identifier. For example: the preset label may be ZID; a regular expression ZID() is then constructed from the preset label, and the traversal is performed with ZID().
Through the embodiment, the device identification can be rapidly determined based on the regular expression and the preset label, and the target acquisition device is further determined by using the device identification.
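For illustration only, the label-based lookup described above could be sketched in Python as follows; the ZID(...) payload format, the function name, and the sample message are assumptions of this sketch, not details fixed by the patent:

```python
import re

def find_target_device_id(instruction_info: str, preset_label: str = "ZID") -> str | None:
    """Scan the information carried by the instruction for '<label>(<id>)'.

    The parenthesised payload format is an assumption for illustration;
    the patent only states that a regular expression is built from a
    preset label and used to traverse the carried information.
    """
    pattern = re.compile(rf"{re.escape(preset_label)}\((?P<device_id>[^)]+)\)")
    match = pattern.search(instruction_info)
    return match.group("device_id") if match else None

# Hypothetical payload in which the device identifier is tagged with ZID(...).
info = "user=alice;ZID(CAM-0042);action=optimize"
print(find_target_device_id(info))  # -> CAM-0042
```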
S11, carrying out feature interception on the target video to obtain a picture to be detected.
Since each target video may include non-facial information that would interfere with feature recognition, the facial features are first intercepted from the video.
Specifically, the step of performing feature interception on the target video to obtain a picture to be detected includes:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame of picture to obtain the picture to be detected.
In this embodiment, because the YOLOv3 network offers high stability and accuracy, using it to intercept facial features effectively removes redundant information from the video and improves the accuracy and efficiency of subsequent emotion recognition.
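A minimal sketch of this step, assuming OpenCV for frame extraction and a YOLOv3 face detector wrapped behind a simple detect() interface (the wrapper interface is an assumption of this sketch, not an API defined by the patent):

```python
import cv2

def extract_face_crops(video_path: str, face_detector) -> list:
    """Split the target video into frame pictures and crop the face
    region of each frame.

    `face_detector` stands in for a YOLOv3 face-detection model; its
    detect(frame) -> (x, y, w, h) or None interface is assumed here.
    """
    pictures_to_detect = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # all frame pictures of the target video consumed
        box = face_detector.detect(frame)
        if box is not None:
            x, y, w, h = box
            pictures_to_detect.append(frame[y:y + h, x:x + w])
    capture.release()
    return pictures_to_detect
```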
S12, inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training.
For example: the output of the emotion recognition model may be: anger, 0.95.
In this embodiment, the determining the first emotion type according to the output of the emotion recognition model includes:
obtaining the predicted emotion of each picture to be detected and the corresponding predicted probability from the output of the emotion recognition model;
obtaining the maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target predicted probability as the first emotion type.
In this way, the final emotion recognition result is determined by integrating all the per-picture recognition results, so that the recognition accuracy is higher.
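A hedged sketch of this selection step, assuming the model output has been collected as (predicted emotion, prediction probability) pairs (the pair format is an assumption for illustration):

```python
def first_emotion_type(model_outputs):
    """model_outputs: (predicted_emotion, predicted_probability) pairs,
    one per picture to be detected."""
    emotion, _probability = max(model_outputs, key=lambda pair: pair[1])
    return emotion  # the predicted emotion with the largest probability

print(first_emotion_type([("neutral", 0.61), ("anger", 0.95), ("happy", 0.40)]))
# -> anger
```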
In at least one embodiment of the invention, the method further comprises:
acquiring a sample video, and splitting the sample video by a preset time length to obtain at least one sub-video;
performing feature interception on the at least one sub-video to obtain a training sample;
extracting features of the training samples by using a preset residual network to obtain initial features;
inputting the initial features to the full connection layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector with a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector with the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature with a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame feature with a softmax function, and outputting a prediction result and a loss value;
and stopping training when convergence of the loss value is detected, to obtain the emotion recognition model.
The preset duration may be configured in a user-defined manner, for example, 10 seconds.
Further, the at least one sub-video may be feature intercepted using a YOLOv3 network.
Still further, the preset residual network may be a Resnet50 network, which is not limited by the present invention.
In this embodiment, when the feature vector is concatenated with the initial global frame feature, lateral concatenation (along the feature dimension) may be adopted.
For example: two 1024 × 1 vectors are concatenated to obtain a 2048 × 1 vector.
In this way, time-related sequence features are integrated based on the frame attention mechanism, so that the features of video segments can be classified effectively, and the trained emotion recognition model achieves higher accuracy.
Specifically, the feature vector is converted based on the first attention weight using the following formula to obtain the initial global frame feature:

$$f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i}$$

where $f'_v$ is the initial global frame feature, $\alpha_i$ is the first attention weight, $f_i$ is the feature vector, $i$ is the index of the frame to which the feature vector belongs, and $n$ is the maximum number of frames;

and the concatenated feature is converted based on the second attention weight using the following formula to obtain the target global frame feature:

$$f_v = \frac{\sum_{i=1}^{n} \beta_i \alpha_i \, [f_i : f'_v]}{\sum_{i=1}^{n} \beta_i \alpha_i}$$

where $f_v$ is the target global frame feature, $\beta_i$ is the second attention weight, and $[f_i : f'_v]$ is the concatenated feature.
In this way, feature normalization is performed multiple times based on the frame attention mechanism, the per-image features are converted into a global video feature, and performing emotion recognition on the video feature is better at capturing representative facial emotion states.
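The two attention steps can be sketched in PyTorch as follows. This is a minimal illustration consistent with the two formulas above; the 2048-dimensional ResNet-50 features, the layer sizes, and the single fully connected classifier are assumptions of this sketch, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Aggregate per-frame features into one video-level feature via the
    two normalized attention steps above (a sketch, layer sizes assumed)."""

    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.fc = nn.Linear(feat_dim, feat_dim)    # produces feature vectors f_i
        self.alpha_fc = nn.Linear(feat_dim, 1)     # scores for first attention weights
        self.beta_fc = nn.Linear(2 * feat_dim, 1)  # scores for second attention weights

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n, feat_dim) initial features from the residual network
        f = self.fc(frames)
        alpha = torch.sigmoid(self.alpha_fc(f))                  # first attention weights
        f_v0 = (alpha * f).sum(1) / alpha.sum(1)                 # initial global frame feature f'_v
        concat = torch.cat([f, f_v0.unsqueeze(1).expand_as(f)], dim=-1)  # [f_i : f'_v]
        beta = torch.sigmoid(self.beta_fc(concat))               # second attention weights
        weights = alpha * beta
        return (weights * concat).sum(1) / weights.sum(1)        # target global frame feature f_v

# One illustrative training step: softmax classification over the global feature.
batch, n_frames, feat_dim, n_classes = 4, 8, 2048, 7
attention = FrameAttention(feat_dim)
classifier = nn.Linear(2 * feat_dim, n_classes)    # concatenation doubles the dimension
features = torch.randn(batch, n_frames, feat_dim)  # e.g., ResNet-50 outputs per frame
labels = torch.randint(0, n_classes, (batch,))
loss = nn.CrossEntropyLoss()(classifier(attention(features)), labels)
loss.backward()  # repeat until the loss value converges
```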
S13, emotion recognition is carried out on the target voice, and a second emotion type is obtained.
In at least one embodiment of the present invention, emotion recognition may be performed on the target voice using a Gaussian Mixture Model (GMM), a Support Vector Machine (SVM), a Hidden Markov Model (HMM), a multi-classifier system, or the like, which is not described in detail herein.
It should be noted that the first emotion type refers to the emotion type recognized by the emotion recognition model, and the second emotion type refers to the emotion type recognized from the target voice. Both the first emotion type and the second emotion type may include anger, happiness, and the like.
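As one hedged illustration of the voice branch (the patent does not fix a specific pipeline), an SVM over mean-pooled MFCC features could look like this; the feature choice, the librosa/scikit-learn stack, and the file names are all assumptions of this sketch:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_stats(wav_path: str) -> np.ndarray:
    """Mean-pooled MFCCs as a fixed-length utterance descriptor
    (feature choice assumed for illustration)."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

# Hypothetical labeled utterances: file paths and emotion labels.
train_paths = ["calm_01.wav", "angry_01.wav"]
train_labels = ["neutral", "anger"]
classifier = SVC(probability=True)
classifier.fit([mfcc_stats(p) for p in train_paths], train_labels)
second_emotion_type = classifier.predict([mfcc_stats("customer_utterance.wav")])[0]
```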
S14, when the first emotion type and/or the second emotion type is abnormal, acquiring the input voice of the target user in real time by using the target acquisition equipment, and optimizing the input voice to obtain real-time voice.
In this embodiment, when the first emotion type and/or the second emotion type is detected as anger, agitation, or the like, the emotion may be determined to be abnormal.
According to the embodiment, the emotion recognition result for the video and the emotion recognition result for the voice are comprehensively judged, and the accuracy of emotion judgment is effectively improved.
In at least one embodiment of the present invention, the performing optimization processing on the input voice to obtain real-time voice includes:
noise reduction processing is carried out on the input voice to obtain a first voice;
identifying target sound waves in the first voice, and deleting the target sound waves from the first voice to obtain a second voice;
and carrying out fade-in and fade-out processing on the second voice to obtain the real-time voice.
Wherein the target sound wave may include an aggressive sound wave or the like, and the invention is not limited herein.
In this way, when an abnormal customer emotion is detected, the real-time voice input by the customer can be softened.
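A minimal sketch of the fade-in and fade-out step on a mono waveform, assuming a linear ramp and a 200 ms fade length (both assumptions; the patent does not specify them). Noise reduction and deletion of target sound waves are presumed to have been applied already:

```python
import numpy as np

def soften(voice: np.ndarray, sr: int, fade_ms: int = 200) -> np.ndarray:
    """Apply a linear fade-in and fade-out to a mono waveform."""
    out = voice.astype(np.float32).copy()
    n = min(len(out), int(sr * fade_ms / 1000))
    ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
    out[:n] *= ramp          # fade in: soften the attack of the utterance
    out[-n:] *= ramp[::-1]   # fade out: avoid an abrupt, harsh ending
    return out

sr = 16000
real_time_voice = soften(np.random.randn(sr), sr)  # 1 s of stand-in audio
```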
S15, outputting the real-time voice.
For example: and outputting the real-time voice on terminal equipment of customer service personnel.
Through the embodiment, customer service personnel can be prevented from being influenced by the emotion of the customer, so that communication between the two parties is smoother, customer complaint rate is further reduced, and better interaction experience is brought to the two parties.
It should be noted that, in order to further ensure the security of the data, the emotion recognition model may be deployed in the blockchain, so as to avoid malicious tampering of the data.
According to the above technical scheme, the target acquisition equipment can be obtained according to the communication optimization instruction, and the target acquisition equipment is started to acquire the target voice and target video of the target user. Feature interception is performed on the target video to obtain pictures to be detected, the pictures to be detected are input into a pre-trained emotion recognition model, and a first emotion type is determined from the output of the emotion recognition model, wherein the emotion recognition model is obtained by training based on a frame attention mechanism and a residual network. Because the frame attention mechanism integrates time-related sequence features, the features of a video segment can be classified effectively, and performing emotion recognition on video-level features is better at capturing representative facial emotion states. Emotion recognition is also performed on the target voice to obtain a second emotion type. When the first emotion type and/or the second emotion type is abnormal, the target acquisition equipment acquires the input voice of the target user in real time, the input voice is optimized to obtain real-time voice, and the real-time voice is output. By jointly judging the emotion recognition result for video and the emotion recognition result for voice, the accuracy of emotion judgment is effectively improved; when an abnormal customer emotion is detected, the real-time voice input by the customer is softened, so that customer service personnel are shielded from the customer's emotion, communication between the two parties is smoother, the customer complaint rate is reduced, and both parties receive a better interaction experience.
FIG. 2 is a functional block diagram of a preferred embodiment of the communication optimizing apparatus of the present invention. The communication optimizing apparatus 11 comprises an acquisition unit 110, an interception unit 111, a recognition unit 112, an optimization unit 113 and an output unit 114. A module/unit referred to in the present invention is a series of computer program segments that are stored in the memory 12, can be executed by the processor 13, and perform a fixed function. In this embodiment, the functions of the respective modules/units are described in detail in the following embodiments.
In response to the communication optimization instruction, the acquisition unit 110 acquires the target acquisition device according to the communication optimization instruction, and starts the target acquisition device to acquire the target voice and the target video of the target user.
In at least one embodiment of the present invention, the communication optimization instruction may be triggered by the customer service agent currently on the call, or may be triggered automatically when the start of an audio/video call is detected; the present invention is not limited in this respect.
In at least one embodiment of the present invention, the acquisition unit 110 acquires a target acquisition device according to the communication optimization instruction, including:
analyzing the method body of the communication optimization instruction to obtain information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identifier, and determining acquisition equipment of the user equipment as the target acquisition equipment.
For example: when a bank customer service agent and a customer interact by video, each party holds a terminal device for the conference; by analyzing the communication optimization instruction, the customer's terminal device is determined to be the user equipment, and the acquisition device of that user equipment is determined to be the target acquisition equipment.
The information carried by the communication optimization instruction may include, but is not limited to: the device identifier, the user name that triggered the communication optimization instruction, and the like.
The communication optimization instruction is essentially code; according to coding convention, the content between { } in the instruction is called the method body.
The preset label can be configured in a user-defined manner and has a one-to-one correspondence with the device identifier. For example: the preset label may be ZID; a regular expression ZID() is then constructed from the preset label, and the traversal is performed with ZID().
Through the embodiment, the device identification can be rapidly determined based on the regular expression and the preset label, and the target acquisition device is further determined by using the device identification.
The interception unit 111 performs feature interception on the target video to obtain a picture to be detected.
Since each target video may include non-facial information that would interfere with feature recognition, the facial features are first intercepted from the video.
Specifically, the interception unit 111 performing feature interception on the target video to obtain the picture to be detected includes:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame of picture to obtain the picture to be detected.
In this embodiment, because the YOLOv3 network offers high stability and accuracy, using it to intercept facial features effectively removes redundant information from the video and improves the accuracy and efficiency of subsequent emotion recognition.
The recognition unit 112 inputs the picture to be detected to a pre-trained emotion recognition model, and determines a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training.
For example: the output of the emotion recognition model may be: anger, 0.95.
In this embodiment, the recognition unit 112 determining the first emotion type according to the output of the emotion recognition model includes:
obtaining the predicted emotion of each picture to be detected and the corresponding predicted probability from the output of the emotion recognition model;
obtaining the maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target predicted probability as the first emotion type.
In this way, the final emotion recognition result is determined by integrating all the per-picture recognition results, so that the recognition accuracy is higher.
In at least one embodiment of the invention, a sample video is obtained and split by a preset time length to obtain at least one sub-video;
performing feature interception on the at least one sub-video to obtain a training sample;
extracting features of the training samples by using a preset residual network to obtain initial features;
inputting the initial features to the full connection layer corresponding to each color channel, and outputting feature vectors;
processing the feature vector with a first sigmoid function to obtain a first attention weight;
converting the feature vector based on the first attention weight to obtain an initial global frame feature;
concatenating the feature vector with the initial global frame feature to obtain a concatenated feature;
processing the concatenated feature with a second sigmoid function to obtain a second attention weight;
converting the concatenated feature based on the second attention weight to obtain a target global frame feature;
processing the target global frame feature with a softmax function, and outputting a prediction result and a loss value;
and stopping training when convergence of the loss value is detected, to obtain the emotion recognition model.
The preset duration may be configured in a user-defined manner, for example, 10 seconds.
Further, the at least one sub-video may be feature intercepted using a YOLOv3 network.
Still further, the preset residual network may be a Resnet50 network, which is not limited by the present invention.
In this embodiment, when the feature vector is concatenated with the initial global frame feature, lateral concatenation (along the feature dimension) may be adopted.
For example: two 1024 × 1 vectors are concatenated to obtain a 2048 × 1 vector.
In this way, time-related sequence features are integrated based on the frame attention mechanism, so that the features of video segments can be classified effectively, and the trained emotion recognition model achieves higher accuracy.
Specifically, the feature vector is converted based on the first attention weight using the following formula to obtain the initial global frame feature:

$$f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i}$$

where $f'_v$ is the initial global frame feature, $\alpha_i$ is the first attention weight, $f_i$ is the feature vector, $i$ is the index of the frame to which the feature vector belongs, and $n$ is the maximum number of frames;

and the concatenated feature is converted based on the second attention weight using the following formula to obtain the target global frame feature:

$$f_v = \frac{\sum_{i=1}^{n} \beta_i \alpha_i \, [f_i : f'_v]}{\sum_{i=1}^{n} \beta_i \alpha_i}$$

where $f_v$ is the target global frame feature, $\beta_i$ is the second attention weight, and $[f_i : f'_v]$ is the concatenated feature.
In this way, feature normalization is performed multiple times based on the frame attention mechanism, the per-image features are converted into a global video feature, and performing emotion recognition on the video feature is better at capturing representative facial emotion states.
The recognition unit 112 performs emotion recognition on the target voice to obtain a second emotion type.
In at least one embodiment of the present invention, emotion recognition may be performed on the target voice using a Gaussian Mixture Model (GMM), a Support Vector Machine (SVM), a Hidden Markov Model (HMM), a multi-classifier system, or the like, which is not described in detail herein.
It should be noted that the first emotion type refers to the emotion type recognized by the emotion recognition model, and the second emotion type refers to the emotion type recognized from the target voice. Both the first emotion type and the second emotion type may include anger, happiness, and the like.
When the first emotion type and/or the second emotion type is abnormal, the optimization unit 113 uses the target acquisition equipment to collect the input voice of the target user in real time, and performs optimization processing on the input voice to obtain real-time voice.
In this embodiment, when the first emotion type and/or the second emotion type is detected as anger, agitation, or the like, the emotion may be determined to be abnormal.
According to the embodiment, the emotion recognition result for the video and the emotion recognition result for the voice are comprehensively judged, and the accuracy of emotion judgment is effectively improved.
In at least one embodiment of the present invention, the optimization unit 113 performing optimization processing on the input voice to obtain real-time voice includes:
noise reduction processing is carried out on the input voice to obtain a first voice;
identifying target sound waves in the first voice, and deleting the target sound waves from the first voice to obtain a second voice;
and carrying out fade-in and fade-out processing on the second voice to obtain the real-time voice.
Wherein the target sound wave may include an aggressive sound wave or the like, and the invention is not limited herein.
In this way, when an abnormal customer emotion is detected, the real-time voice input by the customer can be softened.
The output unit 114 outputs the real-time voice.
For example: and outputting the real-time voice on terminal equipment of customer service personnel.
Through the embodiment, customer service personnel can be prevented from being influenced by the emotion of the customer, so that communication between the two parties is smoother, customer complaint rate is further reduced, and better interaction experience is brought to the two parties.
It should be noted that, in order to further ensure the security of the data, the emotion recognition model may be deployed in the blockchain, so as to avoid malicious tampering of the data.
According to the above technical scheme, the target acquisition equipment can be obtained according to the communication optimization instruction, and the target acquisition equipment is started to acquire the target voice and target video of the target user. Feature interception is performed on the target video to obtain pictures to be detected, the pictures to be detected are input into a pre-trained emotion recognition model, and a first emotion type is determined from the output of the emotion recognition model, wherein the emotion recognition model is obtained by training based on a frame attention mechanism and a residual network. Because the frame attention mechanism integrates time-related sequence features, the features of a video segment can be classified effectively, and performing emotion recognition on video-level features is better at capturing representative facial emotion states. Emotion recognition is also performed on the target voice to obtain a second emotion type. When the first emotion type and/or the second emotion type is abnormal, the target acquisition equipment acquires the input voice of the target user in real time, the input voice is optimized to obtain real-time voice, and the real-time voice is output. By jointly judging the emotion recognition result for video and the emotion recognition result for voice, the accuracy of emotion judgment is effectively improved; when an abnormal customer emotion is detected, the real-time voice input by the customer is softened, so that customer service personnel are shielded from the customer's emotion, communication between the two parties is smoother, the customer complaint rate is reduced, and both parties receive a better interaction experience.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing the communication optimization method.
The electronic device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a communication optimization program, stored in the memory 12 and executable on the processor 13.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation of the electronic device 1. The electronic device 1 may have a bus-type or star-type structure, may comprise more or fewer hardware or software components than illustrated, or a different arrangement of components; for example, the electronic device 1 may further comprise an input/output device, a network access device, and the like.
It should be noted that the electronic device 1 is only an example; other existing or hereafter-developed electronic products that can be adapted to the present invention are also included in the scope of protection of the present invention and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, including flash memory, removable hard disks, multimedia cards, card-type memories (e.g., SD or DX memory), magnetic memories, magnetic disks, optical disks, and the like. In some embodiments, the memory 12 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the memory 12 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the electronic device 1. Further, the memory 12 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 12 can be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the communication optimization program, but also for temporarily storing data that has been output or is to be output.
In some embodiments, the processor 13 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 13 is the control unit of the electronic device 1: it connects the components of the entire electronic device 1 using various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 12 (for example, executing the communication optimization program) and calling the data stored in the memory 12.
The processor 13 executes the operating system of the electronic device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps of the various communication optimization method embodiments described above, such as the steps shown in fig. 1.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 12 and executed by the processor 13 to complete the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specified functions, the instruction segments describing the execution of the computer program in the electronic device 1. For example, the computer program may be divided into the acquisition unit 110, the interception unit 111, the recognition unit 112, the optimization unit 113 and the output unit 114.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a computer device, or a network device, etc.) or a processor (processor) to perform portions of the communications optimization method according to the embodiments of the present invention.
The integrated modules/units of the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on this understanding, the present invention may also be implemented by a computer program for instructing a relevant hardware device to implement all or part of the procedures of the above-mentioned embodiment method, where the computer program may be stored in a computer readable storage medium and the computer program may be executed by a processor to implement the steps of each of the above-mentioned method embodiments.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory, and the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in FIG. 3, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable communication connections between the memory 12, the at least one processor 13, and the like.
Although not shown, the electronic device 1 may further comprise a power source (such as a battery) for powering the various components. Preferably, the power source may be logically connected to the at least one processor 13 via a power management means, so that functions such as charge management, discharge management, and power consumption management are performed via the power management means. The power source may also include one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the described embodiments are for illustration only, and the scope of the patent application is not limited to this configuration.
Fig. 3 shows only an electronic device 1 with components 12-13, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or a different arrangement of components.
In connection with FIG. 1, the memory 12 in the electronic device 1 stores a plurality of instructions for implementing a communication optimization method, and the processor 13 can execute the plurality of instructions to implement:
responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
carrying out feature interception on the target video to obtain a picture to be detected;
inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training;
carrying out emotion recognition on the target voice to obtain a second emotion type;
when the first emotion type and/or the second emotion type is abnormal, acquiring input voice of the target user in real time by using the target acquisition equipment, and performing optimization processing on the input voice to obtain real-time voice;
and outputting the real-time voice.
Specifically, the specific implementation method of the above instructions by the processor 13 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated unit can be realized in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. Multiple units or means set forth in the system embodiments may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not imply any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (7)

1. A communication optimization method, characterized in that the communication optimization method comprises:
responding to a communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
carrying out feature interception on the target video to obtain a picture to be detected;
inputting the picture to be detected into a pre-trained emotion recognition model, and determining a first emotion type according to the output of the emotion recognition model, wherein the emotion recognition model is obtained based on a frame attention mechanism and residual network training;
carrying out emotion recognition on the target voice to obtain a second emotion type;
when the first emotion type and/or the second emotion type is abnormal, acquiring input voice of the target user in real time by using the target acquisition equipment, and performing optimization processing on the input voice to obtain real-time voice, wherein the obtaining of the real-time voice comprises: performing noise reduction processing on the input voice to obtain a first voice; identifying target sound waves in the first voice, and deleting the target sound waves from the first voice to obtain a second voice; and performing fade-in and fade-out processing on the second voice to obtain the real-time voice;
and outputting the real-time voice;
wherein the method further comprises: acquiring a sample video, and splitting the sample video by a preset time length to obtain at least one sub-video; performing feature interception on the at least one sub-video to obtain training samples; extracting features of the training samples by using a preset residual network to obtain initial features; inputting the initial features to the fully connected layer corresponding to each color channel, and outputting feature vectors; processing the feature vectors by a first sigmoid function to obtain first attention weights; converting the feature vectors based on the first attention weights to obtain an initial global frame feature; concatenating the feature vectors with the initial global frame feature to obtain concatenated features; processing the concatenated features by a second sigmoid function to obtain second attention weights; converting the concatenated features based on the second attention weights to obtain a target global frame feature; processing the target global frame feature by a softmax function, and outputting a prediction result and a loss value; and stopping training when convergence of the loss value is detected, thereby obtaining the emotion recognition model;
The feature vectors are converted based on the first attention weights by adopting the following formula to obtain the initial global frame feature:

f'_v = \frac{\sum_{i=1}^{n} \alpha_i f_i}{\sum_{i=1}^{n} \alpha_i}

wherein f'_v is the initial global frame feature, \alpha_i is the first attention weight, f_i is the feature vector, i is the index of the frame to which the feature vector belongs, and n is the maximum number of frames;

and the concatenated features are converted based on the second attention weights by adopting the following formula to obtain the target global frame feature:

f_v = \frac{\sum_{i=1}^{n} \beta_i c_i}{\sum_{i=1}^{n} \beta_i}

wherein f_v is the target global frame feature, \beta_i is the second attention weight, and c_i is the concatenated feature.
2. The communication optimization method as claimed in claim 1, wherein the acquiring the target acquisition equipment according to the communication optimization instruction comprises:
analyzing the method body of the communication optimization instruction to obtain information carried by the communication optimization instruction;
acquiring a preset label corresponding to the equipment identifier;
constructing a regular expression according to the preset label;
traversing the information carried by the communication optimization instruction by using the regular expression, and determining the traversed data as a target equipment identifier;
and determining user equipment according to the target equipment identifier, and determining acquisition equipment of the user equipment as the target acquisition equipment.
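As a minimal sketch of this claim, assuming the instruction carries its payload as plain text and that the preset label is, for illustration, `device_id` (both the label and the payload format are assumptions):

```python
import re
from typing import Optional

def extract_device_id(payload: str, preset_label: str = "device_id") -> Optional[str]:
    """Construct a regular expression from the preset label and traverse
    the instruction payload for the target equipment identifier."""
    pattern = re.compile(rf"{re.escape(preset_label)}\s*[=:]\s*([A-Za-z0-9_-]+)")
    match = pattern.search(payload)
    return match.group(1) if match else None

# Usage with a hypothetical payload format:
print(extract_device_id("user=alice;device_id=CAM_017;mode=video"))  # -> CAM_017
```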
3. The communication optimization method according to claim 1, wherein the performing feature interception on the target video to obtain the picture to be detected comprises:
acquiring all frame pictures contained in the target video;
inputting each frame picture in all the frame pictures into a YOLOv3 network for identification to obtain a face area of each frame picture;
and intercepting the face area of each frame of picture to obtain the picture to be detected.
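A sketch of the per-frame face interception under stated assumptions: OpenCV is used to read frames, and the YOLOv3 detector is stood in for by a hypothetical `detect_face` helper, since the claim does not specify weights or a framework:

```python
import cv2  # OpenCV, assumed available

def detect_face(frame):
    """Hypothetical stand-in for a YOLOv3 face detector: would run the
    network on the frame, apply non-maximum suppression, and return one
    bounding box (x, y, w, h), or None if no face is found."""
    return None  # placeholder

def intercept_faces(video_path: str) -> list:
    """Read every frame picture of the target video and crop the detected
    face area to form the pictures to be detected."""
    capture = cv2.VideoCapture(video_path)
    pictures = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        box = detect_face(frame)
        if box is not None:
            x, y, w, h = box
            pictures.append(frame[y:y + h, x:x + w])
    capture.release()
    return pictures
```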
4. The communication optimization method of claim 1, wherein the determining a first emotion type according to the output of the emotion recognition model comprises:
obtaining the predicted emotion of each picture to be detected and the corresponding prediction probability from the output of the emotion recognition model;
obtaining the maximum prediction probability from the prediction probabilities as a target prediction probability;
and acquiring the predicted emotion corresponding to the target predicted probability as the first emotion type.
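A one-function sketch of this selection rule, assuming the model output has been collected as (predicted emotion, prediction probability) pairs, one per picture to be detected (the pair format is an assumption):

```python
def first_emotion_type(predictions: list[tuple[str, float]]) -> str:
    """Return the predicted emotion whose prediction probability is the
    largest across all pictures to be detected."""
    emotion, _ = max(predictions, key=lambda pair: pair[1])
    return emotion

# Usage with hypothetical model outputs:
print(first_emotion_type([("neutral", 0.41), ("angry", 0.87), ("happy", 0.30)]))  # -> angry
```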
5. A communication optimization apparatus, wherein the communication optimization apparatus includes modules implementing the communication optimization method according to any one of claims 1 to 4, the communication optimization apparatus comprising:
the acquisition unit is used for responding to the communication optimization instruction, acquiring target acquisition equipment according to the communication optimization instruction, and starting the target acquisition equipment to acquire target voice and target video of a target user;
the intercepting unit is used for performing feature interception on the target video to obtain a picture to be detected;
the identification unit is used for inputting the picture to be detected into a pre-trained emotion identification model and determining a first emotion type according to the output of the emotion identification model, wherein the emotion identification model is obtained based on a frame attention mechanism and residual network training;
the recognition unit is further used for carrying out emotion recognition on the target voice to obtain a second emotion type;
the optimizing unit is used for acquiring the input voice of the target user in real time by using the target acquisition equipment when the first emotion type and/or the second emotion type is abnormal, and optimizing the input voice to obtain real-time voice;
and the output unit is used for outputting the real-time voice.
6. An electronic device, the electronic device comprising:
a memory storing at least one instruction; and
a processor executing the instructions stored in the memory to implement the communication optimization method of any one of claims 1 to 4.
7. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction, and the at least one instruction is executed by a processor in an electronic device to implement the communication optimization method of any one of claims 1 to 4.
CN202011545611.6A 2020-12-23 2020-12-23 Communication optimization method, device, equipment and medium Active CN112633172B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011545611.6A CN112633172B (en) 2020-12-23 2020-12-23 Communication optimization method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN112633172A (en) 2021-04-09
CN112633172B (en) 2023-11-14

Family

ID=75324289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011545611.6A Active CN112633172B (en) 2020-12-23 2020-12-23 Communication optimization method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112633172B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962255A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Emotion identification method, apparatus, server and the storage medium of voice conversation
CN111276162A (en) * 2020-01-14 2020-06-12 林泽珊 Hearing aid-based voice output optimization method, server and storage medium
CN111368609A (en) * 2018-12-26 2020-07-03 深圳Tcl新技术有限公司 Voice interaction method based on emotion engine technology, intelligent terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214214B2 (en) * 2004-12-03 2012-07-03 Phoenix Solutions, Inc. Emotion detection device and method for use in distributed systems
US10628741B2 (en) * 2010-06-07 2020-04-21 Affectiva, Inc. Multimodal machine learning for emotion metrics
CN105334743B (en) * 2015-11-18 2018-10-26 深圳创维-Rgb电子有限公司 A kind of intelligent home furnishing control method and its system based on emotion recognition
CN108197115B (en) * 2018-01-26 2022-04-22 上海智臻智能网络科技股份有限公司 Intelligent interaction method and device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112633172A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN111488433B (en) Artificial intelligence interactive system suitable for bank and capable of improving field experience
US11436863B2 (en) Method and apparatus for outputting data
WO2021232594A1 (en) Speech emotion recognition method and apparatus, electronic device, and storage medium
US11315366B2 (en) Conference recording method and data processing device employing the same
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN112001175B (en) Flow automation method, device, electronic equipment and storage medium
CN108920640B (en) Context obtaining method and device based on voice interaction
US11822568B2 (en) Data processing method, electronic equipment and storage medium
JP2021034003A (en) Human object recognition method, apparatus, electronic device, storage medium, and program
CN113343824A (en) Double-recording quality inspection method, device, equipment and medium
CN113345431B (en) Cross-language voice conversion method, device, equipment and medium
CN113205814B (en) Voice data labeling method and device, electronic equipment and storage medium
CN111814467A (en) Label establishing method, device, electronic equipment and medium for prompting call collection
CN112542172A (en) Communication auxiliary method, device, equipment and medium based on online conference
CN112201253B (en) Text marking method, text marking device, electronic equipment and computer readable storage medium
CN112528265A (en) Identity recognition method, device, equipment and medium based on online conference
CN112633172B (en) Communication optimization method, device, equipment and medium
CN113408265B (en) Semantic analysis method, device and equipment based on human-computer interaction and storage medium
CN112786041B (en) Voice processing method and related equipment
US11539915B2 (en) Transmission confirmation in a remote conference
CN111859985B (en) AI customer service model test method and device, electronic equipment and storage medium
CN114401346A (en) Response method, device, equipment and medium based on artificial intelligence
CN114548114A (en) Text emotion recognition method, device, equipment and storage medium
CN112101191A (en) Expression recognition method, device, equipment and medium based on frame attention network
CN112633170B (en) Communication optimization method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant