WO2021068843A1 - Emotion recognition method and device, electronic equipment, and readable storage medium - Google Patents

Emotion recognition method and device, electronic equipment, and readable storage medium

Info

Publication number
WO2021068843A1
WO2021068843A1 (PCT/CN2020/119487)
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
emotion
text
feature
Prior art date
Application number
PCT/CN2020/119487
Other languages
English (en)
French (fr)
Inventor
方豪
占小杰
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021068843A1 publication Critical patent/WO2021068843A1/zh

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M3/00 - Automatic or semi-automatic exchanges
    • H04M3/42 - Systems providing special services or facilities to subscribers
    • H04M3/50 - Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/51 - Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5175 - Call or contact centers supervision arrangements

Definitions

  • the present invention belongs to the field of data recognition and processing, and more specifically, relates to an emotion recognition method and device, electronic equipment and readable storage medium.
  • the call center system refers to an operating system that uses modern communication and computer technology to automatically and flexibly handle a large number of different telephone inbound/outbound services to achieve service operations.
  • Artificial intelligence can be used to mine customer service call data and to track and monitor, in a timely and effective way, the emotional state of customer service staff and customers during calls, which is of great significance for enterprises to improve their service quality. At present, most companies mainly rely on hiring specialized quality inspectors to sample and monitor call recordings to achieve this goal. The applicant found that, on the one hand, this brings additional cost to the company; on the other hand, the uncertainty of sampling coverage and the subjectivity of human judgment mean that manual quality inspection has certain limitations.
  • quality inspectors can only evaluate the emotional performance of the customer service staff and the customer after the recording is completed; it is difficult to monitor the emotional state of the customer service staff and the customer in real time during the call, and when a customer expresses very negative emotions, there is no way to promptly and effectively alert the customer service staff.
  • the present invention provides an emotion recognition method and device, electronic equipment, and readable storage medium.
  • the first aspect of the present invention provides an emotion recognition method, including:
  • collecting a voice signal;
  • processing the voice signal to obtain voice recognition information and text recognition information;
  • performing voice emotion recognition and text emotion recognition on the voice recognition information and text recognition information to obtain voice emotion recognition information and text emotion recognition information;
  • calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
  • the second aspect of the present application provides an emotion recognition device, including:
  • a collection module, configured to collect a voice signal;
  • a processing module, configured to process the voice signal to obtain voice recognition information and text recognition information;
  • a recognition module, configured to perform voice emotion recognition and text emotion recognition on the voice recognition information and text recognition information to obtain voice emotion recognition information and text emotion recognition information;
  • a calculation module, configured to calculate the voice emotion recognition information and text emotion recognition information according to preset calculation rules to obtain the emotion information of the voice signal.
  • a third aspect of the present invention provides an electronic device, including a memory and a processor; the memory includes an emotion recognition method program, and when the emotion recognition method program is executed by the processor, the steps of the emotion recognition method described above are implemented.
  • the fourth aspect of the present invention provides a computer-readable storage medium; the computer-readable storage medium includes an emotion recognition method program, and when the emotion recognition method program is executed by a processor, the steps of the emotion recognition method described above are implemented.
  • the emotion recognition method, system and readable storage medium provided by the present invention perform emotion recognition by extracting speech and text from the speech signal, thereby improving the accuracy of emotion recognition. Through the screening of voice and text information, the efficiency and accuracy of processing are improved.
  • the present invention provides a concrete and effective solution for recognizing negative emotions in the customer service call center scenario, and plays an active and important role in improving customer service quality and providing a reference standard for performance evaluation of service personnel. For different application scenarios, the results of the voice and text emotion models are fused to meet the actual requirements of the business.
  • Figure 1 shows a flow chart of an emotion recognition method of the present invention
  • Figure 2 shows a flow chart of voice information processing according to the present invention
  • Figure 3 shows a flow chart of speech emotion recognition of the present invention
  • Figure 4 shows a flow chart of text emotion recognition of the present invention
  • Fig. 5 shows a block diagram of an emotion recognition system of the present invention.
  • Fig. 1 shows a flowchart of an emotion recognition method of the present invention.
  • the present invention discloses an emotion recognition method, including:
  • S102 Collect a voice signal;
  • S104 Process the voice signal to obtain voice recognition information and text recognition information;
  • S106 Perform voice emotion recognition and text emotion recognition on the voice recognition information and text recognition information to obtain voice emotion recognition information and text emotion recognition information;
  • S108 Calculate the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
  • it should be noted that the voice signal of the customer service representative or agent is collected in real time during the call.
  • the voice signal can be collected by sampling or with a fixed time window. For example, with sampling, speech from seconds 5-7, 9-11, and so on of the call is collected; with a fixed time window, speech from seconds 10-25 of the call is collected.
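  • purely as an illustration (not part of the original disclosure), a minimal Python sketch of fixed-window and sampled collection could look like the following; the 16 kHz sample rate and the function names are assumptions:

```python
# Illustrative sketch: slicing a call recording by a fixed time window or by
# sampled sub-windows. The 16 kHz sample rate and NumPy input are assumptions.
import numpy as np

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz

def collect_fixed_window(signal: np.ndarray, start_s: float, end_s: float) -> np.ndarray:
    """Return the samples between start_s and end_s seconds of the call."""
    return signal[int(start_s * SAMPLE_RATE):int(end_s * SAMPLE_RATE)]

def collect_sampled_windows(signal: np.ndarray, windows: list) -> list:
    """Return one slice per (start, end) pair, e.g. [(5, 7), (9, 11)]."""
    return [collect_fixed_window(signal, s, e) for s, e in windows]
```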
  • Those skilled in the art can choose the collection method according to actual needs, but any method of using the present invention for voice collection to judge emotions will fall into the protection scope of the present invention.
  • the voice signal is processed to obtain voice recognition information and text recognition information.
  • the voice recognition information is used to obtain emotion information through voice emotion recognition
  • the text recognition information is used to obtain emotion information through text emotion recognition.
  • the emotional information obtained by each different recognition method may not be the same, so in the end, it is necessary to comprehensively process the emotional information obtained by the two to obtain the emotional information. Through the comprehensive processing of the two recognition results, the accuracy of emotion recognition can be guaranteed.
  • Figure 2 shows a flow chart of voice information processing according to the present invention.
  • the processing the voice signal to obtain voice recognition information includes:
  • S202 Segment the voice signal into multiple pieces of sub-voice information;
  • S204 Extract feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
  • S206 Collect statistics on the feature information in each piece of sub-voice information, and match the feature information against multiple preset feature statistics;
  • S208 Record the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
  • S210 Calculate the feature amount matching degree of each piece of sub-voice information from the feature information set that matches the multiple feature statistics and the total set of feature information of the sub-voice information;
  • S212 Determine the sub-voice information whose feature amount matching degree is greater than a preset feature amount threshold as voice recognition information.
  • after the voice signal is collected, it is divided into multiple pieces of sub-voice information. The division may be based on time or quantity, or performed according to other rules.
  • for example, a collected 15-second voice signal is divided into 3-second segments of sub-voice information, 5 segments in total, in chronological order: the first 3 seconds form one segment, seconds 3-6 form the next, and so on.
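  • as a purely illustrative sketch (not part of the original disclosure), the chronological 3-second segmentation above could be written as follows; the 16 kHz sample rate is an assumption:

```python
# Illustrative sketch: splitting a 15-second signal into 3-second sub-voice
# segments in chronological order (5 segments in total).
import numpy as np

SAMPLE_RATE = 16_000   # assumed
SEGMENT_SECONDS = 3    # segment length used in the example above

def split_into_segments(signal: np.ndarray, segment_seconds: int = SEGMENT_SECONDS) -> list:
    step = segment_seconds * SAMPLE_RATE
    return [signal[i:i + step] for i in range(0, len(signal), step)]

fifteen_seconds = np.zeros(15 * SAMPLE_RATE)
assert len(split_into_segments(fifteen_seconds)) == 5   # first 3 s, seconds 3-6, and so on
```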
  • the feature information of the sub-voice information is extracted and matched with multiple feature statistical information in the preset voice library.
  • the voice feature statistics information is pre-stored in the background database.
  • the voice feature statistics are words or sentences that, after screening and confirmation, better reflect emotions; they can be resources confirmed through experience and research. For example, useless tokens such as numbers, mathematical characters, punctuation marks, and extremely frequent Chinese characters are excluded from the feature statistics, while frequently used words or phrases that reflect emotional characteristics are included, such as "hello", "goodbye", and "no", or phrases like "is there anything else" and "let's leave it at that".
  • the feature amount matching degree of each piece of sub-voice information is then calculated. It should be noted that the more the sub-voice information overlaps with the preset feature statistics, the higher the matching degree.
  • the sub-voice information whose matching degree is greater than the preset feature amount threshold is determined as the recognized voice information.
  • the preset feature amount threshold can be 0.5, 0.7, and so on; that is, when the matching degree is greater than 0.5, the piece of sub-voice information is selected as recognized voice information. With this step, voice data with a low matching degree can be filtered out, improving the speed and efficiency of emotion recognition.
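  • as an illustrative sketch only, one plausible reading of the feature amount matching degree (the share of a segment's features that also appear in the preset feature statistics) and the threshold filter could look like this; the tiny feature library is an assumed example:

```python
# Illustrative sketch: compute a matching degree per sub-voice segment and
# keep only segments above the preset feature amount threshold.
FEATURE_STATISTICS = {"hello", "goodbye", "no", "is there anything else"}  # assumed preset library
THRESHOLD = 0.5  # preset feature amount threshold from the example

def matching_degree(segment_features: set) -> float:
    if not segment_features:
        return 0.0
    return len(segment_features & FEATURE_STATISTICS) / len(segment_features)

def filter_segments(segments_features: list, threshold: float = THRESHOLD) -> list:
    """Keep only segments whose matching degree is greater than the threshold."""
    return [f for f in segments_features if matching_degree(f) > threshold]
```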
  • Fig. 3 shows a flowchart of speech emotion recognition of the present invention. As shown in FIG. 3, according to the embodiment of the present invention, performing voice emotion recognition on the voice recognition information is specifically:
  • S302 Extract feature information of the voice recognition information;
  • S304 Match the feature information with an emotion training model to obtain a probability value for each different emotion;
  • S306 Select the emotion corresponding to a probability value greater than a preset emotion threshold to obtain the voice emotion recognition information of the voice signal.
  • the emotion training model is from the speech emotion database (Berlin emotion database).
  • the model is machine learning and can be used to classify the characteristic information representing emotions.
  • this voice database contains seven emotions: anger, boredom, disgust, fear, joy, neutral, and sadness; the voice signals consist of sentences corresponding to the seven emotions, each demonstrated by a number of professional actors.
  • it is worth noting that the present invention does not limit the types of emotions to be recognized; in another embodiment, the voice database may further include emotions other than the above seven. For example, in an exemplary embodiment of the present invention, 535 relatively complete, high-quality sentences are selected from the 700 recorded sentences as the data for training the voice emotion classification model.
  • the probability value of each different emotion will be obtained, and the probability value greater than the preset emotion threshold value will be selected as the corresponding emotion.
  • the probability value of the preset emotion threshold can be set by those skilled in the art according to actual needs and experience. For example, the probability value can be set to 70%, and emotions greater than 70% are determined as the final emotion recognition information.
  • if there are multiple probability values greater than the preset emotion threshold, the emotion corresponding to the average probability value of the multiple probability values is selected as the voice emotion recognition information of the voice signal.
  • it is worth mentioning that if multiple emotions exceed the threshold, for example the probability of anger is 80% and the probability of disgust is 75%, both greater than the 70% threshold, the one with the largest probability value is selected as the final emotion.
  • the present invention does not limit the specific implementation of selecting emotions by probability value; in other embodiments, other approaches can be chosen, for example averaging the emotion probability values recognized from multiple pieces of sub-voice information and determining the emotion with the highest probability as the final emotion.
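  • as an illustrative sketch only, the 70% threshold and the keep-the-largest rule from the example above could be implemented as follows:

```python
# Illustrative sketch: select the voice emotion from per-emotion probabilities;
# when several emotions exceed the threshold, the highest probability wins.
from typing import Optional

EMOTION_THRESHOLD = 0.70  # preset emotion threshold from the example

def select_emotion(probabilities: dict, threshold: float = EMOTION_THRESHOLD) -> Optional[str]:
    candidates = {e: p for e, p in probabilities.items() if p > threshold}
    if not candidates:
        return None                          # no emotion passes the threshold
    return max(candidates, key=candidates.get)

# e.g. anger 0.80 and disgust 0.75 both exceed 0.70, so "anger" is selected
print(select_emotion({"anger": 0.80, "disgust": 0.75, "joy": 0.10}))
```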
  • Fig. 4 schematically shows a flow chart of text emotion recognition.
  • performing text emotion recognition on text recognition information includes:
  • S402 Perform feature extraction on the text recognition information to generate multiple feature vectors;
  • S404 Perform text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector;
  • S406 Take a value for the classification result of each feature vector;
  • S408 Calculate the emotion value corresponding to the text recognition information from the taken values;
  • S410 Use the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
  • the feature extraction on the text recognition information to generate multiple feature vectors includes: according to a pre-established keyword dictionary containing N keywords, calculating, for the text recognition information, the TF-IDF value corresponding to each keyword in the dictionary; and generating the corresponding feature vector from the TF-IDF values of the keywords.
  • the keyword dictionary mentioned here is extracted for the above-mentioned tested text set.
  • the dimension of the feature vector can be greatly reduced, thereby improving the efficiency of emotion classification.
  • the dimension of the feature vector is N, and the components in each dimension of the feature vector are the TF-IDF values corresponding to each keyword in the keyword dictionary.
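  • as an illustrative sketch only (the keyword dictionary and background corpus below are assumed examples, not the patent's data), the N-dimensional TF-IDF feature vector could be computed as follows:

```python
# Illustrative sketch: an N-dimensional vector whose components are the TF-IDF
# values of the keywords in a pre-built keyword dictionary.
import math
from collections import Counter

KEYWORD_DICTIONARY = ["hello", "goodbye", "no", "anything"]   # N = 4, assumed
CORPUS = [["hello", "anything", "else"], ["goodbye", "no"]]   # assumed background documents

def tf_idf_vector(tokens: list) -> list:
    counts = Counter(tokens)
    vector = []
    for word in KEYWORD_DICTIONARY:
        tf = counts[word] / len(tokens) if tokens else 0.0
        doc_freq = sum(1 for doc in CORPUS if word in doc)
        idf = math.log((1 + len(CORPUS)) / (1 + doc_freq)) + 1   # smoothed IDF
        vector.append(tf * idf)
    return vector

print(tf_idf_vector(["hello", "hello", "no"]))   # 4-dimensional feature vector
```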
  • the text model is a pre-trained text model. After each feature vector is input into the text model, a corresponding classification result is obtained. Different feature vectors may yield different classification results; each classification result is assigned an emotion value, and the emotion values are then weighted according to a preset algorithm to obtain the final emotion information.
  • the preset algorithm may assign a weighting coefficient to each keyword, with the feature vector corresponding to each keyword weighted by that coefficient. For example, the weighting coefficient of the keyword "hello" is 0.2 and that of the keyword "goodbye" is 0.1.
  • when the final emotion information is computed, each emotion value is multiplied by its weighting coefficient and the products are summed to obtain the final emotion value, which corresponds to an emotion.
  • Those skilled in the art can also adjust the weight value in real time according to actual needs, so as to improve the accuracy of emotion recognition.
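  • as an illustrative sketch only, the keyword-weighted summation of emotion values described above could look like this; the coefficients and emotion values are the assumed example numbers:

```python
# Illustrative sketch: weight each feature vector's emotion value by its
# keyword's coefficient and sum the products to get the final emotion value.
KEYWORD_WEIGHTS = {"hello": 0.2, "goodbye": 0.1}   # assumed, per the example

def weighted_emotion_value(results: list) -> float:
    """results: list of (keyword, emotion value of that feature vector's classification)."""
    return sum(KEYWORD_WEIGHTS.get(keyword, 0.0) * value for keyword, value in results)

# "hello" classified as positive (+1) and "goodbye" as negative (-1)
print(weighted_emotion_value([("hello", 1.0), ("goodbye", -1.0)]))   # 0.2 - 0.1 = 0.1
```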
  • calculating the voice emotion recognition information and text emotion recognition information according to preset calculation rules to obtain emotion information includes:
  • assigning values to the voice emotion recognition information and the text emotion recognition information; adding the corresponding values to obtain a result value; and determining the emotion information of the voice signal according to the range in which the result value falls.
  • after the voice emotion recognition information and text emotion recognition information are obtained, emotion values are assigned according to this information, and the values are added to obtain the result value.
  • the value range can be set by those skilled in the art according to actual needs, and if each value falls within the corresponding value range, it is determined as the corresponding emotion.
  • the emotion recognition information can be determined as positive emotions, neutral emotions, and negative emotions, and their emotion values are +1, 0, and -1, respectively. If the voice emotion is recognized as a positive emotion, the value is +1, and the text emotion is recognized as a negative emotion, and the value is -1. After the two are added, the value is 0, so it is judged as a neutral emotion.
  • if the voice emotion is recognized as a positive emotion (value +1) and the text emotion is also recognized as a positive emotion (value +1), the sum is +2; since this is greater than 0, the result is judged to be a positive emotion.
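  • as an illustrative sketch only, the +1 / 0 / -1 fusion rule above could be written as follows; treating a sum greater than 0 as positive and a sum less than 0 as negative is an assumption consistent with the examples:

```python
# Illustrative sketch: fuse the voice and text emotion results by adding
# their assigned values and mapping the sum back to an emotion.
EMOTION_VALUES = {"positive": 1, "neutral": 0, "negative": -1}

def fuse_emotions(voice_emotion: str, text_emotion: str) -> str:
    total = EMOTION_VALUES[voice_emotion] + EMOTION_VALUES[text_emotion]
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(fuse_emotions("positive", "negative"))   # neutral (sum 0)
print(fuse_emotions("positive", "positive"))   # positive (sum +2)
```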
  • the emotion training model in this embodiment may be a conventional emotion training model in the field.
  • the emotion training model may be trained using TensorFlow, or an algorithm such as RNN may be used for model training.
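  • the patent only names TensorFlow and RNN-style algorithms as options; as an assumed example (not the authors' actual architecture), a minimal Keras classifier over N-dimensional feature vectors with the seven emotion classes could look like this:

```python
# Illustrative sketch: a small dense classifier trained with TensorFlow/Keras;
# the layer sizes, feature dimension, and training call are assumptions.
import tensorflow as tf

N_FEATURES = 1000   # assumed feature vector dimension (e.g. keyword dictionary size)
N_EMOTIONS = 7      # anger, boredom, disgust, fear, joy, neutral, sadness

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_FEATURES,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(N_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10)  # x_train: feature vectors, y_train: emotion ids
```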
  • Fig. 5 shows a block diagram of an emotion recognition system of the present invention.
  • the second aspect of the present invention provides an emotion recognition system 5, which includes a memory 51 and a processor 52.
  • the memory includes an emotion recognition method program, and when the emotion recognition method program is executed by the processor, the following steps are implemented:
  • collecting a voice signal;
  • processing the voice signal to obtain voice recognition information and text recognition information;
  • performing voice emotion recognition and text emotion recognition on the voice recognition information and text recognition information to obtain voice emotion recognition information and text emotion recognition information;
  • calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
  • the voice signal of the customer service representative or agent is collected in real time during the call.
  • the voice signal can be collected by sampling or with a fixed time window. For example, with sampling, speech from seconds 5-7, 9-11, and so on of the call is collected; with a fixed time window, speech from seconds 10-25 of the call is collected.
  • the voice signal is processed to obtain voice recognition information and text recognition information.
  • the voice recognition information is used to obtain emotion information through voice emotion recognition
  • the text recognition information is used to obtain emotion information through text emotion recognition.
  • the emotional information obtained by each different recognition method may not be the same, so in the end, it is necessary to comprehensively process the emotional information obtained by the two to obtain the emotional information. Through the comprehensive processing of the two recognition results, the accuracy of emotion recognition can be guaranteed.
  • the processing the voice signal to obtain voice recognition information includes:
  • segmenting the voice signal into multiple pieces of sub-voice information;
  • extracting feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
  • collecting statistics on the feature information in each piece of sub-voice information, and matching the feature information against multiple preset feature statistics;
  • recording the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
  • calculating the feature amount matching degree of each piece of sub-voice information from the matched feature information set and the total set of feature information of the sub-voice information;
  • determining the sub-voice information whose feature amount matching degree is greater than the preset feature amount threshold as the voice recognition information.
  • after the voice signal is collected, it is divided into multiple pieces of sub-voice information. The division may be based on time or quantity, or performed according to other rules.
  • for example, a collected 15-second voice signal is divided into 3-second segments of sub-voice information, 5 segments in total, in chronological order: the first 3 seconds form one segment, seconds 3-6 form the next, and so on.
  • the feature information of the sub-voice information is extracted and matched with multiple feature statistical information in the preset voice library.
  • the voice feature statistics information is pre-stored in the background database.
  • the voice feature statistics are words or sentences that, after screening and confirmation, better reflect emotions; they can be resources confirmed through experience and research. For example, useless tokens such as numbers, mathematical characters, punctuation marks, and extremely frequent Chinese characters are excluded from the feature statistics, while frequently used words or phrases that reflect emotional characteristics are included, such as "hello", "goodbye", and "no", or phrases like "is there anything else" and "let's leave it at that".
  • the feature amount matching degree of each piece of sub-voice information is then calculated. It should be noted that the more the sub-voice information overlaps with the preset feature statistics, the higher the matching degree.
  • the sub-voice information whose matching degree is greater than the preset feature amount threshold is determined as the recognized voice information.
  • the preset feature amount threshold can be 0.5, 0.7, and so on; that is, when the matching degree is greater than 0.5, the piece of sub-voice information is selected as recognized voice information. With this step, voice data with a low matching degree can be filtered out, improving the speed and efficiency of emotion recognition.
  • performing voice emotion recognition on the voice recognition information is specifically:
  • extracting feature information of the voice recognition information;
  • matching the feature information with an emotion training model to obtain a probability value for each different emotion;
  • selecting the emotion corresponding to a probability value greater than a preset emotion threshold to obtain the voice emotion recognition information of the voice signal.
  • the emotion training model comes from a speech emotion database (the Berlin emotion database); this voice database contains seven emotions: anger, boredom, disgust, fear, joy, neutral, and sadness, and the voice signals consist of sentences corresponding to the seven emotions, each demonstrated by a number of professional actors. It is worth noting that the present invention does not limit the types of emotions to be recognized; in another embodiment, the voice database may further include emotions other than the above seven. For example, in an exemplary embodiment of the present invention, 535 relatively complete, high-quality sentences are selected from the 700 recorded sentences as the data for training the voice emotion classification model.
  • the probability value of each different emotion will be obtained, and the probability value greater than the preset emotion threshold value will be selected as the corresponding emotion.
  • the probability value of the preset emotion threshold can be set by those skilled in the art according to actual needs and experience. For example, the probability value can be set to 70%, and emotions greater than 70% are determined as the final emotion recognition information.
  • if there are multiple probability values greater than the preset emotion threshold, the emotion corresponding to the average probability value of the multiple probability values is selected as the voice emotion recognition information of the voice signal.
  • it is worth mentioning that if multiple emotions exceed the threshold, for example the probability of anger is 80% and the probability of disgust is 75%, both greater than the 70% threshold, the one with the largest probability value is selected as the final emotion.
  • the present invention does not limit the specific implementation of selecting emotions by probability value; in other embodiments, other approaches can be chosen, for example averaging the emotion probability values recognized from multiple pieces of sub-voice information and determining the emotion with the highest probability as the final emotion.
  • performing text emotion recognition on text recognition information includes:
  • performing feature extraction on the text recognition information to generate multiple feature vectors; performing text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector; taking a value for the classification result of each feature vector; calculating the emotion value corresponding to the text recognition information from the taken values; and using the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
  • the feature extraction on the text recognition information to generate multiple feature vectors includes:
  • according to a pre-established keyword dictionary containing N keywords, calculating, for the text recognition information, the TF-IDF value corresponding to each keyword in the dictionary; and generating the corresponding feature vector from the TF-IDF values of the keywords.
  • the keyword dictionary mentioned here is extracted for the above-mentioned tested text set.
  • the dimension of the feature vector can be greatly reduced, thereby improving the efficiency of emotion classification.
  • the dimension of the feature vector is N, and the components in each dimension of the feature vector are the TF-IDF values corresponding to each keyword in the keyword dictionary.
  • the text model is a pre-trained text model. After each feature vector is input into the text model, a corresponding classification result is obtained. Different feature vectors may yield different classification results; each classification result is assigned an emotion value, and the emotion values are then weighted according to a preset algorithm to obtain the final emotion information.
  • the preset algorithm may assign a weighting coefficient to each keyword, with the feature vector corresponding to each keyword weighted by that coefficient. For example, the weighting coefficient of the keyword "hello" is 0.2 and that of the keyword "goodbye" is 0.1.
  • when the final emotion information is computed, each emotion value is multiplied by its weighting coefficient and the products are summed to obtain the final emotion value, which corresponds to an emotion.
  • Those skilled in the art can also adjust the weight value in real time according to actual needs, so as to improve the accuracy of emotion recognition.
  • calculating the voice emotion recognition information and text emotion recognition information according to preset calculation rules to obtain emotion information includes:
  • assigning values to the voice emotion recognition information and the text emotion recognition information; adding the corresponding values to obtain a result value; and determining the emotion information according to the range in which the result value falls.
  • after the voice emotion recognition information and text emotion recognition information are obtained, emotion values are assigned according to this information, and the values are added to obtain the result value.
  • the value range can be set by those skilled in the art according to actual needs, and if each value falls within the corresponding value range, it is determined as the corresponding emotion.
  • the emotion recognition information can be determined as positive emotions, neutral emotions, and negative emotions, and their emotion values are +1, 0, and -1, respectively. If the voice emotion is recognized as a positive emotion, the value is +1, and the text emotion is recognized as a negative emotion, and the value is -1. After the two are added, the value is 0, so it is judged as a neutral emotion.
  • if the voice emotion is recognized as a positive emotion (value +1) and the text emotion is also recognized as a positive emotion (value +1), the sum is +2; since this is greater than 0, the result is judged to be a positive emotion.
  • the emotion training model in this embodiment may be a conventional emotion training model in the field.
  • the emotion training model may be trained using TensorFlow, or an algorithm such as RNN may be used for model training.
  • the third aspect of the present invention provides a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium includes an emotion recognition method program. When the emotion recognition method program is executed by the processor, the steps of the emotion recognition method as described in any one of the above are realized.
  • the emotion recognition method, system and readable storage medium provided by the present invention perform emotion recognition by extracting speech and text from the speech signal, thereby improving the accuracy of emotion recognition. Through the screening of voice and text information, the efficiency and accuracy of processing are improved.
  • the present invention provides a concrete and effective solution for recognizing negative emotions in the customer service call center scenario, and plays an active and important role in improving customer service quality and providing a reference standard for performance evaluation of service personnel. For different application scenarios, the results of the voice and text emotion models are fused to meet the actual requirements of the business.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed on multiple network units; Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • the functional units in the embodiments of the present invention can all be integrated into one processing unit, or each unit can be used individually as a unit, or two or more units can be integrated into one unit; the above integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • the foregoing program can be stored in a computer readable storage medium.
  • when executed, the program performs the steps of the foregoing method embodiment; and the foregoing storage medium includes media that can store program code, such as a removable storage device, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
  • the aforementioned integrated unit of the present invention is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present invention.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

An emotion recognition method and device, electronic equipment, and a readable storage medium, belonging to the field of data recognition and processing. The method includes: collecting a voice signal (S102); processing the voice signal to obtain voice recognition information and text recognition information (S104); performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information (S106); and calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain emotion information (S108). By extracting speech and text from the voice signal and performing emotion recognition on both, the accuracy of emotion recognition is improved. By screening the voice and text information, the efficiency and accuracy of processing are improved, which plays an active and important role in improving customer service quality and in performance evaluation of service personnel.

Description

Emotion recognition method and device, electronic equipment, and readable storage medium
Technical Field
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on October 8, 2019, with application number 201910949733.2 and the invention title "Emotion recognition method and device, electronic equipment, and readable storage medium", the entire contents of which are incorporated herein by reference.
Background Art
The present invention belongs to the field of data recognition and processing, and more specifically relates to an emotion recognition method and device, electronic equipment, and a readable storage medium.
Technical Problem
A call center system is an operating system that uses modern communication and computer technology to automatically and flexibly handle a large number of different inbound/outbound telephone services to realize service operations. With economic development, the volume of customer service interactions handled by call center systems keeps growing. Using artificial intelligence to mine customer service call data and to track and monitor, in a timely and effective way, the emotional state of customer service staff and customers during calls is of great significance for enterprises to improve their service quality. At present, most companies mainly rely on hiring specialized quality inspectors to sample and monitor call recordings to achieve this goal. The applicant found that, on the one hand, this brings additional cost to the company; on the other hand, the uncertainty of sampling coverage and the subjectivity of human judgment mean that manual quality inspection has certain limitations. In addition, quality inspectors can only evaluate the emotional performance of the customer service staff and the customer after the call has ended and the recording is available; it is difficult to monitor the emotional state of the customer service staff and the customer in real time during the call, and when the customer service staff or the customer show very negative emotions during the call, there is no way to promptly and effectively alert the customer service staff.
At present, there are few products or studies that perform negative emotion recognition on conversational speech in customer service call centers. The applicant realized that most existing emotion recognition products perform emotion recognition from speech or text alone, and only under conditions of good speech or text quality and balanced samples. In real customer service call centers, however, speech quality is mostly poor and the samples are extremely imbalanced, so the emotions of customer service staff cannot be recognized well. At the same time, because companies want to improve customer service quality and conduct performance evaluation of service staff, the staff are particularly concerned with whether the small number of negative emotion categories are recognized correctly. Most existing emotion recognition products are not suitable for the customer service call center scenario, so a method that can improve emotion recognition urgently needs to be designed.
Technical Solution
In order to solve at least one of the above technical problems, the present invention proposes an emotion recognition method and device, electronic equipment, and a readable storage medium.
A first aspect of the present invention provides an emotion recognition method, including:
collecting a voice signal;
processing the voice signal to obtain voice recognition information and text recognition information;
performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information;
calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
A second aspect of the present application provides an emotion recognition device, including:
a collection module, configured to collect a voice signal;
a processing module, configured to process the voice signal to obtain voice recognition information and text recognition information;
a recognition module, configured to perform voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information;
a calculation module, configured to calculate the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
A third aspect of the present invention provides an electronic device, including a memory and a processor, where the memory includes an emotion recognition method program, and when the emotion recognition method program is executed by the processor, the steps of the emotion recognition method described above are implemented.
A fourth aspect of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes an emotion recognition method program, and when the emotion recognition method program is executed by a processor, the steps of the emotion recognition method described above are implemented.
The emotion recognition method, system, and readable storage medium provided by the present invention perform emotion recognition by extracting speech and text from the voice signal, thereby improving the accuracy of emotion recognition. By screening the voice and text information, the efficiency and accuracy of processing are improved. The present invention provides a concrete and effective solution for recognizing negative emotions in the customer service call center scenario, and plays an active and important role in improving customer service quality and providing a reference standard for performance evaluation of service personnel. For different application scenarios, the results of the voice and text emotion models are fused to meet the actual requirements of the business.
Brief Description of the Drawings
Figure 1 shows a flowchart of an emotion recognition method of the present invention;
Figure 2 shows a flowchart of voice information processing according to the present invention;
Figure 3 shows a flowchart of voice emotion recognition of the present invention;
Figure 4 shows a flowchart of text emotion recognition of the present invention;
Figure 5 shows a block diagram of an emotion recognition system of the present invention.
Description of Embodiments
In order to more clearly understand the above objects, features, and advantages of the present invention, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.
Many specific details are set forth in the following description to facilitate a full understanding of the present invention; however, the present invention can also be implemented in other ways different from those described here. Therefore, the protection scope of the present invention is not limited by the specific embodiments disclosed below.
Figure 1 shows a flowchart of an emotion recognition method of the present invention.
As shown in Figure 1, the present invention discloses an emotion recognition method, including:
S102: collecting a voice signal;
S104: processing the voice signal to obtain voice recognition information and text recognition information;
S106: performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information;
S108: calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
It should be noted that the voice signal of the customer service representative or agent is collected in real time during the call. The voice signal may be collected by sampling or with a fixed time window. For example, with sampling, speech from seconds 5-7, 9-11, and so on of the call is collected; with a fixed time window, speech from seconds 10-25 of the call is collected. Those skilled in the art can choose the collection method according to actual needs, but any method that uses the present invention to collect voice and judge emotion falls within the protection scope of the present invention.
Further, after the voice signal is collected, it is processed to obtain voice recognition information and text recognition information. The voice recognition information is used to obtain emotion information through voice emotion recognition, and the text recognition information is used to obtain emotion information through text emotion recognition. The emotion information obtained by the two different recognition methods may not be the same, so in the end the two results need to be processed together to obtain the final emotion information. Comprehensive processing of the two recognition results ensures the accuracy of emotion recognition.
Figure 2 shows a flowchart of voice information processing according to the present invention. According to an embodiment of the present invention, processing the voice signal to obtain voice recognition information includes:
S202: segmenting the voice signal into multiple pieces of sub-voice information;
S204: extracting feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
S206: collecting statistics on the feature information in each piece of sub-voice information, and matching the feature information against multiple preset feature statistics;
S208: recording the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
S210: calculating a feature matching degree of each piece of sub-voice information from the feature information set that matches the multiple feature statistics and the total set of feature information of the sub-voice information;
S212: determining the sub-voice information whose feature matching degree is greater than a preset feature threshold as the voice recognition information.
It should be noted that after the voice signal is collected, it is divided into multiple pieces of sub-voice information. The division may be based on time or quantity, or performed according to other rules. For example, a collected 15-second voice signal is divided into 3-second segments of sub-voice information, 5 segments in total, in chronological order: the first 3 seconds form one segment, seconds 3-6 form the next, and so on.
Further, after the segmentation into multiple pieces of sub-voice information, the feature information of the sub-voice information is extracted and matched against multiple feature statistics in a preset voice library. It is worth mentioning that voice feature statistics are pre-stored in a background database. The voice feature statistics are words or sentences that, after screening and confirmation, better reflect emotions, and can be resources confirmed through experience and research. For example, useless tokens such as numbers, mathematical characters, punctuation marks, and extremely frequent Chinese characters are excluded from the feature statistics, while frequently used words or phrases that reflect emotional characteristics are included, such as "hello", "goodbye", and "no", or phrases like "is there anything else" and "let's leave it at that". After matching against the multiple preset feature statistics, the feature matching degree of each piece of sub-voice information is calculated. It should be noted that the more the sub-voice information overlaps with the preset feature statistics, the higher the matching degree. Sub-voice information whose matching degree is greater than the preset feature threshold is determined as the recognized voice information. Those skilled in the art can choose the preset feature threshold according to actual needs, for example 0.5 or 0.7; that is, when the matching degree is greater than 0.5, the piece of sub-voice information is selected as the recognized voice information. With this step, voice data with a low matching degree can be filtered out, improving the speed and efficiency of emotion recognition.
Figure 3 shows a flowchart of voice emotion recognition of the present invention. As shown in Figure 3, according to an embodiment of the present invention, performing voice emotion recognition on the voice recognition information specifically includes:
S302: extracting feature information of the voice recognition information;
S304: matching the feature information with an emotion training model to obtain a probability value for each different emotion;
S306: selecting the emotion corresponding to a probability value greater than a preset emotion threshold to obtain the voice emotion recognition information of the voice signal.
It should be noted that after the voice recognition information is obtained, its feature information is extracted. The emotion training model comes from a speech emotion database (the Berlin emotion database); the model has been trained by machine learning and can be used to classify feature information representing emotions. This voice database contains seven emotions: anger, boredom, disgust, fear, joy, neutral, and sadness, and the voice signals consist of sentences corresponding to the seven emotions, each demonstrated by a number of professional actors. It is worth noting that the present invention does not limit the types of emotions to be recognized; in other words, in another embodiment, the voice database may further include emotions other than the above seven. For example, in an exemplary embodiment of the present invention, 535 relatively complete, high-quality sentences are selected from the 700 recorded sentences as the data for training the voice emotion classification model.
Further, after matching with the emotion training model, a probability value for each different emotion is obtained, and probability values greater than a preset emotion threshold are selected as the corresponding emotions. The preset emotion threshold can be set by those skilled in the art according to actual needs and experience; for example, the threshold can be set to 70%, and emotions with probabilities greater than 70% are determined as the final emotion recognition information.
In an embodiment of the present invention, the method further includes:
if there are multiple probability values greater than the preset emotion threshold,
selecting the emotion corresponding to the average probability value of the multiple probability values as the voice emotion recognition information of the voice signal.
It is worth mentioning that if multiple emotions exceed the threshold, for example the probability of anger is 80% and the probability of disgust is 75%, both greater than the 70% threshold, the one with the largest probability value is selected as the final emotion. The present invention does not limit the specific implementation of selecting emotions by probability value; that is, in other embodiments, other approaches to probability-based emotion recognition can be chosen, for example averaging the emotion probability values recognized from multiple pieces of sub-voice information and determining the emotion with the highest probability as the final emotion.
Figure 4 schematically shows a flowchart of text emotion recognition. As shown in Figure 4, according to an embodiment of the present invention, performing text emotion recognition on the text recognition information includes:
S402: performing feature extraction on the text recognition information to generate multiple feature vectors;
S404: performing text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector;
S406: taking a value for the classification result of each feature vector;
S408: calculating the emotion value corresponding to the text recognition information from the taken values;
S410: using the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
It should be noted that performing feature extraction on the text recognition information to generate multiple feature vectors includes: according to a pre-established keyword dictionary containing N keywords, calculating, for the text recognition information, the TF-IDF value corresponding to each keyword in the dictionary; and generating the corresponding feature vector from the TF-IDF values of the keywords.
The keyword dictionary mentioned here is extracted from the text set under test; by extracting keywords, the dimension of the feature vector can be greatly reduced, thereby improving the efficiency of emotion classification. The dimension of the feature vector is N, and the component in each dimension of the feature vector is the TF-IDF value corresponding to the respective keyword in the keyword dictionary.
It should be noted that the text model is a pre-trained text model. After each feature vector is input into the text model, a corresponding classification result is obtained. Different feature vectors may yield different classification results; each classification result is assigned an emotion value, and the emotion values are then weighted according to a preset algorithm to obtain the final emotion information. The preset algorithm may assign a weighting coefficient to each keyword, with the feature vector corresponding to each keyword weighted by that coefficient. For example, the weighting coefficient of the keyword "hello" is 0.2 and that of the keyword "goodbye" is 0.1; when the final emotion information is computed, each emotion value is multiplied by its weighting coefficient and the products are summed to obtain the final emotion value, which corresponds to an emotion. Those skilled in the art can also adjust the weights in real time according to actual needs to improve the accuracy of emotion recognition.
According to an embodiment of the present invention, calculating the voice emotion recognition information and the text emotion recognition information according to the preset calculation rule to obtain the emotion information includes:
assigning values to the voice emotion recognition information and the text emotion recognition information;
adding the corresponding values to obtain a result value;
determining the emotion information of the voice signal according to the range in which the result value falls.
It should be noted that after the voice emotion recognition information and the text emotion recognition information are obtained, emotion values are assigned according to this information, and the values are added to obtain the result value. The value ranges can be set by those skilled in the art according to actual needs, and a value falling within a given range is determined as the corresponding emotion. For example, the emotion recognition information can be classified as positive, neutral, or negative, with emotion values of +1, 0, and -1 respectively. If the voice emotion is recognized as positive (value +1) and the text emotion is recognized as negative (value -1), the sum is 0, so the result is judged to be a neutral emotion. If the voice emotion is recognized as positive (value +1) and the text emotion is also recognized as positive (value +1), the sum is +2; since this is greater than 0, the result is judged to be a positive emotion.
It should be noted that the emotion training model in this embodiment may be a conventional emotion training model in the field; for example, the emotion training model may be trained with TensorFlow, or an algorithm such as an RNN may be used for model training.
Figure 5 shows a block diagram of an emotion recognition system of the present invention.
As shown in Figure 5, the second aspect of the present invention provides an emotion recognition system 5, which includes a memory 51 and a processor 52. The memory includes an emotion recognition method program, and when the emotion recognition method program is executed by the processor, the following steps are implemented:
collecting a voice signal;
processing the voice signal to obtain voice recognition information and text recognition information;
performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information;
calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain the emotion information of the voice signal.
It should be noted that the voice signal of the customer service representative or agent is collected in real time during the call. The voice signal may be collected by sampling or with a fixed time window. For example, with sampling, speech from seconds 5-7, 9-11, and so on of the call is collected; with a fixed time window, speech from seconds 10-25 of the call is collected. Those skilled in the art can choose the collection method according to actual needs, but any method that uses the present invention to collect voice and judge emotion falls within the protection scope of the present invention.
Further, after the voice signal is collected, it is processed to obtain voice recognition information and text recognition information. The voice recognition information is used to obtain emotion information through voice emotion recognition, and the text recognition information is used to obtain emotion information through text emotion recognition. The emotion information obtained by the two different recognition methods may not be the same, so in the end the two results need to be processed together to obtain the final emotion information. Comprehensive processing of the two recognition results ensures the accuracy of emotion recognition.
According to an embodiment of the present invention, processing the voice signal to obtain voice recognition information includes:
segmenting the voice signal into multiple pieces of sub-voice information;
extracting feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
collecting statistics on the feature information in each piece of sub-voice information, and matching the feature information against multiple preset feature statistics;
recording the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
calculating a feature matching degree of each piece of sub-voice information from the matched feature information set and the total set of feature information of the sub-voice information;
determining the sub-voice information whose feature matching degree is greater than a preset feature threshold as the voice recognition information.
It should be noted that after the voice signal is collected, it is divided into multiple pieces of sub-voice information. The division may be based on time or quantity, or performed according to other rules. For example, a collected 15-second voice signal is divided into 3-second segments of sub-voice information, 5 segments in total, in chronological order: the first 3 seconds form one segment, seconds 3-6 form the next, and so on.
Further, after the segmentation into multiple pieces of sub-voice information, the feature information of the sub-voice information is extracted and matched against multiple feature statistics in a preset voice library. It is worth mentioning that voice feature statistics are pre-stored in a background database. The voice feature statistics are words or sentences that, after screening and confirmation, better reflect emotions, and can be resources confirmed through experience and research. For example, useless tokens such as numbers, mathematical characters, punctuation marks, and extremely frequent Chinese characters are excluded from the feature statistics, while frequently used words or phrases that reflect emotional characteristics are included, such as "hello", "goodbye", and "no", or phrases like "is there anything else" and "let's leave it at that". After matching against the multiple preset feature statistics, the feature matching degree of each piece of sub-voice information is calculated. It should be noted that the more the sub-voice information overlaps with the preset feature statistics, the higher the matching degree. Sub-voice information whose matching degree is greater than the preset feature threshold is determined as the recognized voice information. Those skilled in the art can choose the preset feature threshold according to actual needs, for example 0.5 or 0.7; that is, when the matching degree is greater than 0.5, the piece of sub-voice information is selected as the recognized voice information. With this step, voice data with a low matching degree can be filtered out, improving the speed and efficiency of emotion recognition.
According to an embodiment of the present invention, performing voice emotion recognition on the voice recognition information specifically includes:
extracting feature information of the voice recognition information;
matching the feature information with an emotion training model to obtain a probability value for each different emotion;
selecting the emotion corresponding to a probability value greater than a preset emotion threshold to obtain the voice emotion recognition information of the voice signal.
It should be noted that after the voice recognition information is obtained, its feature information is extracted. The emotion training model comes from a speech emotion database (the Berlin emotion database); this voice database contains seven emotions: anger, boredom, disgust, fear, joy, neutral, and sadness, and the voice signals consist of sentences corresponding to the seven emotions, each demonstrated by a number of professional actors. It is worth noting that the present invention does not limit the types of emotions to be recognized; in other words, in another embodiment, the voice database may further include emotions other than the above seven. For example, in an exemplary embodiment of the present invention, 535 relatively complete, high-quality sentences are selected from the 700 recorded sentences as the data for training the voice emotion classification model.
Further, after matching with the emotion training model, a probability value for each different emotion is obtained, and probability values greater than a preset emotion threshold are selected as the corresponding emotions. The preset emotion threshold can be set by those skilled in the art according to actual needs and experience; for example, the threshold can be set to 70%, and emotions with probabilities greater than 70% are determined as the final emotion recognition information.
In an embodiment of the present invention, the method further includes:
if there are multiple emotions greater than the preset emotion threshold,
selecting the emotion corresponding to the average probability value of the multiple probability values as the voice emotion recognition information of the voice signal.
It is worth mentioning that if multiple emotions exceed the threshold, for example the probability of anger is 80% and the probability of disgust is 75%, both greater than the 70% threshold, the one with the largest probability value is selected as the final emotion. The present invention does not limit the specific implementation of selecting emotions by probability value; that is, in other embodiments, other approaches to probability-based emotion recognition can be chosen, for example averaging the emotion probability values recognized from multiple pieces of sub-voice information and determining the emotion with the highest probability as the final emotion.
According to an embodiment of the present invention, performing text emotion recognition on the text recognition information includes:
performing feature extraction on the text recognition information to generate multiple feature vectors;
performing text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector;
taking a value for the classification result of each feature vector;
calculating the emotion value corresponding to the text recognition information from the taken values;
using the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
It should be noted that performing feature extraction on the text recognition information to generate multiple feature vectors includes:
according to a pre-established keyword dictionary containing N keywords, calculating, for the text recognition information, the TF-IDF value corresponding to each keyword in the dictionary;
generating the corresponding feature vector according to the TF-IDF value corresponding to each keyword.
The keyword dictionary mentioned here is extracted from the text set under test; by extracting keywords, the dimension of the feature vector can be greatly reduced, thereby improving the efficiency of emotion classification. The dimension of the feature vector is N, and the component in each dimension of the feature vector is the TF-IDF value corresponding to the respective keyword in the keyword dictionary.
It should be noted that the text model is a pre-trained text model. After each feature vector is input into the text model, a corresponding classification result is obtained. Different feature vectors may yield different classification results; each classification result is assigned an emotion value, and the emotion values are then weighted according to a preset algorithm to obtain the final emotion information. The preset algorithm may assign a weighting coefficient to each keyword, with the feature vector corresponding to each keyword weighted by that coefficient. For example, the weighting coefficient of the keyword "hello" is 0.2 and that of the keyword "goodbye" is 0.1; when the final emotion information is computed, each emotion value is multiplied by its weighting coefficient and the products are summed to obtain the final emotion value, which corresponds to an emotion. Those skilled in the art can also adjust the weights in real time according to actual needs to improve the accuracy of emotion recognition.
According to an embodiment of the present invention, calculating the voice emotion recognition information and the text emotion recognition information according to the preset calculation rule to obtain the emotion information includes:
assigning values to the voice emotion recognition information and the text emotion recognition information;
adding the corresponding values to obtain a result value;
determining the emotion information according to the range in which the result value falls.
It should be noted that after the voice emotion recognition information and the text emotion recognition information are obtained, emotion values are assigned according to this information, and the values are added to obtain the result value. The value ranges can be set by those skilled in the art according to actual needs, and a value falling within a given range is determined as the corresponding emotion. For example, the emotion recognition information can be classified as positive, neutral, or negative, with emotion values of +1, 0, and -1 respectively. If the voice emotion is recognized as positive (value +1) and the text emotion is recognized as negative (value -1), the sum is 0, so the result is judged to be a neutral emotion. If the voice emotion is recognized as positive (value +1) and the text emotion is also recognized as positive (value +1), the sum is +2; since this is greater than 0, the result is judged to be a positive emotion.
It should be noted that the emotion training model in this embodiment may be a conventional emotion training model in the field; for example, the emotion training model may be trained with TensorFlow, or an algorithm such as an RNN may be used for model training.
The third aspect of the present invention provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium includes an emotion recognition method program, and when the emotion recognition method program is executed by a processor, the steps of the emotion recognition method described in any one of the above are implemented.
The emotion recognition method, system, and readable storage medium provided by the present invention perform emotion recognition by extracting speech and text from the voice signal, thereby improving the accuracy of emotion recognition. By screening the voice and text information, the efficiency and accuracy of processing are improved. The present invention provides a concrete and effective solution for recognizing negative emotions in the customer service call center scenario, and plays an active and important role in improving customer service quality and providing a reference standard for performance evaluation of service personnel. For different application scenarios, the results of the voice and text emotion models are fused to meet the actual requirements of the business.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions. The foregoing program can be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The foregoing storage medium includes media that can store program code, such as a removable storage device, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Alternatively, if the above integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present invention. The foregoing storage medium includes media that can store program code, such as a removable storage device, ROM, RAM, magnetic disks, or optical disks.
The above is only a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed by the present invention, which should all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (20)

  1. An emotion recognition method, comprising:
    collecting a voice signal;
    processing the voice signal to obtain voice recognition information and text recognition information;
    performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information; and
    calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain emotion information of the voice signal.
  2. The emotion recognition method according to claim 1, wherein processing the voice signal to obtain voice recognition information comprises:
    segmenting the voice signal into multiple pieces of sub-voice information;
    extracting feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
    collecting statistics on the feature information in each piece of sub-voice information, and matching the feature information against multiple preset feature statistics;
    recording the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
    calculating a feature matching degree of each piece of sub-voice information from the feature information set that matches the multiple feature statistics and the total set of feature information of the sub-voice information; and
    determining the sub-voice information whose feature matching degree is greater than a preset feature threshold as the voice recognition information.
  3. The emotion recognition method according to claim 1, wherein performing voice emotion recognition on the voice recognition information specifically comprises:
    extracting feature information of the voice recognition information;
    matching the feature information with a preset emotion training model to obtain a probability value for each different emotion; and
    selecting the emotion corresponding to a probability value greater than a preset emotion threshold as the voice emotion recognition information of the voice signal.
  4. The emotion recognition method according to claim 3, further comprising:
    if there are multiple probability values greater than the preset emotion threshold, selecting the emotion corresponding to the average probability value of the multiple probability values as the voice emotion recognition information of the voice signal.
  5. The emotion recognition method according to claim 1, wherein performing text emotion recognition on the text recognition information comprises:
    performing feature extraction on the text recognition information to generate multiple feature vectors;
    performing text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector;
    taking a value for the classification result of each feature vector;
    calculating an emotion value corresponding to the text recognition information from the taken values; and
    using the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
  6. The emotion recognition method according to claim 5, wherein performing feature extraction on the text recognition information to generate multiple feature vectors comprises:
    according to a pre-established keyword dictionary containing N keywords, calculating, for the text recognition information, the TF-IDF value corresponding to each keyword in the keyword dictionary; and
    generating the corresponding feature vector according to the TF-IDF value corresponding to each keyword.
  7. The emotion recognition method according to claim 1, wherein calculating the voice emotion recognition information and the text emotion recognition information according to the preset calculation rule to obtain the emotion information comprises:
    assigning values to the voice emotion recognition information and the text emotion recognition information;
    adding the corresponding values to obtain a result value; and
    determining the emotion information of the voice signal according to the range in which the result value falls.
  8. An emotion recognition apparatus, comprising:
    a collection module, configured to collect a voice signal;
    a processing module, configured to process the voice signal to obtain voice recognition information and text recognition information;
    a recognition module, configured to perform voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information; and
    a calculation module, configured to calculate the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain emotion information of the voice signal.
  9. An electronic device, comprising a memory and a processor, wherein the memory includes an emotion recognition method program, and when the emotion recognition method program is executed by the processor, the following steps are implemented:
    collecting a voice signal;
    processing the voice signal to obtain voice recognition information and text recognition information;
    performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information; and
    calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain emotion information of the voice signal.
  10. The electronic device according to claim 9, wherein processing the voice signal to obtain voice recognition information comprises:
    segmenting the voice signal into multiple pieces of sub-voice information;
    extracting feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
    collecting statistics on the feature information in each piece of sub-voice information, and matching the feature information against multiple preset feature statistics;
    recording the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
    calculating a feature matching degree of each piece of sub-voice information from the feature information set that matches the multiple feature statistics and the total set of feature information of the sub-voice information; and
    determining the sub-voice information whose feature matching degree is greater than a preset feature threshold as the voice recognition information.
  11. The electronic device according to claim 9, wherein performing voice emotion recognition on the voice recognition information specifically comprises:
    extracting feature information of the voice recognition information;
    matching the feature information with a preset emotion training model to obtain a probability value for each different emotion; and
    selecting the emotion corresponding to a probability value greater than a preset emotion threshold as the voice emotion recognition information of the voice signal.
  12. The electronic device according to claim 11, further comprising:
    if there are multiple probability values greater than the preset emotion threshold, selecting the emotion corresponding to the average probability value of the multiple probability values as the voice emotion recognition information of the voice signal.
  13. The electronic device according to claim 9, wherein performing text emotion recognition on the text recognition information comprises:
    performing feature extraction on the text recognition information to generate multiple feature vectors;
    performing text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector;
    taking a value for the classification result of each feature vector;
    calculating an emotion value corresponding to the text recognition information from the taken values; and
    using the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
  14. The electronic device according to claim 13, wherein performing feature extraction on the text recognition information to generate multiple feature vectors comprises:
    according to a pre-established keyword dictionary containing N keywords, calculating, for the text recognition information, the TF-IDF value corresponding to each keyword in the keyword dictionary; and
    generating the corresponding feature vector according to the TF-IDF value corresponding to each keyword.
  15. The electronic device according to claim 9, wherein calculating the voice emotion recognition information and the text emotion recognition information according to the preset calculation rule to obtain the emotion information comprises:
    assigning values to the voice emotion recognition information and the text emotion recognition information;
    adding the corresponding values to obtain a result value; and
    determining the emotion information of the voice signal according to the range in which the result value falls.
  16. A computer-readable storage medium, wherein the computer-readable storage medium includes an emotion recognition method program, and when the emotion recognition method program is executed by a processor, the following steps are implemented:
    collecting a voice signal;
    processing the voice signal to obtain voice recognition information and text recognition information;
    performing voice emotion recognition and text emotion recognition on the voice recognition information and the text recognition information to obtain voice emotion recognition information and text emotion recognition information; and
    calculating the voice emotion recognition information and the text emotion recognition information according to a preset calculation rule to obtain emotion information of the voice signal.
  17. The computer-readable storage medium according to claim 16, wherein processing the voice signal to obtain voice recognition information comprises:
    segmenting the voice signal into multiple pieces of sub-voice information;
    extracting feature information of the multiple pieces of sub-voice information, the feature information of each piece of sub-voice information forming a total set of feature information of the sub-voice information;
    collecting statistics on the feature information in each piece of sub-voice information, and matching the feature information against multiple preset feature statistics;
    recording the set of feature information in each piece of sub-voice information that matches the multiple feature statistics;
    calculating a feature matching degree of each piece of sub-voice information from the feature information set that matches the multiple feature statistics and the total set of feature information of the sub-voice information; and
    determining the sub-voice information whose feature matching degree is greater than a preset feature threshold as the voice recognition information.
  18. The computer-readable storage medium according to claim 16, wherein performing voice emotion recognition on the voice recognition information specifically comprises:
    extracting feature information of the voice recognition information;
    matching the feature information with a preset emotion training model to obtain a probability value for each different emotion; and
    selecting the emotion corresponding to a probability value greater than a preset emotion threshold as the voice emotion recognition information of the voice signal.
  19. The computer-readable storage medium according to claim 18, further comprising:
    if there are multiple probability values greater than the preset emotion threshold, selecting the emotion corresponding to the average probability value of the multiple probability values as the voice emotion recognition information of the voice signal.
  20. The computer-readable storage medium according to claim 16, wherein performing text emotion recognition on the text recognition information comprises:
    performing feature extraction on the text recognition information to generate multiple feature vectors;
    performing text model matching on the multiple feature vectors respectively to obtain a classification result for each feature vector;
    taking a value for the classification result of each feature vector;
    calculating an emotion value corresponding to the text recognition information from the taken values; and
    using the emotion corresponding to the emotion value as the text emotion recognition information of the voice signal.
PCT/CN2020/119487 2019-10-08 2020-09-30 Emotion recognition method and device, electronic equipment, and readable storage medium WO2021068843A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910949733.2 2019-10-08
CN201910949733.2A CN110910901B (zh) 2019-10-08 Emotion recognition method and device, electronic equipment, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021068843A1 true WO2021068843A1 (zh) 2021-04-15

Family

ID=69815193

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119487 WO2021068843A1 (zh) 2019-10-08 2020-09-30 Emotion recognition method and device, electronic equipment, and readable storage medium

Country Status (2)

Country Link
CN (1) CN110910901B (zh)
WO (1) WO2021068843A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539261A (zh) * 2021-06-30 2021-10-22 大众问问(北京)信息科技有限公司 人机语音交互方法、装置、计算机设备和存储介质
CN113704504A (zh) * 2021-08-30 2021-11-26 平安银行股份有限公司 基于聊天记录的情绪识别方法、装置、设备及存储介质
CN114312997A (zh) * 2021-12-09 2022-04-12 科大讯飞股份有限公司 一种车辆转向控制方法、装置、系统和存储介质
CN114463827A (zh) * 2022-04-12 2022-05-10 之江实验室 一种基于ds证据理论的多模态实时情绪识别方法及系统
CN115578115A (zh) * 2022-09-21 2023-01-06 支付宝(杭州)信息技术有限公司 资源抽选处理方法及装置
CN116564281A (zh) * 2023-07-06 2023-08-08 世优(北京)科技有限公司 基于ai的情绪识别方法及装置
WO2024040793A1 (zh) * 2022-08-26 2024-02-29 天翼电子商务有限公司 一种结合分层策略的多模态情绪识别方法

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910901B (zh) * 2019-10-08 2023-03-28 平安科技(深圳)有限公司 一种情绪识别方法及装置、电子设备和可读存储介质
CN111694938B (zh) * 2020-04-27 2024-05-14 平安科技(深圳)有限公司 基于情绪识别的答复方法、装置、计算机设备及存储介质
CN111583968A (zh) * 2020-05-25 2020-08-25 桂林电子科技大学 一种语音情感识别方法和系统
CN111883113B (zh) * 2020-07-30 2024-01-30 云知声智能科技股份有限公司 一种语音识别的方法及装置
CN114254136A (zh) * 2020-09-23 2022-03-29 上海哔哩哔哩科技有限公司 情绪识别与引导方法、装置、设备及可读存储介质
CN113037610B (zh) * 2021-02-25 2022-08-19 腾讯科技(深圳)有限公司 语音数据处理方法、装置、计算机设备和存储介质
CN112951233A (zh) * 2021-03-30 2021-06-11 平安科技(深圳)有限公司 语音问答方法、装置、电子设备及可读存储介质
CN113314150A (zh) * 2021-05-26 2021-08-27 平安普惠企业管理有限公司 基于语音数据的情绪识别方法、装置及存储介质
CN113810548A (zh) * 2021-09-17 2021-12-17 广州科天视畅信息科技有限公司 基于iot的智能通话质检方法系统
CN113902404A (zh) * 2021-09-29 2022-01-07 平安银行股份有限公司 基于人工智能的员工晋升分析方法、装置、设备及介质
CN113987123A (zh) * 2021-10-27 2022-01-28 建信金融科技有限责任公司 一种情绪识别方法、装置、设备及介质
CN113743126B (zh) * 2021-11-08 2022-06-14 北京博瑞彤芸科技股份有限公司 一种基于用户情绪的智能交互方法和装置
CN114171063A (zh) * 2021-12-08 2022-03-11 国家电网有限公司客户服务中心 一种实时话务客户情绪分析辅助方法及系统
CN114298019A (zh) * 2021-12-29 2022-04-08 中国建设银行股份有限公司 情绪识别方法、装置、设备、存储介质、程序产品
CN114662499A (zh) * 2022-03-17 2022-06-24 平安科技(深圳)有限公司 基于文本的情绪识别方法、装置、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305642A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN108305641A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN108305643A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
US20190325897A1 (en) * 2018-04-21 2019-10-24 International Business Machines Corporation Quantifying customer care utilizing emotional assessments
CN110390956A (zh) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 情感识别网络模型、方法及电子设备
CN110910901A (zh) * 2019-10-08 2020-03-24 平安科技(深圳)有限公司 一种情绪识别方法及装置、电子设备和可读存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100530196C (zh) * 2007-11-16 2009-08-19 北京交通大学 一种基于分层匹配的快速音频广告识别方法
CN109948124B (zh) * 2019-03-15 2022-12-23 腾讯科技(深圳)有限公司 语音文件切分方法、装置及计算机设备
JP2021124530A (ja) * 2020-01-31 2021-08-30 Hmcomm株式会社 情報処理装置、情報処理方法及びプログラム

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305642A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN108305641A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
CN108305643A (zh) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
US20190325897A1 (en) * 2018-04-21 2019-10-24 International Business Machines Corporation Quantifying customer care utilizing emotional assessments
CN110390956A (zh) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 情感识别网络模型、方法及电子设备
CN110910901A (zh) * 2019-10-08 2020-03-24 平安科技(深圳)有限公司 一种情绪识别方法及装置、电子设备和可读存储介质

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539261A (zh) * 2021-06-30 2021-10-22 大众问问(北京)信息科技有限公司 人机语音交互方法、装置、计算机设备和存储介质
CN113704504A (zh) * 2021-08-30 2021-11-26 平安银行股份有限公司 基于聊天记录的情绪识别方法、装置、设备及存储介质
CN113704504B (zh) * 2021-08-30 2023-09-19 平安银行股份有限公司 基于聊天记录的情绪识别方法、装置、设备及存储介质
CN114312997A (zh) * 2021-12-09 2022-04-12 科大讯飞股份有限公司 一种车辆转向控制方法、装置、系统和存储介质
CN114312997B (zh) * 2021-12-09 2023-04-07 科大讯飞股份有限公司 一种车辆转向控制方法、装置、系统和存储介质
CN114463827A (zh) * 2022-04-12 2022-05-10 之江实验室 一种基于ds证据理论的多模态实时情绪识别方法及系统
WO2024040793A1 (zh) * 2022-08-26 2024-02-29 天翼电子商务有限公司 一种结合分层策略的多模态情绪识别方法
CN115578115A (zh) * 2022-09-21 2023-01-06 支付宝(杭州)信息技术有限公司 资源抽选处理方法及装置
CN115578115B (zh) * 2022-09-21 2023-09-08 支付宝(杭州)信息技术有限公司 资源抽选处理方法及装置
CN116564281A (zh) * 2023-07-06 2023-08-08 世优(北京)科技有限公司 基于ai的情绪识别方法及装置
CN116564281B (zh) * 2023-07-06 2023-09-05 世优(北京)科技有限公司 基于ai的情绪识别方法及装置

Also Published As

Publication number Publication date
CN110910901B (zh) 2023-03-28
CN110910901A (zh) 2020-03-24

Similar Documents

Publication Publication Date Title
WO2021068843A1 (zh) 一种情绪识别方法及装置、电子设备和可读存储介质
CN109767791B (zh) 一种针对呼叫中心通话的语音情绪识别及应用系统
US8676586B2 (en) Method and apparatus for interaction or discourse analytics
US8145482B2 (en) Enhancing analysis of test key phrases from acoustic sources with key phrase training models
CN105874530B (zh) 预测自动语音识别系统中的短语识别质量
CN102623011B (zh) 信息处理装置、信息处理方法及信息处理系统
US8615419B2 (en) Method and apparatus for predicting customer churn
US7596498B2 (en) Monitoring, mining, and classifying electronically recordable conversations
US20170084272A1 (en) System and method for analyzing and classifying calls without transcription via keyword spotting
US20100332287A1 (en) System and method for real-time prediction of customer satisfaction
CN110310663A (zh) 违规话术检测方法、装置、设备及计算机可读存储介质
US20090043573A1 (en) Method and apparatus for recognizing a speaker in lawful interception systems
CN113468296A (zh) 可配置业务逻辑的模型自迭代式智能客服质检系统与方法
CN105141787A (zh) 服务录音的合规检查方法及装置
CN112966082B (zh) 音频质检方法、装置、设备以及存储介质
CN106202031B (zh) 一种基于群聊数据对群成员进行关联的系统及方法
CN105808721A (zh) 一种基于数据挖掘的客服内容分析方法及其系统
US20150066549A1 (en) System, Method and Apparatus for Voice Analytics of Recorded Audio
CN111010484A (zh) 一种通话录音自动质检方法
CN113434670A (zh) 话术文本生成方法、装置、计算机设备和存储介质
CN110705309A (zh) 服务质量评测方法及系统
CN116071032A (zh) 基于深度学习的人力资源面试识别方法、装置及存储介质
CN113505606A (zh) 一种培训信息获取方法、装置、电子设备及存储介质
CN116828109A (zh) 一种电话客服服务质量智能评估方法及系统
CN116883888A (zh) 基于多模态特征融合的银行柜面服务问题溯源系统及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20875020

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20875020

Country of ref document: EP

Kind code of ref document: A1