WO2020082734A1 - Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium - Google Patents


Info

Publication number
WO2020082734A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
sample
emotion
cost
error rate
Prior art date
Application number
PCT/CN2019/089166
Other languages
French (fr)
Chinese (zh)
Inventor
方豪
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020082734A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification

Definitions

  • the present application relates to the field of artificial intelligence technology, and in particular, to a text emotion recognition method, device, electronic device, and computer non-volatile readable storage medium.
  • Emotion recognition of text is an important task, such as recognizing the emotion of service evaluations made by users or recognizing and classifying the emotion of Internet articles, so as to better understand user demands or achieve precise targeting and recommendation of text, among other beneficial effects.
  • an object of the present application is to provide a text emotion recognition method, device, electronic device, and computer non-volatile readable storage medium.
  • A text emotion recognition method comprises: acquiring a sample text set, the sample text set including a plurality of sample texts and an emotion classification label corresponding to each of the sample texts; correcting an initial cost according to the number distribution of the emotion classification labels in the sample text set to obtain a corrected cost; training a boosting-algorithm learning model through the sample text set and the corrected cost to obtain a text emotion recognition model; and recognizing a text to be recognized through the text emotion recognition model to obtain an emotion recognition result for the text to be recognized.
  • A text emotion recognition device comprises: a sample acquisition module for acquiring a sample text set, the sample text set including a plurality of sample texts and an emotion classification label corresponding to each of the sample texts;
  • a cost correction module for correcting the initial cost according to the number distribution of the emotion classification labels in the sample text set, to obtain the corrected cost;
  • a model acquisition module for training a boosting-algorithm learning model through the sample text set and the corrected cost, to obtain a text emotion recognition model; and
  • a target recognition module for recognizing the text to be recognized through the text emotion recognition model, to obtain the emotion recognition result of the text to be recognized.
  • A text emotion recognition device includes a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, implement the text emotion recognition method described above.
  • A computer non-volatile readable storage medium has stored thereon a computer program which, when executed by a processor, implements the text emotion recognition method described above.
  • A text emotion recognition model is trained on the acquired sample text set using corrected cost weights derived from the number distribution of sample texts of different emotions, and emotion recognition is then performed on the text to be recognized through the text emotion recognition model.
  • The initial cost is corrected according to the number distribution of sample texts of different emotions, so that the corrected cost can balance the deviation in the numbers of sample texts of different emotions, which can improve both the accuracy and the balance of the text emotion recognition model's recognition rates for texts of different emotions.
  • FIG. 1 schematically shows a flowchart of a text emotion recognition method in this exemplary embodiment
  • FIG. 2 schematically shows a sub-flow diagram of a text emotion recognition method in this exemplary embodiment
  • FIG. 3 schematically shows a sub-flow diagram of another text emotion recognition method in this exemplary embodiment
  • FIG. 4 schematically shows a structural block diagram of a text emotion recognition device in this exemplary embodiment
  • FIG. 5 shows a block diagram of an electronic device for implementing the above method according to an exemplary embodiment
  • FIG. 6 shows a schematic diagram of a computer non-volatile readable storage medium for implementing the above method according to an exemplary embodiment.
  • Example embodiments will now be described more fully with reference to the drawings.
  • The example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; on the contrary, these embodiments are provided so that the disclosure will be more comprehensive and complete and will fully convey the concept of the example embodiments to those skilled in the art.
  • the described features, structures, or characteristics may be combined in one or more embodiments in any suitable manner.
  • Exemplary embodiments of the present disclosure first provide a text emotion recognition method, where text generally refers to information in text form.
  • Speech information can also be converted into text by a suitable tool for emotion recognition; emotion recognition may classify and judge the emotional state conveyed by the text, such as whether the sentiment of the text is positive or negative, commendatory or derogatory, and so on.
  • the text emotion recognition method may include the following steps S110-S140:
  • Step S110 Acquire a sample text set, where the sample text set includes a plurality of sample texts and emotion classification tags corresponding to the sample texts.
  • the sample text may be text extracted from the corpus of a specific application scenario, and may generally cover various types of text in the corpus. According to the needs of text emotion recognition in this application scenario, the sample text can be labeled with emotion classification to obtain the emotion classification label.
  • For example, in a scenario of identifying e-commerce consumers' emotions about product evaluations, it is usually necessary to classify emotions as positive and negative: a large number of sample texts can be extracted from the evaluation text and labeled one by one as positive-emotion or negative-emotion texts. As another example, when identifying the emotions in social network users' personal posts, it is usually necessary to classify the emotions into categories such as "happy", "frustrated", "angry", and "sad"; for the sample text "the weather is so nice", its emotion classification label can be marked as "happy", and for the sample text "really bad luck today", its label can be marked as "frustrated".
  • the specific content of the emotion classification label is not particularly limited.
  • Step S120: The initial cost is corrected according to the number distribution of the emotion classification labels in the sample text set, to obtain the corrected cost.
  • cost is a concept in cost-sensitive learning, reflecting the severity of consequences caused by misidentification.
  • the initial cost may be a parameter determined from the application scenario and considering the cost of misrecognizing the sentiment of the text.
  • The initial costs of misrecognizing texts of different emotion types are usually different; in different application scenarios, the initial cost of misrecognizing text of the same emotion type may also differ. For example, when using a positive-evaluation system to assess customer service agents, more attention is generally paid to the positive emotional evaluations given by customers, in order to encourage and praise excellent customer service personnel.
  • In that case, the initial cost of misrecognizing positive emotional text as negative emotional text is high, while the initial cost of misrecognizing negative emotional text as positive emotional text is low. When evaluating e-commerce products, more attention is usually paid to the negative emotional evaluations given by consumers, in order to improve product quality.
  • Accordingly, in that scenario the initial cost of misrecognizing negative emotional text as positive emotional text is higher, and the initial cost of misrecognizing positive emotional text as negative emotional text is lower.
  • the distribution of the number of sentiment classification labels reflects the imbalance of the sample texts of different emotions, which can be quantitatively expressed by one or more indicators such as the ratio, variance, or standard deviation between the sample texts of different emotions, for example:
  • For example, the number distribution of emotion classification labels in a sample text set may be a 4:1 ratio; equivalently, the distribution reflects that "positive" labels account for 4/5 of all emotion classification labels and "negative" labels account for 1/5.
  • A variance or standard deviation is also commonly used to represent the number distribution of emotion classification labels; this embodiment is not particularly limited in this respect.
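As a concrete illustration of counting the number distribution, consider the following sketch (the function and variable names are illustrative, not from the patent):

```python
from collections import Counter

def label_distribution(labels):
    """Summarise the number distribution of emotion classification labels:
    per-label counts and the fraction each label takes of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    fractions = {label: n / total for label, n in counts.items()}
    return dict(counts), fractions

# A 4:1 positive/negative label set, as in the example above
counts, fracs = label_distribution(["positive"] * 4000 + ["negative"] * 1000)
# counts -> {"positive": 4000, "negative": 1000}
# fracs  -> {"positive": 0.8, "negative": 0.2}
```

The returned fractions correspond directly to the 4/5 and 1/5 proportions described above.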
  • The initial costs of texts of different emotion types can be corrected through a specific function or formula, combined with the desired direction of correction, to obtain the corrected cost. For example, if the proportion or number of positive sample texts is low, the initial cost of positive emotional text can be corrected so that it has a higher cost weight; if the proportion or number of negative sample texts is low, the initial cost of negative emotional text can be corrected so that it has a higher cost weight.
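Formulas (1) to (3) themselves do not appear in this extract, so the sketch below assumes one plausible form of the correction that is consistent with the description: the sample deviation ratio is taken as the class-count ratio, and the exponential parameter a controls the degree of correction. The exact formula and all names here are assumptions, not the patent's published equations.

```python
def corrected_costs(cost10, cost01, q1, q0, a=0.5):
    """Correct the initial misrecognition costs by the class-count imbalance.

    cost10: initial cost of misrecognizing positive emotion text as negative.
    cost01: initial cost of misrecognizing negative emotion text as positive.
    q1, q0: numbers of positive and negative sample texts (Q1, Q0).
    a: exponential parameter; a larger a means a stronger correction.

    Assumed form (hypothetical, not the patent's published formulas (1)-(3)):
    the sample deviation ratio R10 = Q0 / Q1, and each cost is scaled by
    R10 raised to +a or -a, so the scarcer class becomes costlier to miss.
    """
    r10 = q0 / q1                   # assumed sample deviation ratio R10
    costm10 = cost10 * (r10 ** a)   # grows when positive texts are scarce
    costm01 = cost01 * (r10 ** -a)  # shrinks correspondingly
    return costm10, costm01

# With a 4:1 negative-to-positive imbalance and a = 1:
# corrected_costs(1.0, 1.0, 1000, 4000, a=1.0) -> (4.0, 0.25)
```

With a between 0 and 1 the correction is softened, matching the described role of the exponential parameter.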
  • Step S130: Train a boosting-algorithm learning model through the sample text set and the corrected cost to obtain a text emotion recognition model.
  • The boosting-algorithm learning model can be applied in scenarios where the accuracy of a weak classification algorithm needs to be improved.
  • With the corrected cost, the boosting-algorithm learning model can set different sampling weights for different sample texts, so that the model pays more attention to the sample texts whose corrected cost is high.
  • The boosting-algorithm learning model may be any of a variety of models, for example a gradient boosting decision tree model, an AdaBoost model, or an XGBoost model.
  • The training process can include: the boosting-algorithm learning model takes the sample texts as input, outputs an emotion classification result for each sample text, and compares the classification result with the emotion classification label; the comparison results are then weighted by the corrected cost to obtain the model's recognition accuracy; the model parameters are adjusted iteratively until the accuracy reaches a set standard, at which point training can be considered complete.
  • The trained boosting-algorithm learning model is the text emotion recognition model.
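The patent does not fix how the corrected cost enters the booster; one common cost-sensitive choice, shown here as a hedged sketch, is to fold the corrected costs into the initial sampling weights so the learner attends more to the costlier class. All names are illustrative.

```python
def initial_sample_weights(labels, costm10, costm01):
    """Cost-sensitive initial sampling weights for a boosting learner.

    labels: 1 for a positive emotion sample text, 0 for a negative one.
    A sample whose misrecognition is costlier receives a larger weight,
    so the booster pays more attention to it. This weighting scheme is an
    assumption for illustration; the patent does not specify one here.
    """
    raw = [costm10 if y == 1 else costm01 for y in labels]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1

# One positive text among three negatives, with corrected costs 4.0 / 0.25:
weights = initial_sample_weights([1, 0, 0, 0], costm10=4.0, costm01=0.25)
```

Such a weight vector could then be passed as the per-sample weight argument that typical boosting implementations accept.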
  • Step S140 Recognize the text to be recognized through the text emotion recognition model to obtain the emotion recognition result of the text to be recognized.
  • the text emotion recognition model completed through the above training can recognize the text to be recognized, and the emotion recognition result is the emotion classification result of the text to be recognized.
  • the emotion recognition result may be a positive emotion text or a negative emotion text.
  • A text emotion recognition model is trained and obtained, and emotion recognition is then performed on the text to be recognized through the text emotion recognition model.
  • The initial cost is corrected according to the number distribution of sample texts of different emotions, so that the corrected cost can balance the deviation in the numbers of sample texts of different emotions, which can improve both the accuracy and the balance of the text emotion recognition model's recognition rates for texts of different emotions.
  • The emotion classification labels may include positive emotion text and negative emotion text.
  • Step S120 can be implemented by the following steps: obtaining the initial costs cost10 and cost01, where cost10 is the initial cost of misrecognizing positive emotion text as negative emotion text and cost01 is the initial cost of misrecognizing negative emotion text as positive emotion text;
  • counting the number Q1 of positive emotion texts and the number Q0 of negative emotion texts in the sample text set;
  • and correcting the initial cost according to formulas (1) to (3) to obtain the corrected cost, where R10 is the sample deviation ratio, costm10 is the corrected cost of misrecognizing positive emotion text as negative emotion text, costm01 is the corrected cost of misrecognizing negative emotion text as positive emotion text, and a is the exponential parameter.
  • Sample texts with different emotion classifications in the sample text set have different initial costs and corrected costs.
  • emotion classification labels are positive emotion text and negative emotion text
  • "0" can be used to indicate negative emotions
  • "1” can be used to indicate positive emotions.
  • The obtained initial costs cost10 and cost01 respectively represent the initial cost of misrecognizing positive emotional text as negative emotional text and the initial cost of misrecognizing negative emotional text as positive emotional text.
  • The corrected cost can be calculated by formulas (1), (2), and (3). The exponential parameter a reflects the degree of correction: the greater a is, the higher the degree of correction. Generally a lies between 0 and 1, and its value can be set according to experience and actual use.
  • The initial cost can also be corrected by calculating the deviation ratio of sample texts of different emotion classifications: if the number of negative emotion texts is Q0 and the number of positive emotion texts is Q1, the deviation ratio of negative emotions can be calculated from these counts.
  • In that case, the corrected cost of positive emotional text can also be lower than its initial cost, while the corrected cost of negative emotional text is higher than its initial cost.
  • step S130 may include the following steps:
  • Step S202: The training subset T is used to train the boosting-algorithm learning model.
  • Step S203: The emotion recognition result f(xi) of each sample text xi in the verification subset D is obtained through the boosting-algorithm learning model.
  • Step S204: Calculate the error rate of the boosting-algorithm learning model according to formula (4):
  • E = (1/m) [ Σ_{xi ∈ D+} costm10 · II(f(xi) ≠ yi) + Σ_{xi ∈ D-} costm01 · II(f(xi) ≠ yi) ]  (4)
  • Step S205: If the error rate is lower than the learning threshold, it is determined that training of the boosting-algorithm learning model is complete, and the trained boosting-algorithm learning model is determined to be the text emotion recognition model.
  • where m is the number of sample texts in the verification subset and i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the positive-emotion sample text subset of the verification subset D; D- is the negative-emotion sample text subset of D; and yi is the emotion classification label of the sample text xi.
  • The boosting-algorithm learning model can take the training subset as input, output the emotion classification results of the sample texts in the training subset, adjust the model parameters, and continue training; whether the model meets the requirement can then be verified on the verification subset by calculating the model's error rate with formula (4).
  • II(·) is the indicator function, whose value is 1 when the bracketed condition is true and 0 when it is false.
  • If the model's output equals the emotion classification label, the error index of xi is 0; if the output differs from the label, the error index of xi is costm10 (when xi is a positive sample text) or costm01 (when xi is a negative sample text). Taking the arithmetic mean of the error indices of all sample texts in D gives the model's error rate E; the lower E is, the better the training effect of the boosting-algorithm learning model.
  • A learning-threshold judgment mechanism can be set to judge whether the error rate of the boosting-algorithm learning model is within an acceptable range: if the calculated error rate is lower than the learning threshold, it is judged that model training is complete and the text emotion recognition model is obtained; if the calculated error rate is equal to or higher than the learning threshold, the model fails verification and training can continue.
  • the learning threshold can be set according to experience or actual usage, and this embodiment does not limit its specific value.
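The cost-weighted error rate E described above (error index 0 for a correct result, costm10 for a misrecognized positive sample, costm01 for a misrecognized negative sample, averaged over the m samples in D) can be sketched as follows; the names are illustrative:

```python
def cost_weighted_error(preds, labels, costm10, costm01):
    """Cost-weighted error rate E over a verification subset.

    Follows the description of formula (4): a correct result contributes 0;
    a misrecognized positive sample (label 1) contributes costm10; a
    misrecognized negative sample (label 0) contributes costm01; E is the
    arithmetic mean of these error indices over all m samples in D.
    """
    m = len(labels)
    total = 0.0
    for f_xi, y_i in zip(preds, labels):
        if f_xi != y_i:  # indicator II(f(xi) != yi)
            total += costm10 if y_i == 1 else costm01
    return total / m

# One misrecognized negative (0.25) and one misrecognized positive (4.0)
# among m = 4 samples: E = (0.25 + 4.0) / 4 = 1.0625
E = cost_weighted_error([1, 1, 0, 0], [1, 0, 0, 1], costm10=4.0, costm01=0.25)
```

Comparing E against the learning threshold then implements the step S205 check.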
  • the text emotion recognition method may further include the following steps:
  • where s is the number of positive-emotion sample texts in the verification subset D (the number of sample texts in D+), v is the number of negative-emotion sample texts in D (the number of sample texts in D-), and m = s + v.
  • The positive-sample error rate E+ and the negative-sample error rate E- of the boosting-algorithm learning model can be calculated according to formulas (5) and (6), respectively: E+ is the error rate obtained by verifying the model on the positive sample text subset D+, i.e. the error rate for recognizing positive sample texts, and E- is the error rate obtained by verifying the model on the negative sample text subset D-, i.e. the error rate for recognizing negative sample texts.
  • The error rate calculated by the above formula (4) is the error rate for the overall recognition of positive and negative sample texts; the error rate of the boosting-algorithm learning model can also be calculated by verifying it on the whole sample text subset D, which is consistent with the error rate calculated by formula (4).
  • From E+ and E-, the error-rate ratio A of the boosting-algorithm learning model can be calculated.
  • A reflects the imbalance between the model's error rates for recognizing sample texts of different emotions.
  • When A is close to 1, the error rates of the model for recognizing positive and negative sample texts are balanced; when A differs greatly from 1, whether greater than or less than 1, the model's recognition rates for positive and negative sample texts are highly unbalanced and the training has not yet met the requirement.
  • This embodiment means that, before determining whether the error rate of the boosting-algorithm learning model meets the requirement, it is first determined whether the model's error rates for recognizing sample texts of different emotions are balanced; if the balance meets the requirement, the method continues to determine whether the error rate meets the requirement.
  • A preset range can be set to measure whether the error-rate balance meets the requirement.
  • If the error-rate ratio is within the preset range, the balance meets the requirement, and the method can continue to determine whether the error rate meets the learning-threshold standard.
  • For example, if the calculated error-rate ratio A falls within the preset range, this degree of imbalance can be accepted, and the method continues to detect whether the error rate is below the learning threshold.
  • An index B can also be used to quantitatively express the degree of imbalance in the error rates of the boosting-algorithm learning model for recognizing sample texts of different emotions: B = 0 indicates complete balance, and the larger B is, the worse the balance, so a threshold on B can be set to measure whether the model's error-rate balance meets the requirement.
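The per-class error rates of formulas (5) and (6) and the balance check on the error-rate ratio A can be sketched as follows, assuming E+ and E- average the cost-weighted error indices over D+ and D- respectively (an assumption mirroring the description of formula (4)); the preset-range bounds are illustrative:

```python
def class_error_rates(preds, labels, costm10, costm01):
    """Per-class cost-weighted error rates, sketching formulas (5) and (6).

    e_pos averages the costm10-weighted error indices over the s positive
    samples (subset D+); e_neg averages the costm01-weighted error indices
    over the v negative samples (subset D-). The averaging form is an
    assumption for illustration, not the patent's published equations.
    """
    pos = [(f, y) for f, y in zip(preds, labels) if y == 1]
    neg = [(f, y) for f, y in zip(preds, labels) if y == 0]
    e_pos = sum(costm10 for f, y in pos if f != y) / len(pos)
    e_neg = sum(costm01 for f, y in neg if f != y) / len(neg)
    return e_pos, e_neg

def is_balanced(e_pos, e_neg, low=0.5, high=2.0):
    """Check whether the error-rate ratio A = E+/E- lies in a preset range.
    The bounds 0.5 and 2.0 are illustrative, not taken from the patent."""
    a = e_pos / e_neg
    return low <= a <= high
```

If `is_balanced` returns False, training continues with the training subset T, matching the flow described above.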
  • the text emotion recognition method further includes the following steps:
  • The training subset T is used again to train the boosting-algorithm learning model.
  • FIG. 3 shows a flow chart of a text emotion recognition model training in this exemplary embodiment.
  • First, the sample deviation ratio is calculated for the sample text set, and the corrected cost is calculated from the sample deviation ratio to train the boosting-algorithm learning model. Then the error-rate ratio and the error rate of the trained model are calculated and judged in turn: if the error-rate ratio is not within the preset range, the method returns to the model training step to continue training the boosting-algorithm learning model; if the error-rate ratio is within the preset range, the method continues to judge whether the error rate is lower than the learning threshold. Further, if the error rate is equal to or higher than the learning threshold, the method returns to the model training step to continue training; if the error rate is lower than the learning threshold, model training can be considered complete and the text emotion recognition model is obtained.
  • The emotion classification labels may include: level 1 positive emotion text, level 2 positive emotion text, ..., level n positive emotion text, and level 1 negative emotion text, level 2 negative emotion text, ..., level n negative emotion text, where n is an integer greater than 1.
  • The emotions of the sample texts can be classified into positive emotions and negative emotions; further, positive emotions and negative emotions can each be divided into levels 1 through n as above.
  • the sentiment classification tags may also include neutral sentiment text, etc., which is not specifically limited here.
  • the apparatus 400 may include a sample acquisition module 410, a cost correction module 420, a model acquisition module 430, and a target recognition module 440.
  • the sample acquisition module 410 is used to acquire a sample text set
  • the sample text set includes multiple sample texts and the sentiment classification tags corresponding to each sample text
  • the cost correction module 420 is used to correct the initial cost according to the number distribution of the emotion classification labels in the sample text set, to obtain the corrected cost;
  • the model acquisition module 430 is used to train a boosting-algorithm learning model through the sample text set and the corrected cost, to obtain the text emotion recognition model;
  • the target recognition module 440 is used to recognize the text to be recognized through the text emotion recognition model, to obtain the emotion recognition result of the text to be recognized.
  • the sentiment classification label includes positive sentiment text and negative sentiment text
  • the cost correction module may include: an initial cost acquisition unit for obtaining the initial costs cost10 and cost01, where cost10 is the initial cost of misrecognizing positive emotion text as negative emotion text and cost01 is the initial cost of misrecognizing negative emotion text as positive emotion text; a text statistics unit for counting the number Q1 of positive emotion texts and the number Q0 of negative emotion texts in the sample text set; and a cost correction unit for correcting the initial cost by the following formulas to obtain the corrected cost:
  • where R10 is the sample deviation ratio, costm10 is the corrected cost of misrecognizing positive emotion text as negative emotion text, costm01 is the corrected cost of misrecognizing negative emotion text as positive emotion text, and a is the exponential parameter.
  • a judgment unit for determining, when the error rate is lower than the learning threshold, that training of the boosting-algorithm learning model is complete, and for determining the trained boosting-algorithm learning model to be the text emotion recognition model;
  • where m is the number of sample texts in the verification subset and i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the positive-emotion sample text subset of the verification subset D; D- is the negative-emotion sample text subset of D; and yi is the emotion classification label of the sample text xi.
  • The calculation unit may also be used to calculate the positive-sample error rate E+ and the negative-sample error rate E- of the boosting-algorithm learning model according to formulas (5) and (6), respectively.
  • The judgment unit can also be used to continue detecting whether the error rate is lower than the learning threshold when the error-rate ratio is within the preset range.
  • where s is the number of positive-emotion sample texts in the verification subset D, v is the number of negative-emotion sample texts in D, and m = s + v.
  • The training unit can also be used to train the boosting-algorithm learning model again using the training subset T if the error-rate ratio is not within the preset range;
  • the calculation unit can also be used to recalculate the error-rate ratio of the boosting-algorithm learning model; and
  • the judgment unit can also be used to detect again whether the error-rate ratio is within the preset range.
  • The emotion classification labels may include level 1 positive emotion text, level 2 positive emotion text, ..., level n positive emotion text and level 1 negative emotion text, level 2 negative emotion text, ..., level n negative emotion text, where n is an integer greater than 1.
  • The boosting-algorithm learning model may include a gradient boosting decision tree model, an AdaBoost model, or an XGBoost model.
  • Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
  • the electronic device 500 according to this exemplary embodiment of the present disclosure is described below with reference to FIG. 5.
  • the electronic device 500 shown in FIG. 5 is only an example, and should not bring any limitation to the functions and use scope of the embodiments of the present disclosure.
  • the electronic device 500 is represented in the form of a general-purpose computing device.
  • The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one storage unit 520, a bus 530 connecting the different system components (including the storage unit 520 and the processing unit 510), and a display unit 540.
  • the storage unit stores the program code
  • the program code may be executed by the processing unit 510, so that the processing unit 510 executes the steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
  • the processing unit 510 may execute steps S110 to S140 shown in FIG. 1, or may execute steps S201 to S205 shown in FIG. 2, and so on.
  • the storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 521 and / or a cache storage unit 522, and may further include a read-only storage unit (ROM) 523.
  • the storage unit 520 may further include a program / utility tool 524 having a set of (at least one) program modules 525.
  • The program modules 525 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
  • The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus architectures.
  • The electronic device 500 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 550. Moreover, the electronic device 500 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 560.
  • The network adapter 560 communicates with the other modules of the electronic device 500 through the bus 530. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • The example embodiments described herein can be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, USB flash drive, removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, server, terminal device, or network device, etc.) to perform the method according to an exemplary embodiment of the present disclosure.
  • Exemplary embodiments of the present disclosure also provide a computer non-volatile readable storage medium on which is stored a program product capable of implementing the above method of this specification.
  • various aspects of the present disclosure may also be implemented in the form of a program product, which includes program code, and when the program product runs on the terminal device, the program code is used to cause the terminal device to execute The steps according to various exemplary embodiments of the present disclosure described in the "Exemplary Method" section.
  • A program product 600 for implementing the above method according to an exemplary embodiment of the present disclosure is described; it may take the form of a portable compact disc read-only memory (CD-ROM), includes program code, and can run on a terminal device such as a personal computer.
  • the program product of the present disclosure is not limited thereto, and in this document, the readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus, or device.
  • the program product may use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
• the readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • the computer-readable signal medium may include a data signal that is transmitted in baseband or as part of a carrier wave, in which readable program code is carried.
  • This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.

Abstract

The present application relates to the technical field of artificial intelligence, and provides a text emotion recognition method and apparatus, an electronic device, and a computer non-volatile readable storage medium. The method comprises: acquiring a sample text set, the sample text set comprising multiple sample texts and emotion classification labels corresponding to the sample texts; performing a correction calculation on an initial cost according to the number distribution of the emotion classification labels in the sample text set to obtain a corrected cost; training a boosting learning model by means of the sample text set and the corrected cost to obtain a text emotion recognition model; and recognizing a text to be recognized by means of the text emotion recognition model to obtain the emotion recognition result of said text. By means of the present application, the accuracy and balance of recognition for texts of different emotion categories can be improved, the recognition effect is improved, and the method has strong applicability. (FIG. 3)

Description

Text emotion recognition method, apparatus, electronic device, and computer non-volatile readable storage medium

This application claims priority to Chinese patent application No. 201811244553.6, entitled "Text Emotion Recognition Method, Apparatus, Electronic Device, and Computer Non-Volatile Readable Storage Medium", filed on October 24, 2018, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular, to a text emotion recognition method and apparatus, an electronic device, and a computer non-volatile readable storage medium.
Background

With the development of computer technology, more and more Internet companies are committed to improving service quality by analyzing big data. Emotion recognition of text is an important task in this effort, for example, recognizing the emotion of service evaluations made by users, or recognizing and classifying the emotions of Internet articles, so as to better understand user demands or to achieve beneficial effects such as precise targeting and recommendation of texts.

Most existing text emotion recognition methods use conventional machine learning models and rely on sample texts from a specific corpus to train the model. However, the inventors of the present application realized that in many corpora the sample texts of different emotions are present in imbalanced proportions. For example, in a scenario of recognizing the emotions of e-commerce consumers' product reviews, the number of positive reviews is usually far greater than the number of negative reviews. This imbalance in the sample texts causes the trained machine learning model to recognize positive emotion texts more accurately than negative emotion texts, which degrades the effect of text emotion recognition.
Summary of the Invention

In order to solve the above technical problems, an object of the present application is to provide a text emotion recognition method and apparatus, an electronic device, and a computer non-volatile readable storage medium.

The technical solutions adopted in the present application are as follows:

In one aspect, a text emotion recognition method includes: acquiring a sample text set, the sample text set including a plurality of sample texts and an emotion classification label corresponding to each sample text; performing a correction calculation on an initial cost according to the number distribution of the emotion classification labels in the sample text set to obtain a corrected cost; training a boosting learning model with the sample text set and the corrected cost to obtain a text emotion recognition model; and recognizing a text to be recognized through the text emotion recognition model to obtain an emotion recognition result of the text to be recognized.
In another aspect, a text emotion recognition apparatus includes: a sample acquisition module, configured to acquire a sample text set, the sample text set including a plurality of sample texts and an emotion classification label corresponding to each sample text; a cost correction module, configured to perform a correction calculation on an initial cost according to the number distribution of the emotion classification labels in the sample text set to obtain a corrected cost; a model acquisition module, configured to train a boosting learning model with the sample text set and the corrected cost to obtain a text emotion recognition model; and a target recognition module, configured to recognize a text to be recognized through the text emotion recognition model to obtain an emotion recognition result of the text to be recognized.

In another aspect, a text emotion recognition device includes a processor and a memory, the memory storing computer-readable instructions which, when executed by the processor, implement the text emotion recognition method described above.

In another aspect, a computer non-volatile readable storage medium has a computer program stored thereon, and the computer program, when executed by a processor, implements the text emotion recognition method described above.
In the above technical solutions, a text emotion recognition model is trained based on the acquired sample text set and corrected cost weights derived from the number distribution of sample texts of different emotions, and the text to be recognized is then subjected to emotion recognition through the text emotion recognition model. First, correcting the initial cost according to the number distribution of sample texts of different emotions allows the corrected cost to balance the deviation in the numbers of sample texts of different emotions, which improves the balance of the model's recognition accuracy across texts of different emotions and improves the text emotion recognition effect. Second, when training the boosting learning model, the corrected cost guides the model's preferences, strengthening its attention to sample texts with higher corrected costs, thereby accelerating the training process and achieving a better training effect. Third, this embodiment imposes no special requirements on the corpus of the application scenario, and the corrected cost can be adjusted to meet the needs of different scenarios, giving the text emotion recognition method of this embodiment strong applicability.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings

The drawings here are incorporated into and constitute a part of this specification, show embodiments consistent with the present application, and serve together with the specification to explain the principles of the present application.
FIG. 1 schematically shows a flowchart of a text emotion recognition method in this exemplary embodiment;

FIG. 2 schematically shows a sub-flowchart of a text emotion recognition method in this exemplary embodiment;

FIG. 3 schematically shows a sub-flowchart of another text emotion recognition method in this exemplary embodiment;

FIG. 4 schematically shows a structural block diagram of a text emotion recognition apparatus in this exemplary embodiment;

FIG. 5 shows a block diagram of an electronic device for implementing the above method according to an exemplary embodiment;

FIG. 6 shows a schematic diagram of a computer non-volatile readable storage medium for implementing the above method according to an exemplary embodiment.
The above drawings show explicit embodiments of the present application, which are described in more detail below. These drawings and textual descriptions are not intended to limit the scope of the present application's concept in any way, but rather to explain the concept of the present application to those skilled in the art by reference to specific embodiments.

Detailed Description

Exemplary embodiments will be described in detail here, examples of which are shown in the drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.

Example embodiments will now be described more fully with reference to the drawings. However, example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be more thorough and complete, and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in one or more embodiments in any suitable manner.
Exemplary embodiments of the present disclosure first provide a text emotion recognition method, where text generally refers to information in written form; in this embodiment, speech information may also be converted into text by a specific tool before emotion recognition. Emotion recognition may be a classification judgment on the emotional state conveyed by the text, for example, whether the emotion of the text is positive or negative, commendatory or derogatory.

This exemplary embodiment is further described below with reference to FIG. 1. As shown in FIG. 1, the text emotion recognition method may include the following steps S110 to S140:
Step S110: acquire a sample text set, where the sample text set includes a plurality of sample texts and the emotion classification label corresponding to each sample text.

The sample texts may be texts extracted from the corpus of a specific application scenario, and usually cover the various types of text in that corpus. According to the text emotion recognition needs of the application scenario, the sample texts can be annotated with emotion classifications to obtain emotion classification labels. For example, in a scenario of recognizing the emotions of e-commerce consumers' product reviews, emotions usually need to be classified as positive or negative; a large number of sample texts can be extracted from the review texts and labeled one by one as positive emotion texts or negative emotion texts. As another example, when recognizing the emotions of social network users' personal posts, emotions usually need to be classified into multiple categories such as "happy", "depressed", "angry", and "sad"; the sample text "The weather is wonderful" can be labeled "happy", and the sample text "What bad luck today" can be labeled "depressed". This embodiment does not specifically limit the content of the emotion classification labels.

Step S120: perform a correction calculation on an initial cost according to the number distribution of the emotion classification labels in the sample text set to obtain a corrected cost.
Cost is a concept in cost-sensitive learning that reflects the severity of the consequences of misrecognition. The initial cost may be a parameter determined from the application scenario in view of the cost of misrecognizing the emotion of a text. In the same application scenario, the initial costs of misrecognizing texts of different emotion types are usually different; in different application scenarios, the initial cost of misrecognizing text of the same emotion type may also differ. For example, when a praise-based system is used to evaluate customer service agents, more attention is generally paid to the positive emotional evaluations given by customers, in order to encourage and commend excellent agents; in this scenario, the initial cost of misrecognizing a positive emotion text as a negative emotion text is high, while the initial cost of misrecognizing a negative emotion text as a positive emotion text is low. When evaluating e-commerce products, more attention is usually paid to the negative emotional evaluations given by consumers, in order to improve product quality; in this scenario, the initial cost of misrecognizing a negative emotion text as a positive emotion text is high, while the initial cost of misrecognizing a positive emotion text as a negative emotion text is low.

In the sample text set, the number distribution of the emotion classification labels reflects the imbalance among the sample texts of different emotions, and can be quantitatively expressed by one or more indicators such as the ratio, variance, or standard deviation between sample texts of different emotions. For example, if a sample text set contains 80,000 "positive" emotion classification labels and 20,000 "negative" emotion classification labels, the number distribution of emotion classification labels in the sample set may be 4:1; equivalently, the distribution shows that "positive" labels account for 4/5 of all emotion classification labels and "negative" labels account for 1/5. In multi-class scenarios, the variance or standard deviation is usually used to represent the number distribution of the emotion classification labels. This embodiment is not particularly limited in this respect.
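As a sketch, the indicators named above can be computed directly from the label counts of the 80,000/20,000 example (the label names here are illustrative):

```python
from collections import Counter
import statistics

# 80,000 "positive" and 20,000 "negative" labels, as in the example above
labels = ["positive"] * 80000 + ["negative"] * 20000
counts = Counter(labels)

ratio = counts["positive"] / counts["negative"]   # 4.0, i.e. a 4:1 distribution
share_pos = counts["positive"] / len(labels)      # positive labels are 4/5 of the total
stdev = statistics.pstdev(counts.values())        # dispersion indicator for multi-class cases
print(ratio, share_pos, stdev)                    # 4.0 0.8 30000.0
```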
According to the above number distribution of emotion classification labels, the initial costs of texts of different emotion types can be corrected through a specific function or formula, and the corrected cost can be obtained in combination with the desired direction of correction. For example, if the proportion or number of positive sample texts is low, the initial cost of positive emotion texts can be corrected to give them a higher cost weight; if the proportion or number of negative sample texts is low, the initial cost of negative emotion texts can be corrected to give them a higher cost weight.
Step S130: train a boosting learning model with the sample text set and the corrected cost to obtain a text emotion recognition model.

A boosting learning model can be applied in scenarios where the accuracy of weak classification algorithms needs to be improved. In this embodiment, the boosting learning model can assign different sampling weights to sample texts with different accuracy rates, so that the model focuses more on sample texts with higher corrected costs. Boosting learning models include a variety of models, for example, the gradient boosting decision tree model, the AdaBoost model, or the XGBoost model.

The training process may include: the boosting learning model takes the sample texts as input, outputs emotion classification results for the sample texts, and compares the emotion classification results with the emotion classification labels; the comparison results are then weighted by the corrected costs to compute the recognition accuracy of the model; and the model parameters are adjusted iteratively until the accuracy reaches a certain standard, at which point training can be considered complete. The trained boosting learning model is the text emotion recognition model.
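The training loop described above can be sketched as a miniature cost-sensitive AdaBoost over decision stumps on a single numeric feature. This is an illustrative toy, not the patent's exact algorithm (a real model would be a GBDT/AdaBoost/XGBoost learner over text features): the corrected costs enter as the initial sample weights, so misclassifying a high-cost sample is penalized more from the very first round.

```python
import math

def train_boosted_stumps(xs, ys, costs, rounds=5):
    """Toy cost-sensitive AdaBoost: xs are 1-D feature values, ys are labels
    in {0, 1}, and costs are the per-sample corrected costs (e.g. costm10 for
    positive samples, costm01 for negative ones)."""
    total = sum(costs)
    w = [c / total for c in costs]               # cost-weighted initial sample weights
    ensemble = []                                # (alpha, threshold, polarity) triples
    for _ in range(rounds):
        best = None
        for t in sorted(set(xs)):                # try every stump "x > t" and its inverse
            for pol in (True, False):
                preds = [int((x > t) == pol) for x in xs]
                err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, t, pol, preds)
        err, t, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # clamp to keep alpha finite
        alpha = 0.5 * math.log((1 - err) / err)  # stump weight, as in AdaBoost
        ensemble.append((alpha, t, pol))
        # boost the weights of misclassified samples for the next round
        w = [wi * math.exp(alpha if p != y else -alpha)
             for wi, p, y in zip(w, preds, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (1 if ((x > t) == pol) else -1) for a, t, pol in ensemble)
    return 1 if score > 0 else 0

# Toy data: feature values above 3 are "positive" (label 1)
ens = train_boosted_stumps([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1], [1, 1, 2, 2, 1, 1])
print(predict(ens, 5.5), predict(ens, 0.5))  # 1 0
```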
Step S140: recognize the text to be recognized through the text emotion recognition model to obtain the emotion recognition result of the text to be recognized.

The text emotion recognition model obtained through the above training can recognize the text to be recognized; the emotion recognition result is the emotion classification result of the text to be recognized, for example, positive emotion text or negative emotion text.

Based on the above description, in this exemplary embodiment, a text emotion recognition model is trained based on the acquired sample text set and corrected cost weights derived from the number distribution of sample texts of different emotions, and the text to be recognized is then subjected to emotion recognition through the text emotion recognition model. First, correcting the initial cost according to the number distribution of sample texts of different emotions allows the corrected cost to balance the deviation in the numbers of sample texts of different emotions, which improves the balance of the model's recognition accuracy across texts of different emotions and improves the text emotion recognition effect. Second, when training the boosting learning model, the corrected cost guides the model's preferences, strengthening its attention to sample texts with higher corrected costs, thereby accelerating the training process and achieving a better training effect. Third, this embodiment imposes no special requirements on the corpus of the application scenario, and the corrected cost can be adjusted to meet the needs of different scenarios, giving the text emotion recognition method of this embodiment strong applicability.
In an exemplary embodiment, the emotion classification labels may include positive emotion text and negative emotion text, and step S120 may be implemented through the following steps:

Acquire the initial costs cost10 and cost01, where cost10 is the initial cost of misrecognizing a positive emotion text as a negative emotion text, and cost01 is the initial cost of misrecognizing a negative emotion text as a positive emotion text.

Count the number Q1 of positive emotion texts and the number Q0 of negative emotion texts in the sample text set.
Perform a correction calculation on the initial costs through the following formulas to obtain the corrected costs:

R10 = Q1 / Q0  (1)

costm10 = cost10 / (R10)^a  (2)

costm01 = cost01 × (R10)^a  (3)

where R10 is the sample deviation ratio, costm10 is the corrected cost of misrecognizing a positive emotion text as a negative emotion text, costm01 is the corrected cost of misrecognizing a negative emotion text as a positive emotion text, and a is an exponent parameter.
According to the above analysis, sample texts of different emotion classifications in the sample text set have different initial costs and corrected costs. When the emotion classification labels are positive emotion text and negative emotion text, "0" can be used to denote negative emotion and "1" to denote positive emotion; the acquired initial costs cost10 and cost01 then respectively denote the initial cost of misrecognizing a positive emotion text as a negative emotion text and the initial cost of misrecognizing a negative emotion text as a positive emotion text.

Based on the number Q1 of positive emotion texts and the number Q0 of negative emotion texts in the sample text set, the corrected costs can be calculated through formulas (1), (2), and (3). Here a is an exponent parameter reflecting the degree of correction; the larger a is, the higher the degree of correction. Generally 0 < a ≤ 1, and the value of a can be set according to experience and actual usage.

For example, if the number of positive emotion texts Q1 = 80,000 and the number of negative emotion texts Q0 = 20,000, and a = 1/2 is set, then R10 = 4 is obtained from formula (1); substituting into formulas (2) and (3) yields costm10 = 0.5·cost10 and costm01 = 2·cost01. It can be seen that after the correction calculation, the corrected cost of positive emotion text is lower than its initial cost, and the corrected cost of negative emotion text is higher than its initial cost.
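A sketch of formulas (1)–(3) in code (the function name is illustrative), reproducing the worked example above:

```python
def corrected_costs(cost10, cost01, q1, q0, a=0.5):
    """Correct the initial costs by the sample deviation ratio R10 = Q1 / Q0."""
    r10 = q1 / q0                   # formula (1)
    costm10 = cost10 / r10 ** a     # formula (2): errors on the majority class cost less
    costm01 = cost01 * r10 ** a     # formula (3): errors on the minority class cost more
    return costm10, costm01

# Worked example: Q1 = 80,000, Q0 = 20,000, a = 1/2  ->  R10 = 4
print(corrected_costs(1.0, 1.0, 80000, 20000))  # (0.5, 2.0)
```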
In other embodiments, the initial costs may also be corrected by calculating the deviation ratio of the sample texts of different emotion classifications. For example, if in the sample text set the number of negative emotion texts is Q0 and the number of positive emotion texts is Q1, the deviation ratio of negative emotion can be:

R0 = Q0 / ((Q0 + Q1) / 2)

and the corrected costs can be calculated through the formulas:

costm10 = cost10 × R0

costm01 = cost01 / R0

For example, if the number of positive emotion texts Q1 = 80,000 and the number of negative emotion texts Q0 = 20,000, then R0 = 0.4; substituting into the formulas costm10 = cost10 × R0 and costm01 = cost01 / R0 to adjust the initial costs yields costm10 = 0.4·cost10 and costm01 = 2.5·cost01. After this correction calculation, the corrected cost of positive emotion text is likewise lower than its initial cost, and the corrected cost of negative emotion text is higher than its initial cost.
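The alternative deviation-ratio correction can be sketched the same way (again with illustrative names), reproducing the 0.4/2.5 example:

```python
def corrected_costs_alt(cost10, cost01, q1, q0):
    """Correct the initial costs by the negative-class deviation ratio
    R0 = Q0 / ((Q0 + Q1) / 2), i.e. the negative share relative to a balanced split."""
    r0 = q0 / ((q0 + q1) / 2)
    return cost10 * r0, cost01 / r0

print(corrected_costs_alt(1.0, 1.0, 80000, 20000))  # (0.4, 2.5)
```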
In an exemplary embodiment, referring to FIG. 2, step S130 may include the following steps:

Step S201: divide the sample text set into a training subset T and a validation subset D, D = {x1, x2, …, xm}.

Step S202: train the boosting learning model using the training subset T.

Step S203: obtain the emotion recognition result f(xi) of each sample text xi in the validation subset D through the boosting learning model.
Step S204: calculate the error rate of the boosting learning model according to formula (4):

E = (1/m) × [ Σ_{xi∈D+} costm10·Ⅱ(f(xi) ≠ yi) + Σ_{xi∈D-} costm01·Ⅱ(f(xi) ≠ yi) ]  (4)
Step S205: if the error rate is lower than a learning threshold, determine that training of the boosting learning model is complete, and take the trained boosting learning model as the text emotion recognition model.

Here m is the number of sample texts in the validation subset, i ∈ [1, m]; E is the error rate of the boosting learning model; D+ is the subset of positive emotion sample texts of the validation subset D; D- is the subset of negative emotion sample texts of the validation subset D; and yi is the emotion classification label of the sample text xi.
In step S201, the sample text set can be directly divided into two mutually exclusive sets, one serving as the training subset and the other as the validation subset, so that after the model is trained, the validation subset can be used to evaluate its validation error as an estimate of the generalization error. Suppose the sample text set contains 100,000 sample texts; with an 8/2 split, it can be divided into a subset containing 80,000 training sample texts, namely the training subset T, and a subset containing 20,000 validation sample texts, namely the validation subset D, D = {x1, x2, …, xm}, where x1, x2, etc. denote the sample texts in D. The allocation ratio between the training subset and the validation subset can be determined as needed and is not specifically limited here.
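The 8/2 split of step S201 can be sketched as a shuffled split (the seed and names are illustrative):

```python
import random

def split_samples(samples, train_ratio=0.8, seed=42):
    """Divide a sample set into mutually exclusive training and validation subsets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)     # shuffle a copy; the original is untouched
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

t_subset, d_subset = split_samples(list(range(100)))
print(len(t_subset), len(d_subset))  # 80 20
```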
提升算法学习模型可以以训练子集为输入,输出对训练子集中样本文本的情感分类结果,调整模型参数,继续训练模型,然后可以通过验证子集来验证模型是否符合要求,通过公式(4)计算提升算法学习模型的错误率。公式(4)中,Ⅱ(·)为指示函数,在括号内的·为真和假时分别取值为1和0,对于D中的每个样本文本xi,如果模型输出的结果f(xi)与情感分类标签yi相同,则xi的错误指数为0;如果模型输出的结果与情感分类标签不同,则xi的错误指数为costm10(当xi为正面样本文本时)或costm01(当xi为负面样本文本时);对D中的所有样本文本的错误指数取算术平均值,可以得到模型的错误率E。错误率E的值越低表示提升算法学习模型训练的效果越好。The lifting algorithm learning model can take the training subset as input, output the sentiment classification result of the sample text in the training subset, adjust the model parameters, continue training the model, and then can verify whether the model meets the requirements by verifying the subset, by formula (4) The calculation improves the error rate of the algorithm learning model. In formula (4), Ⅱ (·) is the indicator function, and the values in brackets are 1 and 0 when true and false, respectively. For each sample text xi in D, if the model outputs the result f (xi ) Is the same as the sentiment classification label yi, the error index of xi is 0; if the output of the model is different from the sentiment classification label, the error index of xi is costm10 (when xi is positive sample text) or costm01 (when xi is negative Sample text); taking the arithmetic average of the error indices of all sample texts in D, the error rate E of the model can be obtained. The lower the value of the error rate E, the better the effect of improving the algorithm learning model training.
During model training, a learning-threshold check can be set up to judge whether the error rate of the boosting-algorithm learning model is within an acceptable range. If the computed error rate is lower than the learning threshold, the model is judged to be fully trained and the text emotion recognition model is obtained; if the computed error rate is equal to or higher than the learning threshold, the model fails verification and training may continue. The learning threshold can be set according to experience or actual usage; this embodiment does not limit its specific value.
In an exemplary embodiment, the text emotion recognition method may further include the following steps:
Compute the positive-sample error rate E+ and the negative-sample error rate E- of the boosting-algorithm learning model according to formulas (5) and (6), respectively:
E+ = (1/s) · Σ_{xi ∈ D+} II(f(xi) ≠ yi) · costm10    (5)
E- = (1/v) · Σ_{xi ∈ D-} II(f(xi) ≠ yi) · costm01    (6)
Compute the error rate ratio of the boosting-algorithm learning model according to formula (7):
A = E+ / E-    (7)
If the error rate ratio is within the preset range, continue to check whether the error rate is below the learning threshold.
Here, s is the number of positive-emotion sample texts in the verification subset D, i.e. the number of sample texts in D+; v is the number of negative-emotion sample texts in the verification subset D, i.e. the number of sample texts in D-; and m = s + v.
Considering that the error rates on positive and negative samples may differ, the positive-sample error rate E+ and the negative-sample error rate E- of the boosting-algorithm learning model can be computed separately according to formulas (5) and (6). The positive-sample error rate E+ is the error rate obtained by verifying the model on the positive-sample text subset D+, i.e. the error rate on positive sample texts; the negative-sample error rate E- is the error rate obtained by verifying the model on the negative-sample text subset D-, i.e. the error rate on negative sample texts. The error rate computed by formula (4) above is then the error rate for positive and negative sample texts as a whole.
In an exemplary embodiment, after E+ and E- have been computed, the error rate of the boosting-algorithm learning model on the verification subset D can also be obtained through the formula

E = (s · E+ + v · E-) / m

which is consistent with the error rate computed by formula (4) above.
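The per-class error rates of formulas (5) and (6), and the identity E = (s·E+ + v·E-)/m, can be sketched as follows (the label encoding and names are assumptions):

```python
def class_error_rates(preds, labels, costm10, costm01):
    """Formula (5): E+ is the mean cost-weighted error index over the
    positive subset D+; formula (6): E- is the same mean over D-."""
    pos = [(f, y) for f, y in zip(preds, labels) if y == 1]   # D+
    neg = [(f, y) for f, y in zip(preds, labels) if y == 0]   # D-
    e_pos = sum(costm10 for f, y in pos if f != y) / len(pos)
    e_neg = sum(costm01 for f, y in neg if f != y) / len(neg)
    return e_pos, e_neg

# s = 3 positive and v = 2 negative verification samples
labels = [1, 1, 1, 0, 0]
preds  = [1, 0, 1, 0, 1]          # one positive and one negative misrecognized
e_pos, e_neg = class_error_rates(preds, labels, costm10=2.0, costm01=1.0)
overall = (3 * e_pos + 2 * e_neg) / 5   # E = (s*E+ + v*E-)/m
```

Here E+ = 2.0/3, E- = 1.0/2, and the combined E = 0.6, the same value formula (4) would give over all five samples.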
According to formula (7), the error rate ratio A of the boosting-algorithm model can be computed; A reflects how unbalanced the model's error rates are across sample texts of different emotions. When A is 1, the positive-sample error rate E+ equals the negative-sample error rate E-, and the model's error rates on positive and negative sample texts are balanced. When A deviates too far from 1, whether above or below it, the model's error rates on positive and negative sample texts are highly unbalanced and the training has not met the requirements. The idea of this embodiment is that, before judging whether the error rate of the boosting-algorithm learning model meets the requirement, it is first judged whether the model's error rates on sample texts of different emotions are balanced; only if the balance meets the requirement does the method continue to judge whether the error rate meets the requirement.
According to the degree of error-rate imbalance acceptable in the application scenario, a preset range can be set to measure whether the error-rate balance meets the requirement. When the error rate ratio is within the preset range, the balance is acceptable and the check of whether the error rate reaches the learning-threshold criterion can continue. For example, the preset range may be set to [0.5, 2]: when the positive-emotion sample error rate is twice the negative-emotion sample error rate, the computed ratio is A = 2, and when the negative-emotion sample error rate is twice the positive-emotion sample error rate, the computed ratio is A = 0.5. Both lie within the preset range, indicating that this degree of imbalance is acceptable, and the method continues to check whether the error rate is below the learning threshold.
In other embodiments, B = |lg A| can also be used to quantify how unbalanced the boosting-algorithm learning model's error rates are across sample texts of different emotions: B = 0 indicates complete balance, and a larger B indicates worse balance, so a threshold on B can be set to measure whether the model's error-rate balance meets the requirement.
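The ratio A of formula (7), the preset-range check, and the alternative measure B = |lg A| can be sketched as follows (the names are hypothetical and [0.5, 2] is the example range from the text):

```python
import math

def balance_metrics(e_pos, e_neg):
    """A = E+/E- (formula (7)); B = |lg A| is the alternative balance
    measure, where B == 0 means the two class error rates are equal."""
    a = e_pos / e_neg
    return a, abs(math.log10(a))

def is_balanced(a, low=0.5, high=2.0):
    """True when the imbalance A lies inside the preset range [low, high]."""
    return low <= a <= high

a, b = balance_metrics(e_pos=0.2, e_neg=0.1)   # positives err twice as often
```

Here A = 2 sits exactly on the boundary of the example range and is accepted, while a ratio of 4 would fail the check.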
Further, if the error rate ratio is not within the preset range, the boosting-algorithm learning model needs further training. In an exemplary embodiment, the text emotion recognition method further includes the following steps:
If the error rate ratio is not within the preset range, train the boosting-algorithm learning model again using the training subset T.
Recompute the error rate ratio of the boosting-algorithm learning model through the following formulas:
E+ = A · (1/s) · Σ_{xi ∈ D+} II(f(xi) ≠ yi) · costm10    (8)
E- = (1/A) · (1/v) · Σ_{xi ∈ D-} II(f(xi) ≠ yi) · costm01    (9)
A = E+ / E-
Check again whether the error rate ratio is within the preset range.
For example, if the positive-sample error rate E+ computed by formulas (5) and (6) is greater than the negative-sample error rate E-, the error rate ratio A is greater than 1. To improve the error-rate balance of the boosting-algorithm learning model, the model can be trained again and E- and E+ recomputed through formulas (8) and (9). In formulas (8) and (9), if the A computed in the previous verification is greater than 1, E+ is scaled up by multiplying by A and E- is scaled down by multiplying by 1/A; thus, if E+ and E- do not improve substantially in the current round of training, their ratio A will keep growing, which accelerates the training process and improves the training effect. Through the above process, error-rate balance of the model can be reached faster.
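The rescaling of formulas (8) and (9) can be sketched as follows, assuming A from the previous verification round is passed in (the names are illustrative):

```python
def rebalanced_error_rates(e_pos_raw, e_neg_raw, a_prev):
    """Formulas (8)/(9): after a round whose ratio a_prev left the preset
    range, re-scale the freshly computed per-class error rates so that a
    persistent imbalance is amplified, pushing the retraining to fix it."""
    e_pos = a_prev * e_pos_raw          # E+ scaled up by A      (formula (8))
    e_neg = e_neg_raw / a_prev          # E- scaled down by 1/A  (formula (9))
    return e_pos, e_neg, e_pos / e_neg  # new ratio A

# previous round: A = 4; if a new round leaves the raw per-class rates
# unchanged, the rescaled ratio grows to A^2 * (raw ratio) = 16 * 4 = 64
e_pos, e_neg, a_new = rebalanced_error_rates(0.4, 0.1, a_prev=4.0)
```

The multiplicative feedback is what makes an uncorrected imbalance blow up quickly instead of lingering near the boundary of the preset range.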
FIG. 3 shows a flowchart of training a text emotion recognition model in this exemplary embodiment: the sample deviation ratio is computed for the sample text set, and the modified costs are computed from the sample deviation ratio to train the boosting-algorithm learning model; the error rate ratio and the error rate of the training are then computed and judged in turn. If the error rate ratio is judged not to be within the preset range, the method can return to the model-training step and continue training the boosting-algorithm learning model; if the error rate ratio is within the preset range, the judgment of whether the error rate is below the learning threshold can proceed. Further, if the error rate is judged to be equal to or higher than the learning threshold, the method can return to the model-training step and continue training; if the error rate is judged to be below the learning threshold, the model training can be considered complete and the text emotion recognition model is obtained.
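The control flow of FIG. 3 (first the balance check on A, then the threshold check on E) can be sketched as follows; the threshold, range, and toy dynamics are assumptions for illustration only:

```python
def train_until_converged(train_round, evaluate, threshold=0.05,
                          a_range=(0.5, 2.0), max_rounds=100):
    """Keep training while either the error-rate ratio A is outside the
    preset range or the error rate E is still at or above the learning
    threshold; stop as soon as both checks pass.

    train_round() performs one training pass; evaluate() returns (E, A)."""
    for _ in range(max_rounds):
        train_round()
        e, a = evaluate()
        if a_range[0] <= a <= a_range[1] and e < threshold:
            return True          # accepted as the text emotion recognition model
    return False                 # gave up after max_rounds

# toy stand-ins: the error halves each round, the ratio drifts toward 1
state = {"e": 0.4, "a": 4.0}
def train_round():
    state["e"] /= 2.0
    state["a"] = 1.0 + (state["a"] - 1.0) / 2.0
def evaluate():
    return state["e"], state["a"]

ok = train_until_converged(train_round, evaluate)
```

With these toy dynamics, round 2 already satisfies the balance check (A = 1.75) but not the threshold, and round 4 passes both (E = 0.025), mirroring the two-stage gate in the flowchart.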
In an exemplary embodiment, the emotion classification labels may include: level-1 positive emotion text, level-2 positive emotion text, …, level-n positive emotion text, and level-1 negative emotion text, level-2 negative emotion text, …, level-n negative emotion text, where n is an integer greater than 1.
The emotion of a sample text can be classified as positive or negative; further, positive and negative emotions can each be divided by intensity into level-1 positive emotion text, level-2 positive emotion text, …, level-n positive emotion text, and level-1 negative emotion text, level-2 negative emotion text, …, level-n negative emotion text. The emotion classification level can be determined by identifying keywords: for example, a sample text whose keyword is "good" can be labeled as level-1 positive emotion text, while a sample text whose keywords include "very" and "good" can be labeled as level-2 positive emotion text, and so on. In addition, the emotion classification labels may also include neutral emotion text, etc., which is not specifically limited here.
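A toy keyword labeler along the lines of the example above ("good" alone gives level-1 positive, "very" plus "good" gives level-2 positive); the lexicon and the level scheme are illustrative choices, not prescribed by the text:

```python
def label_sentiment_level(text):
    """Keyword-based level labelling sketch: "good" marks positive emotion,
    "very" raises the level from 1 to 2; "bad" is an assumed negative
    keyword added for symmetry, and anything else is treated as neutral."""
    words = text.lower().split()
    level = 2 if "very" in words else 1
    if "good" in words:
        return ("positive", level)
    if "bad" in words:
        return ("negative", level)
    return ("neutral", 0)

label = label_sentiment_level("the service was very good")
```

A production labeler would use a richer lexicon and handle negation, but this shows how graded labels can be produced cheaply for training data.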
Exemplary embodiments of the present disclosure also provide a text emotion recognition apparatus. Referring to FIG. 4, the apparatus 400 may include a sample acquisition module 410, a cost modification module 420, a model acquisition module 430, and a target recognition module 440. The sample acquisition module 410 is configured to acquire a sample text set including a plurality of sample texts and the emotion classification label corresponding to each sample text; the cost modification module 420 is configured to perform a modification calculation on the initial costs according to the quantity distribution of the emotion classification labels in the sample text set to obtain modified costs; the model acquisition module 430 is configured to train a boosting-algorithm learning model with the sample text set and the modified costs to obtain a text emotion recognition model; and the target recognition module 440 is configured to recognize a text to be recognized through the text emotion recognition model to obtain an emotion recognition result of the text to be recognized.
In an exemplary embodiment, the emotion classification labels include positive emotion text and negative emotion text. The cost modification module may include: an initial cost acquisition unit configured to acquire initial costs cost10 and cost01, where cost10 is the initial cost of misrecognizing positive emotion text as negative emotion text and cost01 is the initial cost of misrecognizing negative emotion text as positive emotion text; a text statistics unit configured to count the number Q1 of positive emotion texts and the number Q0 of negative emotion texts in the sample text set; and a cost modification unit configured to perform the modification calculation on the initial costs through the following formulas to obtain the modified costs:
R10 = Q1 / Q0    (1)

costm10 = cost10 · R10^(-a)    (2)

costm01 = cost01 · R10^a    (3)
Here, R10 is the sample deviation ratio, costm10 is the modified cost of misrecognizing positive emotion text as negative emotion text, costm01 is the modified cost of misrecognizing negative emotion text as positive emotion text, and a is an exponent parameter.
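One plausible concrete reading of formulas (1)-(3) is sketched below. The published text shows only image placeholders for these formulas, so the exact form used here (R10 = Q1/Q0 with a symmetric exponent ±a) is an assumption, not the patent's verbatim formula:

```python
def corrected_costs(cost10, cost01, q1, q0, a=1.0):
    """ASSUMED form of formulas (1)-(3): the sample deviation ratio
    R10 = Q1/Q0 scales each initial cost so that misrecognizing the rarer
    class becomes more expensive as the class imbalance grows."""
    r10 = q1 / q0                      # (1) sample deviation ratio (assumed form)
    costm10 = cost10 * r10 ** (-a)     # (2) positive mistaken for negative
    costm01 = cost01 * r10 ** a        # (3) negative mistaken for positive
    return costm10, costm01

# 80,000 positive vs 20,000 negative sample texts, equal initial costs
costm10, costm01 = corrected_costs(1.0, 1.0, q1=80000, q0=20000)
```

Under this assumed form, R10 = 4 drops costm10 to 0.25 and raises costm01 to 4, so errors on the scarcer negative class dominate the weighted error rate.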
In an exemplary embodiment, the model acquisition module may include: a division unit configured to divide the sample text set into a training subset T and a verification subset D, D = {x1, x2, …, xm}; a training unit configured to train the boosting-algorithm learning model using the training subset T; a verification unit configured to obtain, through the boosting-algorithm learning model, the emotion recognition result f(xi) of each sample text xi in the verification subset D; and a calculation unit configured to compute the error rate of the boosting-algorithm learning model according to formula (4):
E = (1/m) · [ Σ_{xi ∈ D+} II(f(xi) ≠ yi) · costm10 + Σ_{xi ∈ D-} II(f(xi) ≠ yi) · costm01 ]    (4)
and a judgment unit configured to, when the error rate is below the learning threshold, judge that training of the boosting-algorithm learning model is complete and determine the trained boosting-algorithm learning model as the text emotion recognition model. Here, m is the number of sample texts in the verification subset, i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the positive-emotion sample text subset of the verification subset D; D- is the negative-emotion sample text subset of the verification subset D; and yi is the emotion classification label of the sample text xi.
In an exemplary embodiment, the calculation unit may also be configured to compute the positive-sample error rate E+ and the negative-sample error rate E- of the boosting-algorithm learning model according to formulas (5) and (6), respectively:
E+ = (1/s) · Σ_{xi ∈ D+} II(f(xi) ≠ yi) · costm10    (5)
E- = (1/v) · Σ_{xi ∈ D-} II(f(xi) ≠ yi) · costm01    (6)
and to compute the error rate ratio of the boosting-algorithm learning model according to formula (7):
A = E+ / E-    (7)
The judgment unit may also be configured to continue checking whether the error rate is below the learning threshold when the error rate ratio is within the preset range. Here, s is the number of positive-emotion sample texts in the verification subset D, v is the number of negative-emotion sample texts in the verification subset D, and m = s + v.
In an exemplary embodiment, the training unit may also be configured to train the boosting-algorithm learning model again using the training subset T if the error rate ratio is not within the preset range; the calculation unit may also be configured to recompute the error rate ratio of the boosting-algorithm learning model through the following formulas:
E+ = A · (1/s) · Σ_{xi ∈ D+} II(f(xi) ≠ yi) · costm10    (8)
E- = (1/A) · (1/v) · Σ_{xi ∈ D-} II(f(xi) ≠ yi) · costm01    (9)
A = E+ / E-
The judgment unit may also be configured to check again whether the error rate ratio is within the preset range.
In an exemplary embodiment, the emotion classification labels may include level-1 positive emotion text, level-2 positive emotion text, …, level-n positive emotion text, and level-1 negative emotion text, level-2 negative emotion text, …, level-n negative emotion text, where n is an integer greater than 1.
In an exemplary embodiment, the boosting-algorithm learning model may include a gradient boosting decision tree (GBDT) model, an AdaBoost model, or an XGBoost model.
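A common way to realize the cost sensitivity with any of the listed boosting learners is through per-sample weights; this mapping is a typical implementation choice, not one mandated by the text:

```python
def cost_sample_weights(labels, costm10, costm01):
    """Turn the modified costs into per-sample training weights: each
    positive sample (label 1) carries costm10, i.e. its cost if it is
    misrecognized, and each negative sample (label 0) carries costm01."""
    return [costm10 if y == 1 else costm01 for y in labels]

weights = cost_sample_weights([1, 0, 1, 1], costm10=0.25, costm01=4.0)
# scikit-learn's GradientBoostingClassifier / AdaBoostClassifier and the
# XGBoost sklearn wrapper all accept such a vector via fit(..., sample_weight=weights)
```

This keeps the learner itself unmodified while biasing its loss toward the class whose misrecognition the modified costs penalize more.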
The specific details of each of the above modules/units have been described in detail in the corresponding method embodiments and are therefore not repeated here.
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
Those skilled in the art will understand that various aspects of the present disclosure can be implemented as a system, a method, or a program product. Therefore, various aspects of the present disclosure may be embodied in the following forms: a purely hardware implementation, a purely software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may be collectively referred to herein as a "circuit", "module", or "system".
An electronic device 500 according to this exemplary embodiment of the present disclosure is described below with reference to FIG. 5. The electronic device 500 shown in FIG. 5 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device 500 takes the form of a general-purpose computing device. Components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one storage unit 520, a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510), and a display unit 540.
The storage unit stores program code that can be executed by the processing unit 510, causing the processing unit 510 to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification. For example, the processing unit 510 may perform steps S110 to S140 shown in FIG. 1, or steps S201 to S205 shown in FIG. 2, and so on.
The storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) 521 and/or a cache memory 522, and may further include a read-only memory (ROM) 523.
The storage unit 520 may also include a program/utility 524 having a set of (at least one) program modules 525. Such program modules 525 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 500 may also communicate with one or more external devices 700 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 550. Moreover, the electronic device 500 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 through the bus 530. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, etc.) to perform the method according to the exemplary embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer non-volatile readable storage medium on which a program product capable of implementing the above method of this specification is stored. In some possible implementations, various aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification.
Referring to FIG. 6, a program product 600 for implementing the above method according to an exemplary embodiment of the present disclosure is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device. The program product may use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
It should be understood that the present disclosure is not limited to the precise structures that have been described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (22)

  1. A text emotion recognition method, characterized in that the method comprises:
    acquiring a sample text set, the sample text set comprising a plurality of sample texts and an emotion classification label corresponding to each of the sample texts;
    performing a modification calculation on initial costs according to the quantity distribution of the emotion classification labels in the sample text set to obtain modified costs;
    training a boosting-algorithm learning model with the sample text set and the modified costs to obtain a text emotion recognition model;
    recognizing a text to be recognized through the text emotion recognition model to obtain an emotion recognition result of the text to be recognized.
  2. The method according to claim 1, characterized in that the emotion classification labels comprise positive emotion text and negative emotion text;
    the performing a modification calculation on initial costs according to the quantity distribution of the emotion classification labels in the sample text set to obtain modified costs comprises:
    acquiring initial costs cost10 and cost01, where cost10 is the initial cost of misrecognizing positive emotion text as negative emotion text and cost01 is the initial cost of misrecognizing negative emotion text as positive emotion text;
    counting the number Q1 of positive emotion texts and the number Q0 of negative emotion texts in the sample text set;
    performing the modification calculation on the initial costs through the following formulas to obtain the modified costs:
    R10 = Q1 / Q0
    costm10 = cost10 · R10^(-a)
    costm01 = cost01 · R10^a
    where R10 is the sample deviation ratio, costm10 is the modified cost of misrecognizing positive emotion text as negative emotion text, costm01 is the modified cost of misrecognizing negative emotion text as positive emotion text, and a is an exponent parameter.
  3. The method according to claim 1, characterized in that the training a boosting-algorithm learning model with the sample text set and the modified costs to obtain a text emotion recognition model comprises:
    dividing the sample text set into a training subset T and a verification subset D, D = {x1, x2, …, xm};
    training the boosting-algorithm learning model using the training subset T;
    obtaining, through the boosting-algorithm learning model, the emotion recognition result f(xi) of each sample text xi in the verification subset D;
    computing the error rate of the boosting-algorithm learning model according to the following formula:
    E = (1/m) · [ Σ_{xi ∈ D+} II(f(xi) ≠ yi) · costm10 + Σ_{xi ∈ D-} II(f(xi) ≠ yi) · costm01 ]
    if the error rate is below a learning threshold, judging that training of the boosting-algorithm learning model is complete, and determining the trained boosting-algorithm learning model as the text emotion recognition model;
    where m is the number of sample texts in the verification subset, i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the positive-emotion sample text subset of the verification subset D; D- is the negative-emotion sample text subset of the verification subset D; and yi is the emotion classification label of the sample text xi.
  4. The method according to claim 3, further comprising:
    calculating a positive-sample error rate E+ and a negative-sample error rate E- of the boosting-algorithm learning model according to the following two formulas, respectively:
    [formula published as image PCTCN2019089166-appb-100005]
    [formula published as image PCTCN2019089166-appb-100006]
    calculating an error rate ratio of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100007]
    if the error rate ratio is within a preset range, continuing to detect whether the error rate is lower than the learning threshold;
    where s is the number of positive emotion sample texts in the validation subset D, v is the number of negative emotion sample texts in the validation subset D, and m = s + v.
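The per-class error rates of claim 4 can be sketched as follows, with s and v the positive and negative sample counts as defined in the claim. The published ratio formula is an image, so the form E+/E- below is an assumption.

```python
def error_rates(preds, labels):
    pos = [(f, y) for f, y in zip(preds, labels) if y == 1]  # D+, s samples
    neg = [(f, y) for f, y in zip(preds, labels) if y == 0]  # D-, v samples
    e_pos = sum(f != y for f, y in pos) / len(pos)           # E+
    e_neg = sum(f != y for f, y in neg) / len(neg)           # E-
    return e_pos, e_neg, e_pos / e_neg                       # assumed ratio E+/E-
```

A real implementation would also guard against an empty class or E- = 0 before dividing.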
  5. The method according to claim 4, further comprising:
    if the error rate ratio is not within the preset range, training the boosting-algorithm learning model with the training subset T again;
    recalculating the error rate ratio of the boosting-algorithm learning model by the following formulas:
    [formula published as image PCTCN2019089166-appb-100008]
    [formula published as image PCTCN2019089166-appb-100009]
    [formula published as image PCTCN2019089166-appb-100010]
    detecting again whether the error rate ratio is within the preset range.
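Claims 3 to 5 together describe a train-and-check loop: retrain on T while the error-rate ratio is outside the preset range, and accept the model once the ratio is in range and the overall error rate falls below the learning threshold. A library-agnostic sketch of that loop; the range bounds, threshold, and round cap are illustrative, not from the patent:

```python
def train_until_balanced(fit, ratio_fn, error_fn,
                         lo=0.8, hi=1.25, threshold=0.1, max_rounds=20):
    for _ in range(max_rounds):
        fit()                        # train the boosting learner on subset T
        if lo <= ratio_fn() <= hi:   # error-rate ratio within the preset range?
            return error_fn() < threshold  # accept if error rate beats threshold
    return False                     # never balanced within the round cap
```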
  6. The method according to claim 1, wherein the emotion classification labels comprise level-1 positive emotion text, level-2 positive emotion text, …, level-n positive emotion text, and level-1 negative emotion text, level-2 negative emotion text, …, level-n negative emotion text, where n is an integer greater than 1.
  7. The method according to claim 1, wherein the boosting-algorithm learning model comprises a gradient boosting decision tree (GBDT) model, an AdaBoost model, or an XGBoost model.
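With any of the boosting learners named in claim 7, the corrected costs can enter training as per-sample weights. The mapping below is an assumption, not the patented mechanism; the commented fit call uses XGBoost's scikit-learn-style `sample_weight` argument.

```python
def sample_weights(labels, costm10, costm01):
    # Weight each training sample by the assumed cost of misrecognising it:
    # positives (y = 1) by costm_10, negatives (y = 0) by costm_01.
    return [costm10 if y == 1 else costm01 for y in labels]

# e.g. xgboost.XGBClassifier().fit(X, y, sample_weight=sample_weights(y, 2.0, 1.0))
```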
  8. A text emotion recognition apparatus, comprising:
    a sample acquisition module, configured to acquire a sample text set, the sample text set comprising a plurality of sample texts and an emotion classification label corresponding to each of the sample texts;
    a cost correction module, configured to perform correction calculation on an initial cost according to the quantity distribution of the emotion classification labels in the sample text set to obtain a corrected cost;
    a model acquisition module, configured to train a boosting-algorithm learning model with the sample text set and the corrected cost to obtain a text emotion recognition model; and
    a target recognition module, configured to recognize a to-be-recognized text through the text emotion recognition model to obtain an emotion recognition result of the to-be-recognized text.
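The four modules of claim 8 compose into a simple pipeline. The sketch below wires them together with injected callables; the class and parameter names are illustrative, not from the patent.

```python
class TextEmotionPipeline:
    def __init__(self, acquire_samples, correct_cost, train_model):
        self.acquire_samples = acquire_samples  # sample acquisition module
        self.correct_cost = correct_cost        # cost correction module
        self.train_model = train_model          # model acquisition module
        self.model = None

    def build(self):
        samples = self.acquire_samples()
        cost = self.correct_cost(samples)
        self.model = self.train_model(samples, cost)

    def recognize(self, text):                  # target recognition module
        return self.model(text)
```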
  9. The apparatus according to claim 8, wherein the emotion classification labels comprise positive emotion text and negative emotion text, and the cost correction module comprises:
    an initial cost acquisition unit, configured to acquire initial costs cost_10 and cost_01, where cost_10 is the initial cost of misclassifying positive emotion text as negative emotion text and cost_01 is the initial cost of misclassifying negative emotion text as positive emotion text;
    a text statistics unit, configured to count a quantity Q1 of positive emotion texts and a quantity Q0 of negative emotion texts in the sample text set; and
    a cost correction unit, configured to perform correction calculation on the initial costs by the following formulas to obtain the corrected costs:
    [formula published as image PCTCN2019089166-appb-100011]
    [formula published as image PCTCN2019089166-appb-100012]
    [formula published as image PCTCN2019089166-appb-100013]
    where R_10 is the sample deviation ratio, costm_10 is the corrected cost of misclassifying positive emotion text as negative emotion text, costm_01 is the corrected cost of misclassifying negative emotion text as positive emotion text, and a is an exponent parameter.
  10. The apparatus according to claim 8, wherein the model acquisition module comprises:
    a dividing unit, configured to divide the sample text set into a training subset T and a validation subset D, D = {x_1, x_2, …, x_m};
    a training unit, configured to train the boosting-algorithm learning model with the training subset T;
    a verification unit, configured to obtain, through the boosting-algorithm learning model, an emotion recognition result f(x_i) for each sample text x_i in the validation subset D;
    a calculation unit, configured to calculate an error rate of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100014]
    a judging unit, configured to: if the error rate is lower than a learning threshold, determine that training of the boosting-algorithm learning model is complete, and determine the trained boosting-algorithm learning model as the text emotion recognition model;
    where m is the number of sample texts in the validation subset, i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the subset of positive emotion sample texts in the validation subset D; D- is the subset of negative emotion sample texts in the validation subset D; and y_i is the emotion classification label of the sample text x_i.
  11. The apparatus according to claim 10, wherein:
    the calculation unit is further configured to calculate a positive-sample error rate E+ and a negative-sample error rate E- of the boosting-algorithm learning model according to the following formulas, respectively:
    [formula published as image PCTCN2019089166-appb-100015]
    [formula published as image PCTCN2019089166-appb-100016]
    and to calculate an error rate ratio of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100017]
    the judging unit is further configured to: if the error rate ratio is within a preset range, continue to detect whether the error rate is lower than the learning threshold;
    where s is the number of positive emotion sample texts in the validation subset D, v is the number of negative emotion sample texts in the validation subset D, and m = s + v.
  12. The apparatus according to claim 11, wherein:
    the training unit is further configured to: if the error rate ratio is not within the preset range, train the boosting-algorithm learning model with the training subset T again;
    the calculation unit is further configured to recalculate the error rate ratio of the boosting-algorithm learning model by the following formulas:
    [formula published as image PCTCN2019089166-appb-100018]
    [formula published as image PCTCN2019089166-appb-100019]
    [formula published as image PCTCN2019089166-appb-100020]
    the judging unit is further configured to detect again whether the error rate ratio is within the preset range.
  13. A text emotion recognition apparatus, comprising a processor and a memory, the memory storing computer-readable instructions, wherein, when the computer-readable instructions are executed by the processor, the processor is configured to perform the following processing:
    acquiring a sample text set, the sample text set comprising a plurality of sample texts and an emotion classification label corresponding to each of the sample texts;
    performing correction calculation on an initial cost according to the quantity distribution of the emotion classification labels in the sample text set to obtain a corrected cost;
    training a boosting-algorithm learning model with the sample text set and the corrected cost to obtain a text emotion recognition model; and
    recognizing a to-be-recognized text through the text emotion recognition model to obtain an emotion recognition result of the to-be-recognized text.
  14. The apparatus according to claim 13, wherein the emotion classification labels comprise positive emotion text and negative emotion text; and
    the performing correction calculation on the initial cost according to the quantity distribution of the emotion classification labels in the sample text set to obtain the corrected cost comprises:
    acquiring initial costs cost_10 and cost_01, where cost_10 is the initial cost of misclassifying positive emotion text as negative emotion text and cost_01 is the initial cost of misclassifying negative emotion text as positive emotion text;
    counting a quantity Q1 of positive emotion texts and a quantity Q0 of negative emotion texts in the sample text set; and
    performing correction calculation on the initial costs by the following formulas to obtain the corrected costs:
    [formula published as image PCTCN2019089166-appb-100021]
    [formula published as image PCTCN2019089166-appb-100022]
    [formula published as image PCTCN2019089166-appb-100023]
    where R_10 is the sample deviation ratio, costm_10 is the corrected cost of misclassifying positive emotion text as negative emotion text, costm_01 is the corrected cost of misclassifying negative emotion text as positive emotion text, and a is an exponent parameter.
  15. The apparatus according to claim 13, wherein, in training the boosting-algorithm learning model with the sample text set and the corrected cost to obtain the text emotion recognition model, the processor is configured to perform the following processing:
    dividing the sample text set into a training subset T and a validation subset D, D = {x_1, x_2, …, x_m};
    training the boosting-algorithm learning model with the training subset T;
    obtaining, through the boosting-algorithm learning model, an emotion recognition result f(x_i) for each sample text x_i in the validation subset D;
    calculating an error rate of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100024]
    if the error rate is lower than a learning threshold, determining that training of the boosting-algorithm learning model is complete, and determining the trained boosting-algorithm learning model as the text emotion recognition model;
    where m is the number of sample texts in the validation subset, i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the subset of positive emotion sample texts in the validation subset D; D- is the subset of negative emotion sample texts in the validation subset D; and y_i is the emotion classification label of the sample text x_i.
  16. The apparatus according to claim 15, wherein the processor is further configured to perform the following processing:
    calculating a positive-sample error rate E+ and a negative-sample error rate E- of the boosting-algorithm learning model according to the following formulas, respectively:
    [formula published as image PCTCN2019089166-appb-100025]
    [formula published as image PCTCN2019089166-appb-100026]
    calculating an error rate ratio of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100027]
    if the error rate ratio is within a preset range, continuing to detect whether the error rate is lower than the learning threshold;
    where s is the number of positive emotion sample texts in the validation subset D, v is the number of negative emotion sample texts in the validation subset D, and m = s + v.
  17. The apparatus according to claim 16, wherein the processor is further configured to perform the following steps:
    if the error rate ratio is not within the preset range, training the boosting-algorithm learning model with the training subset T again;
    recalculating the error rate ratio of the boosting-algorithm learning model by the following formulas:
    [formula published as image PCTCN2019089166-appb-100028]
    [formula published as image PCTCN2019089166-appb-100029]
    [formula published as image PCTCN2019089166-appb-100030]
    detecting again whether the error rate ratio is within the preset range.
  18. A computer non-volatile readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the processor is caused to perform the following steps:
    acquiring a sample text set, the sample text set comprising a plurality of sample texts and an emotion classification label corresponding to each of the sample texts;
    performing correction calculation on an initial cost according to the quantity distribution of the emotion classification labels in the sample text set to obtain a corrected cost;
    training a boosting-algorithm learning model with the sample text set and the corrected cost to obtain a text emotion recognition model; and
    recognizing a to-be-recognized text through the text emotion recognition model to obtain an emotion recognition result of the to-be-recognized text.
  19. The computer non-volatile readable storage medium according to claim 18, wherein the emotion classification labels comprise positive emotion text and negative emotion text; and
    the performing correction calculation on the initial cost according to the quantity distribution of the emotion classification labels in the sample text set to obtain the corrected cost comprises:
    acquiring initial costs cost_10 and cost_01, where cost_10 is the initial cost of misclassifying positive emotion text as negative emotion text and cost_01 is the initial cost of misclassifying negative emotion text as positive emotion text;
    counting a quantity Q1 of positive emotion texts and a quantity Q0 of negative emotion texts in the sample text set; and
    performing correction calculation on the initial costs by the following formulas to obtain the corrected costs:
    [formula published as image PCTCN2019089166-appb-100031]
    [formula published as image PCTCN2019089166-appb-100032]
    [formula published as image PCTCN2019089166-appb-100033]
    where R_10 is the sample deviation ratio, costm_10 is the corrected cost of misclassifying positive emotion text as negative emotion text, costm_01 is the corrected cost of misclassifying negative emotion text as positive emotion text, and a is an exponent parameter.
  20. The computer non-volatile readable storage medium according to claim 18, wherein, in training the boosting-algorithm learning model with the sample text set and the corrected cost to obtain the text emotion recognition model, the processor is configured to perform the following steps:
    dividing the sample text set into a training subset T and a validation subset D, D = {x_1, x_2, …, x_m};
    training the boosting-algorithm learning model with the training subset T;
    obtaining, through the boosting-algorithm learning model, an emotion recognition result f(x_i) for each sample text x_i in the validation subset D;
    calculating an error rate of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100034]
    if the error rate is lower than a learning threshold, determining that training of the boosting-algorithm learning model is complete, and determining the trained boosting-algorithm learning model as the text emotion recognition model;
    where m is the number of sample texts in the validation subset, i ∈ [1, m]; E is the error rate of the boosting-algorithm learning model; D+ is the subset of positive emotion sample texts in the validation subset D; D- is the subset of negative emotion sample texts in the validation subset D; and y_i is the emotion classification label of the sample text x_i.
  21. The computer non-volatile readable storage medium according to claim 20, wherein, when the computer program is executed by the processor, the processor is further caused to perform the following processing:
    calculating a positive-sample error rate E+ and a negative-sample error rate E- of the boosting-algorithm learning model according to the following formulas, respectively:
    [formula published as image PCTCN2019089166-appb-100035]
    [formula published as image PCTCN2019089166-appb-100036]
    calculating an error rate ratio of the boosting-algorithm learning model according to the following formula:
    [formula published as image PCTCN2019089166-appb-100037]
    if the error rate ratio is within a preset range, continuing to detect whether the error rate is lower than the learning threshold;
    where s is the number of positive emotion sample texts in the validation subset D, v is the number of negative emotion sample texts in the validation subset D, and m = s + v.
  22. The computer non-volatile readable storage medium according to claim 21, wherein, when the computer program is executed by the processor, the processor is further caused to perform the following processing:
    if the error rate ratio is not within the preset range, training the boosting-algorithm learning model with the training subset T again;
    recalculating the error rate ratio of the boosting-algorithm learning model by the following formulas:
    [formula published as image PCTCN2019089166-appb-100038]
    [formula published as image PCTCN2019089166-appb-100039]
    [formula published as image PCTCN2019089166-appb-100040]
    detecting again whether the error rate ratio is within the preset range.
PCT/CN2019/089166 2018-10-24 2019-05-30 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium WO2020082734A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811244553.6 2018-10-24
CN201811244553.6A CN109344257A (en) 2018-10-24 2018-10-24 Text emotion recognition methods and device, electronic equipment, storage medium

Publications (1)

Publication Number Publication Date
WO2020082734A1 true WO2020082734A1 (en) 2020-04-30

Family

ID=65311430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089166 WO2020082734A1 (en) 2018-10-24 2019-05-30 Text emotion recognition method and apparatus, electronic device, and computer non-volatile readable storage medium

Country Status (2)

Country Link
CN (1) CN109344257A (en)
WO (1) WO2020082734A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344257A (en) * 2018-10-24 2019-02-15 平安科技(深圳)有限公司 Text emotion recognition methods and device, electronic equipment, storage medium
CN110069601A (en) * 2019-04-03 2019-07-30 平安科技(深圳)有限公司 Mood determination method and relevant apparatus
CN110351090B (en) * 2019-05-27 2021-04-27 平安科技(深圳)有限公司 Group signature digital certificate revoking method and device, storage medium and electronic equipment
CN110516416B (en) * 2019-08-06 2021-08-06 咪咕文化科技有限公司 Identity authentication method, authentication end and client
CN110909258B (en) * 2019-11-22 2023-09-29 上海喜马拉雅科技有限公司 Information recommendation method, device, equipment and storage medium
CN110910904A (en) * 2019-12-25 2020-03-24 浙江百应科技有限公司 Method for establishing voice emotion recognition model and voice emotion recognition method
CN112507082A (en) * 2020-12-16 2021-03-16 作业帮教育科技(北京)有限公司 Method and device for intelligently identifying improper text interaction and electronic equipment
CN113705206B (en) * 2021-08-13 2023-01-03 北京百度网讯科技有限公司 Emotion prediction model training method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103106211A (en) * 2011-11-11 2013-05-15 中国移动通信集团广东有限公司 Emotion recognition method and emotion recognition device for customer consultation texts
CN105975992A (en) * 2016-05-18 2016-09-28 天津大学 Unbalanced data classification method based on adaptive upsampling
CN106815369A (en) * 2017-01-24 2017-06-09 中山大学 A kind of file classification method based on Xgboost sorting algorithms
CN108460421A (en) * 2018-03-13 2018-08-28 中南大学 The sorting technique of unbalanced data
CN109344257A (en) * 2018-10-24 2019-02-15 平安科技(深圳)有限公司 Text emotion recognition methods and device, electronic equipment, storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107807914A (en) * 2016-09-09 2018-03-16 阿里巴巴集团控股有限公司 Recognition methods, object classification method and the data handling system of Sentiment orientation
CN107958292B (en) * 2017-10-19 2022-02-25 山东科技大学 Transformer fuzzy and cautious reasoning fault diagnosis method based on cost sensitive learning


Also Published As

Publication number Publication date
CN109344257A (en) 2019-02-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19875214

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19875214

Country of ref document: EP

Kind code of ref document: A1