CN111523951A - Data enhancement method and device

Data enhancement method and device

Info

Publication number
CN111523951A
CN111523951A
Authority
CN
China
Prior art keywords
data
training data
training
subsets
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910038115.2A
Other languages
Chinese (zh)
Inventor
杨牡丹
高维国
陈勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijun Technology Co.,Ltd.
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910038115.2A priority Critical patent/CN111523951A/en
Publication of CN111523951A publication Critical patent/CN111523951A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0613Third-party assisted

Landscapes

  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a data enhancement method and device. The method comprises the following steps: N training data subsets are obtained, where each training data subset corresponds to one emotion type and N is a positive integer; a data set to be processed is determined from the N training data subsets according to the ratio of the data amount of each training data subset to the total data amount of the N training data subsets; the data set to be processed undergoes at least one reverse translation to obtain a reverse translation data set, where reverse translation is the process of translating text into a non-Chinese language and then translating it back into Chinese; and a target training data set is obtained from the N training data subsets and the reverse translation data set. This expands the data volume of the training data, alleviates the problem of unbalanced data distribution, improves the quality of the training data, and thereby improves the performance of the target training model and the emotion classification results.

Description

Data enhancement method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data enhancement method and apparatus.
Background
With the rapid development of electronic commerce, the number of online shoppers is growing exponentially, and customer service robots have emerged to meet this demand. During service, the customer service robot needs to judge the customer's emotion accurately. This can be cast as an emotion classification problem on the customer's text data, and the emotion type corresponding to the text data can be determined by a deep learning neural network model (referred to as a training model for short).
It is well known that the performance of a training model depends heavily on the training data. In general, the larger the data volume of the training data, the better the performance of the training model; likewise, the richer the features of the training data, the better the performance of the training model.
However, in practical emotion classification, collecting training data is difficult. As a result, the data volume of the training data is insufficient, the training data corresponding to the various emotion types is unbalanced, and the features of the training data need to be labeled manually one by one, which consumes considerable financial, material, and human resources. This degrades the performance of the training model, so the customer service robot cannot accurately determine the customer's emotion type through the training model.
Disclosure of Invention
The invention provides a data enhancement method and a data enhancement device, which are used for solving the problems that the total data amount of training data is insufficient or the data amount distribution of the training data corresponding to various emotion types is unbalanced in the prior art.
In a first aspect, the present invention provides a data enhancement method, including:
acquiring N training data subsets, wherein each training data subset corresponds to one emotion type, and N is a positive integer;
determining a data set to be processed from the N training data subsets according to the proportion between the data amount of the N training data subsets and the total data amount of the N training data subsets;
performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set, wherein the reverse translation is a process of translating text data into a non-Chinese language and then translating it back into Chinese;
and obtaining a target training data set according to the N training data subsets and the reverse translation data set.
Optionally, the determining, according to the ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets, a to-be-processed data set from the N training data subsets includes:
judging whether the proportion of the data amount of each of the N training data subsets to the total data amount of the N training data subsets is smaller than a preset proportion;
and determining the training data subset corresponding to the preset proportion as the data set to be processed.
Optionally, the determining, according to the ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets, a to-be-processed data set from the N training data subsets includes:
sorting the N training data subsets in ascending order of the proportion of their data amounts to the total data amount of the N training data subsets;
and determining the first M training data subsets as the data set to be processed, wherein N is more than or equal to M, and M is a positive integer.
Optionally, the performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set includes:
and performing English translation on the data set to be processed and then performing Chinese translation to obtain the reverse translation data set.
Optionally, the obtaining N training data subsets includes:
and classifying the original training data set according to the N emotion types to obtain N original training data subsets.
Optionally, the N emotion types include at least one of happy/thankful, angry, lost, anxious, confused, worried, or other.
Optionally, the method further comprises:
and training the initial training model according to the target training data set to obtain a target training model, wherein the target training model is used for determining the emotion type corresponding to the data to be tested.
In a second aspect, the present invention provides a data enhancement apparatus, comprising:
the acquisition module is used for acquiring N training data subsets, each training data subset corresponds to one emotion type, and N is a positive integer;
a determining module, configured to determine a to-be-processed data set from the N training data subsets according to a ratio between a data amount of the N training data subsets and a total data amount of the N training data subsets, respectively;
the processing module is used for performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set, where reverse translation is the process of translating any text data into a non-Chinese language and then translating it back into Chinese;
the processing module is further configured to obtain a target training data set according to the N training data subsets and the reverse translation data set.
Optionally, the determining module is specifically configured to determine whether a ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets is smaller than a preset ratio; and determining the training data subset corresponding to the preset proportion as the data set to be processed.
Optionally, the determining module is specifically configured to sort the N training data subsets in an order that a ratio of data amounts of the N training data subsets to a total data amount of the N training data subsets increases from small to large; and determining the first M training data subsets as the data set to be processed, wherein N is more than or equal to M, and M is a positive integer.
Optionally, the processing module is specifically configured to translate the data set to be processed into English and then back into Chinese to obtain the reverse translation data set.
Optionally, the obtaining module is specifically configured to classify the original training data sets according to the N emotion types to obtain the N original training data subsets.
Optionally, the N emotion types include at least one of happy/thankful, angry, lost, anxious, confused, worried, or other.
Optionally, the processing module is further configured to train the initial training model according to the target training data set to obtain a target training model, where the target training model is used to determine an emotion type corresponding to the data to be tested.
In a third aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data enhancement method of the first aspect.
In a fourth aspect, the present invention provides an electronic device comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data enhancement method of the first aspect via execution of the executable instructions.
According to the data enhancement method and device provided by the invention, N training data subsets respectively corresponding to N emotion types are acquired, where N is a positive integer, and a data set to be processed is determined from the N training data subsets according to the ratio of the data amount of each training data subset to the total data amount of the N training data subsets. The data set to be processed then undergoes at least one reverse translation to obtain a reverse translation data set, where reverse translation is the process of translating text into a non-Chinese language and then translating it back into Chinese, and a target training data set is obtained from the N training data subsets and the reverse translation data set. Through reverse translation, the invention can provide training data that has a meaning similar to the original training data but a different wording, which increases the data volume and diversity of the training data, improves its quality, and solves the prior-art problems of an insufficient total amount of training data and an unbalanced distribution of data amounts across emotion types. Training the model on this data therefore allows the emotion type corresponding to text data to be determined accurately, improves the generalization capability and performance of the training model, and improves the emotion classification results.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the following briefly introduces the drawings needed to be used in the description of the embodiments or the prior art, and obviously, the drawings in the following description are some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the drawings without inventive labor.
FIG. 1 is a flow chart of a data enhancement method provided by the present invention;
FIG. 2 is a flow chart of a data enhancement method provided by the present invention;
FIG. 3 is a flow chart of a data enhancement method provided by the present invention;
FIG. 4 is a schematic structural diagram of a data enhancement apparatus provided in the present invention;
FIG. 5 is a schematic diagram of a hardware structure of the electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a data enhancement method and a data enhancement device, which can be applied to any field that needs rich text data. By performing reverse translation (back-translation) on the training data corresponding to the various emotion types, similar and/or different Chinese expressions of the training data can be produced, which increases the data volume of the training data and makes it richer. Training the model on this data therefore allows the emotion type corresponding to text data to be determined accurately, improves the generalization capability and performance of the training model, and improves the emotion classification results.
For example, in an actual application scenario, the customer service robot can determine the emotion types of the text data input by the customer through the training model, so that the identification capability of the emotion types is improved, the customer service robot can provide better service for the customer, and the shopping experience of the customer is improved.
The execution subject of the embodiment of the present application may include, but is not limited to, a terminal device or a server. Hereinafter, a specific implementation process of the data enhancement method is described in detail through a specific embodiment.
Fig. 1 is a flowchart of a data enhancement method provided by the present invention, and as shown in fig. 1, the data enhancement method of this embodiment may include:
S101, N training data subsets are obtained, each training data subset corresponds to one emotion type, and N is a positive integer.
This embodiment may collect training data corresponding to the various emotion types; the training data may be text data, or text data converted from other types of data, which is not limited here. For example, a piece of training data may be the text "good, thanks", or the words "thank you" converted from a sticker (emoji) expressing gratitude, and the like.
Further, the present embodiment may use existing techniques, such as classification algorithms, to place the training data corresponding to each emotion type into one training data subset, that is, each training data subset corresponds to one emotion type. Each training data subset may be expressed in the form of a sequence or a list, which is not limited in this embodiment.
S102, determining a data set to be processed from the N training data subsets according to the proportion between the data quantity of the N training data subsets and the total data quantity of the N training data subsets.
The present embodiment may calculate the data amount of each of the N training data subsets, that is, the number of training data items in any one training data subset. For example, if a training data subset is the text set A { "thank you", "it was a pleasure communicating with you", "the quality of item X is good", "your guidance was very effective" }, then, since text set A contains 4 pieces of training data, the data amount of that training data subset is 4.
Further, in this embodiment, the data amounts of the N training data subsets may be added to obtain a total data amount of the N training data subsets, and then a ratio between the data amount of each training data subset and the total data amount of the N training data subsets is calculated.
For example, in the embodiment, there are three training data subsets 1-3, when the data size of the training data subset 1 is 4, the data size of the training data subset 2 is 5, and the data size of the training data subset 3 is 11, the total data size of the three training data subsets is 20, the ratio of the data size of the training data subset 1 to the total data size of the three training data subsets is 4/20, the ratio of the data size of the training data subset 2 to the total data size of the three training data subsets is 5/20, and the ratio of the data size of the training data subset 3 to the total data size of the three training data subsets is 11/20.
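The ratio calculation described above can be written as a minimal Python sketch; `subsets` is a hypothetical mapping from emotion type to a list of training texts, introduced here only for illustration.
```python
def subset_ratios(subsets):
    """Compute each training data subset's share of the total data amount.

    `subsets` maps an emotion type to its list of training texts, e.g.
    {"happy": [...], "angry": [...], "lost": [...]}.
    Returns a dict mapping each emotion type to data_amount / total_data_amount.
    """
    total = sum(len(texts) for texts in subsets.values())
    return {label: len(texts) / total for label, texts in subsets.items()}

# With subset sizes 4, 5 and 11 this yields the ratios 4/20, 5/20 and 11/20
# used in the example above.
```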
Further, in this embodiment, according to the ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets, the distribution of the N training data subsets may be determined, and it is determined whether the training data subsets corresponding to each emotion type need to be subjected to data expansion, so that the to-be-processed data set may be determined from the N training data subsets.
On one hand, when the total amount of data of the N training data subsets is small, the embodiment may determine all of the N training data subsets as the data set to be processed, so as to increase the total amount of training data and enrich the emotion types corresponding to the training data.
On the other hand, when the data volume of some of the N training data subsets is small, the N training data subsets have an unbalanced data distribution, and this embodiment may determine the training data subsets satisfying a preset condition as the data set to be processed.
The embodiment does not limit the specific implementation form of the preset condition. In the following, two possible implementation manners are used to describe in detail a specific implementation process for determining a to-be-processed data set from N training data subsets.
Optionally, judging whether the ratio of the data amount of the N training data subsets to the total data amount of the N training data subsets is smaller than a preset ratio; and determining the training data subset corresponding to the preset proportion as a data set to be processed.
The preset condition in this embodiment is whether the ratio of the data amount of each training data subset to the total data amount of the N training data subsets is smaller than a preset ratio. Each subset's ratio is therefore compared with the preset ratio: a subset whose ratio is not smaller than the preset ratio has a sufficient or relatively large data amount, while a subset whose ratio is smaller than the preset ratio has a relatively small data amount, so the training data subsets whose ratios are smaller than the preset ratio are determined as the data set to be processed.
The preset ratio may be set according to an empirical value, which is not limited in this embodiment.
For example, suppose there are three training data subsets 1-3 and the preset ratio is 6/20. If the ratio of the data amount of training data subset 1 to the total data amount of the three subsets is 1/20, that of training data subset 2 is 2/20, and that of training data subset 3 is 17/20, then, since the ratios 1/20 and 2/20 of training data subsets 1 and 2 are both smaller than the preset ratio 6/20, this embodiment may determine training data subset 1 and training data subset 2 as the data set to be processed.
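A minimal sketch of this threshold-based selection, assuming the same hypothetical `subsets` mapping as above and a caller-supplied preset ratio:
```python
def select_by_threshold(subsets, preset_ratio):
    """Return the subsets whose share of the total data amount is below preset_ratio."""
    total = sum(len(texts) for texts in subsets.values())
    return {
        label: texts
        for label, texts in subsets.items()
        if len(texts) / total < preset_ratio
    }

# Example from the text: subset sizes 1, 2 and 17 with preset_ratio = 6/20
# select the two subsets whose ratios 1/20 and 2/20 fall below the threshold.
```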
In another feasible implementation manner, optionally, the N training data subsets are sorted in an order that the proportion of the data amount of the N training data subsets to the total data amount of the N training data subsets is from small to large; and determining the first M training data subsets as to-be-processed data sets, wherein N is more than or equal to M, and M is a positive integer.
The preset condition in this embodiment is that the ratio of a training data subset's data amount to the total data amount of the N training data subsets ranks among the first M when the subsets are sorted by this ratio. The N training data subsets may therefore be sorted in ascending order of the ratio of their data amounts to the total data amount of the N training data subsets, and the first M training data subsets are taken as the data set to be processed.
For example, suppose there are three training data subsets 1-3 and M is 2. If the ratio of the data amount of training data subset 1 to the total data amount of the three subsets is 11/20, that of training data subset 2 is 2/20, and that of training data subset 3 is 7/20, then, since 2/20 is smaller than 7/20 and 7/20 is smaller than 11/20, sorting the three subsets places training data subset 2 and training data subset 3 first, so training data subset 2 and training data subset 3 are determined as the data set to be processed.
In addition, in this embodiment, the N training data subsets may instead be sorted in descending order of the ratio of their data amounts to the total data amount of the N training data subsets, and the last M training data subsets are taken as the data set to be processed.
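The sort-based variant can be sketched as follows, again assuming the hypothetical `subsets` mapping; `m` is the number of smallest subsets to expand.
```python
def select_m_smallest(subsets, m):
    """Return the M subsets with the smallest share of the total data amount."""
    total = sum(len(texts) for texts in subsets.values())
    ordered = sorted(subsets.items(), key=lambda item: len(item[1]) / total)
    return dict(ordered[:m])

# With subset sizes 11, 2 and 7 and m = 2, the subsets of size 2 and 7 are
# selected, as in the example above.
```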
S103, performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set, where reverse translation is the process of translating text into a non-Chinese language and then translating it back into Chinese.
Because a non-Chinese expression of any text data can be translated back into Chinese in several ways that are semantically similar but use different words, each resulting expression may show some similarity to or difference from the original, and the training data in the data set to be processed is Chinese text. This embodiment can therefore perform one or more reverse translations on each piece of training data in the data set to be processed and collect the resulting Chinese expressions into the reverse translation data set.
In this embodiment, the non-Chinese translation and Chinese translation of the training data may be performed with translation software, by manual translation, or in other manners; the specific implementation form is not limited here.
To simplify the acquisition of the reverse translation data set, this embodiment may perform the reverse translation on the data set to be processed once and use English as the non-Chinese language. Optionally, the data set to be processed is therefore translated into English and then back into Chinese to obtain training data that is similar to or different from the training data in the data set to be processed, and this training data is determined as the reverse translation data set.
For example, when a piece of training data in the data set to be processed is the text "Received it yesterday and opened it today, only to find it had gone mouldy", it is translated into the English text "Yesterday I received the goods and found they were mouldy", and that English text is then translated back into the Chinese "I received the goods yesterday and found they had mildewed". It can be seen that the training data in the reverse translation data set is similar to the training data in the data set to be processed.
When the training data in the data set to be processed is the text "Dear! It has been ten days since I ordered my item and it still has not been shipped; please check it for me, thanks!", it is translated into the English "Dear! My item was ordered ten days ago and has not been shipped; thank you for checking for me!", and that English text is then translated back into the Chinese "Dear! I ordered my item ten days ago and it has not been shipped; thank you for checking for me." It can be seen that the training data in the reverse translation data set differs from the training data in the data set to be processed.
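A minimal sketch of this back-translation step, assuming a generic `translate(text, src, dst)` helper that stands in for whatever translation software, API, or manual process is actually used (its name and signature are assumptions, not part of the method):
```python
def back_translate(texts, translate, pivot="en", rounds=1):
    """Back-translate Chinese texts through a pivot language.

    `translate(text, src, dst)` is a placeholder for the actual translation tool.
    Each round translates Chinese -> pivot -> Chinese, producing paraphrases
    with similar meaning but different wording.
    """
    augmented = []
    for text in texts:
        current = text
        for _ in range(rounds):
            pivot_text = translate(current, src="zh", dst=pivot)   # non-Chinese translation
            current = translate(pivot_text, src=pivot, dst="zh")   # Chinese translation
        augmented.append(current)
    return augmented
```
Results identical to the original sentences can be filtered out before merging the reverse translation data set into the target training data set.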
And S104, obtaining a target training data set according to the N training data subsets and the reverse translation data set.
Whether the training data in the reverse translation data set is similar to or different from the training data in the data set to be processed, obtaining the reverse translation data set increases the total data amount of the N training data subsets. A target training data set can therefore be obtained from the N training data subsets and the reverse translation data set, which increases the data volume of the training data, makes it richer, and improves its quality. Training the model on this data therefore allows the emotion type corresponding to text data to be determined accurately, improves the generalization capability and performance of the training model, and improves the emotion classification results.
In a specific embodiment, taking an e-commerce scenario as an example, 7 training data subsets corresponding to seven customer emotion types are first obtained. The seven emotion types are happy/thankful, angry, lost, anxious, confused, worried, and other; each emotion type respectively corresponds to and identifies one of the 7 training data subsets. As shown in Table 1, each of the 7 training data subsets is illustrated by the Chinese expression of one piece of training data together with its corresponding emotion type.
TABLE 1
[Table 1 appears as an image in the original document and is not reproduced here.]
Next, as shown in Table 2, the total data amount of the 7 training data subsets is 51,800. The two emotion types "lost" and "worried" appear relatively rarely in customers' shopping conversations and are less important, so, to save cost and computation, the training data subsets corresponding to these two emotion types are not reverse-translated.
Further, according to the ratio of the data amount of each of the 7 training data subsets to their total data amount, one of two modes is used to perform a single reverse translation, as follows:
In the first mode, the data set to be processed consists of the 7 training data subsets, and these subsets are reverse-translated once to obtain the reverse translation data set.
In the second mode, the data set to be processed consists of the training data subsets labeled "happy/thankful", "angry", and "lost", and these subsets are reverse-translated once to obtain the reverse translation data set.
TABLE 2
[Table 2 appears as an image in the original document and is not reproduced here.]
Further, whether the problem is an insufficient total data amount across the N training data subsets or an unbalanced distribution of data amounts among them, applying the data enhancement method of this embodiment to perform at least one reverse translation on the data set to be processed drawn from the N training data subsets expands the data volume of the training data and makes it richer.
In the data enhancement method provided by this embodiment, N training data subsets corresponding to N emotion types are obtained, where N is a positive integer, and a data set to be processed is determined from the N training data subsets according to the ratio of the data amount of each training data subset to the total data amount of the N training data subsets. The data set to be processed then undergoes at least one reverse translation to obtain a reverse translation data set, where reverse translation is the process of translating text into a non-Chinese language and then translating it back into Chinese, and a target training data set is obtained from the N training data subsets and the reverse translation data set. Through reverse translation, this embodiment can provide training data with a meaning similar to the original training data but a different wording, which increases the data volume and diversity of the training data, improves its quality, and solves the prior-art problems of an insufficient total amount of training data and an unbalanced distribution of data amounts across emotion types. Training the model on this data therefore allows the emotion type corresponding to text data to be determined accurately, improves the generalization capability and performance of the training model, and improves the emotion classification results.
The following describes in detail the technical solution of the embodiment of the data enhancement method shown in fig. 1 by using several specific embodiments.
Fig. 2 is a flowchart of a data enhancement method provided by the present invention, and as shown in fig. 2, the data enhancement method of this embodiment may include:
S200, classifying the original training data set according to the N emotion types to obtain N original training data subsets.
In this embodiment, the original training data may be collected in various manners, and the original training data set is then classified according to the N emotion types to obtain N original training data subsets. The specific set of emotion types is not limited here. Optionally, the N emotion types include at least one of happy/thankful, angry, lost, anxious, confused, worried, or other.
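This classification step can be sketched as simple grouping by label; the (text, emotion_label) pair format below is an illustrative assumption rather than a requirement of the method.
```python
from collections import defaultdict

def split_by_emotion(original_data):
    """Group labeled training samples into one subset per emotion type.

    `original_data` is assumed to be an iterable of (text, emotion_label) pairs.
    Returns a dict mapping each emotion type to its list of training texts.
    """
    subsets = defaultdict(list)
    for text, label in original_data:
        subsets[label].append(text)
    return dict(subsets)
```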
S201, N training data subsets are obtained, each training data subset corresponds to one emotion type, and N is a positive integer.
S202, determining a data set to be processed from the N training data subsets according to the proportion between the data quantity of the N training data subsets and the total data quantity of the N training data subsets.
S203, at least one reverse translation is carried out on the data set to be processed to obtain a reverse translation data set, and the reverse translation is a process of firstly carrying out non-Chinese translation on any text data and then carrying out Chinese translation.
And S204, obtaining a target training data set according to the N training data subsets and the reverse translation data set.
S201, S202, S203, and S204 are similar to the implementation manners of S101, S102, S103, and S104 in the embodiment of fig. 1, and are not described herein again.
Fig. 3 is a flowchart of a data enhancement method provided by the present invention, and as shown in fig. 3, the data enhancement method of this embodiment may include:
S301, N training data subsets are obtained, each training data subset corresponds to one emotion type, and N is a positive integer.
S302, determining a data set to be processed from the N training data subsets according to the proportion between the data amount of the N training data subsets and the total data amount of the N training data subsets.
S303, performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set, wherein the reverse translation is a process of performing non-Chinese translation on any text data and then performing Chinese translation.
S304, obtaining a target training data set according to the N training data subsets and the reverse translation data set.
S301, S302, S303, and S304 are similar to the implementation manners of S101, S102, S103, and S104 in the embodiment of fig. 1, and are not described herein again.
S305, training the initial training model according to the target training data set to obtain a target training model, wherein the target training model is used for determining the emotion type corresponding to the data to be tested.
Specifically, target training data can be obtained based on the above process, and the initial training model can be trained according to the target training data in this embodiment, so as to obtain a target training model capable of determining the emotion type corresponding to the data to be tested.
The initial training model may adopt a Convolutional Neural Network (CNN) model, including but not limited to: RCNN (Regions with CNN features), SSD (Single Shot MultiBox Detector), Mask RCNN, and other models.
For example, continuing with Table 2, from the 7 training data subsets and the reverse translation data set, a target training data set may be obtained. And the initial training model can adopt a CNN model, and further, the embodiment can train the CNN model according to the target training data set to obtain a target training model.
Further, the larger data volume and higher quality of the training data improve the performance of the target training model accordingly. The performance of the target training model has two important indicators: recall and precision. To represent the performance with a single number, the F1 value is usually used, where 2/F1 = 1/recall + 1/precision, i.e. F1 is the harmonic mean of precision and recall. In general, the larger the F1 value, the better the performance of the target training model.
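As a quick arithmetic sketch of the relationship above, F1 can be computed from precision and recall as follows (the numbers in the comment are illustrative only):
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2/F1 = 1/precision + 1/recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: precision = 0.8 and recall = 0.74 give F1 ≈ 0.769.
```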
For example, based on the customer service evaluations of an e-commerce platform (e.g., JD.com), there are test data set 1 and test data set 2, as shown in Table 3.
In Table 3, test data set 1 and test data set 2 both cover the 7 emotion types; the data amount corresponding to each emotion type and its ratio (i.e., proportion) to the total data amount across all emotion types are known, and the emotion type corresponding to each piece of text data in test data set 1 and test data set 2 is known.
TABLE 3
[Table 3 appears as an image in the original document and is not reproduced here.]
Next, three target training data sets are obtained in three ways.
In the first mode, the target training data set is the original data set.
In the second mode, the target training data set is the original data set plus the training data obtained by reverse-translating the entire original data set.
In the third mode, considering that during service the e-commerce customer service robot only needs to respond in time to customers' negative emotions, the target training data set is the original data set plus the training data obtained by reverse-translating the training data subsets corresponding to the emotion types of anxious, confused, and angry in the original data set.
Further, in this embodiment, the same initial training model may be trained on the three target training data sets obtained in the three modes, yielding three corresponding target training models. The three target training models are then each tested on the data corresponding to the emotion types of anxious, confused, and angry in test data set 1 and test data set 2, giving the F1 values of the three target training models shown in Tables 4 and 5.
TABLE 4
[Table 4 appears as an image in the original document and is not reproduced here.]
TABLE 5
[Table 5 appears as an image in the original document and is not reproduced here.]
As can be seen from Tables 4 and 5, on test data set 1 and test data set 2 the F1 value of the target training model obtained in the second mode was in some cases higher and in other cases lower than that of the target training model obtained in the first mode, while the F1 value of the target training model obtained in the third mode was the highest, at 0.820 and 0.716 respectively.
Furthermore, by adopting the method of the embodiment, higher-quality training data can be obtained, so that the performance of the target training model is improved, the classification of the emotion types by the target training model is more accurate, and further, the customer service robot can more accurately judge the emotion of a customer according to the expression of characters of the customer and timely make corresponding adjustment measures.
Fig. 4 is a schematic structural diagram of the data enhancement device provided in the present invention, and as shown in fig. 4, the data enhancement device 10 of the present embodiment includes:
an obtaining module 11, configured to obtain N training data subsets, where each training data subset corresponds to one emotion type, and N is a positive integer;
a determining module 12, configured to determine a to-be-processed data set from the N training data subsets according to a ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets, respectively;
the processing module 13 is configured to perform at least one reverse translation on the data set to be processed to obtain a reverse translation data set, where reverse translation is the process of translating any text data into a non-Chinese language and then translating it back into Chinese;
the processing module 13 is further configured to obtain a target training data set according to the N training data subsets and the reverse translation data set.
Optionally, the determining module 12 is specifically configured to determine whether a ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets is smaller than a preset ratio; and determining the training data subset corresponding to the preset proportion as a data set to be processed.
Optionally, the determining module 12 is specifically configured to sort the N training data subsets in ascending order of the ratio of their data amounts to the total data amount of the N training data subsets, and to determine the first M training data subsets as the data set to be processed, wherein N is more than or equal to M, and M is a positive integer.
Optionally, the processing module 13 is specifically configured to translate the data set to be processed into English and then back into Chinese to obtain the reverse translation data set.
Optionally, the obtaining module 11 is specifically configured to classify the original training data set according to the N emotion types to obtain N original training data subsets.
Optionally, the N emotion types include at least one of happy/thankful, angry, lost, anxious, confused, worried, or other.
Optionally, the processing module 13 is further configured to train the initial training model according to the target training data set to obtain a target training model, where the target training model is used to determine the emotion type corresponding to the data to be tested.
The data enhancement device provided in this embodiment can be used to perform the data enhancement method, and its implementation and technical effects are similar, and this embodiment is not described herein again.
In the present invention, the data enhancement device may be divided into functional modules according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that the division of the modules in the embodiments of the present invention is schematic, and is only a logical function division, and there may be another division manner in actual implementation.
Fig. 5 is a schematic diagram of a hardware structure of the electronic device provided by the present invention. As shown in fig. 5, the electronic device 20 is configured to implement the operation corresponding to the server or the terminal device in any of the above method embodiments, where the electronic device 20 of this embodiment may include: a memory 21 and a processor 22;
a memory 21 for storing a computer program;
a processor 22 for executing the computer program stored in the memory to implement the data enhancement method in the above embodiments. Reference may be made in particular to the description relating to the method embodiments described above.
Alternatively, the memory 21 may be separate or integrated with the processor 22.
When the memory 21 is a device separate from the processor 22, the electronic device 20 may further include:
a bus 23 for connecting the memory 21 and the processor 22.
Optionally, this embodiment further includes: a communication interface 24, which may be connected to the processor 22 via the bus 23. The processor 22 may control the communication interface 24 to implement the receiving and transmitting functions of the electronic device 20 described above.
The electronic device provided in this embodiment may be used to execute the data enhancement method, and the implementation manner and the technical effect thereof are similar, and this embodiment is not described herein again.
The present invention also provides a computer-readable storage medium including a computer program for implementing the data enhancement method as in the above embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The computer-readable storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of data enhancement, comprising:
acquiring N training data subsets, wherein each training data subset corresponds to one emotion type, and N is a positive integer;
determining a data set to be processed from the N training data subsets according to the proportion between the data amount of the N training data subsets and the total data amount of the N training data subsets;
performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set, wherein the reverse translation is a process of translating any text data into a non-Chinese language and then translating it back into Chinese;
and obtaining a target training data set according to the N training data subsets and the reverse translation data set.
2. The method according to claim 1, wherein the determining the dataset to be processed from the N training data subsets according to the ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets comprises:
judging whether the proportion of the data quantity of the N training data subsets to the total data quantity of the N training data subsets is smaller than a preset proportion or not;
and determining the training data subset corresponding to the preset proportion as the data set to be processed.
3. The method according to claim 1, wherein the determining the dataset to be processed from the N training data subsets according to the ratio between the data amount of the N training data subsets and the total data amount of the N training data subsets comprises:
sorting the N training data subsets according to the sequence that the proportion of the data quantity of the N training data subsets to the total data quantity of the N training data subsets is from small to large;
and determining the first M training data subsets as the data set to be processed, wherein N is more than or equal to M, and M is a positive integer.
4. The method according to claim 1, wherein the at least one reverse translation of the data set to be processed to obtain a reverse translated data set comprises:
and performing English translation on the data set to be processed and then performing Chinese translation to obtain the reverse translation data set.
5. The method of claim 1, wherein the obtaining N subsets of training data comprises:
and classifying the original training data set according to the N emotion types to obtain N original training data subsets.
6. The method of claim 5, wherein the N emotion types include at least one of happy/thankful, angry, lost, anxious, confused, worried, or other.
7. The method according to any one of claims 1-6, further comprising:
and training the initial training model according to the target training data set to obtain a target training model, wherein the target training model is used for determining the emotion type corresponding to the data to be tested.
8. A data enhancement apparatus, comprising:
the acquisition module is used for acquiring N training data subsets, each training data subset corresponds to one emotion type, and N is a positive integer;
a determining module, configured to determine a to-be-processed data set from the N training data subsets according to a ratio between a data amount of the N training data subsets and a total data amount of the N training data subsets, respectively;
the processing module is used for performing at least one reverse translation on the data set to be processed to obtain a reverse translation data set, where reverse translation is the process of translating any text data into a non-Chinese language and then translating it back into Chinese;
the processing module is further configured to obtain a target training data set according to the N training data subsets and the reverse translation data set.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data enhancement method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data enhancement method of any one of claims 1-7 via execution of the executable instructions.
CN201910038115.2A 2019-01-16 2019-01-16 Data enhancement method and device Pending CN111523951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910038115.2A CN111523951A (en) 2019-01-16 2019-01-16 Data enhancement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910038115.2A CN111523951A (en) 2019-01-16 2019-01-16 Data enhancement method and device

Publications (1)

Publication Number Publication Date
CN111523951A true CN111523951A (en) 2020-08-11

Family

ID=71900025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910038115.2A Pending CN111523951A (en) 2019-01-16 2019-01-16 Data enhancement method and device

Country Status (1)

Country Link
CN (1) CN111523951A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762423A (en) * 2021-11-09 2021-12-07 北京世纪好未来教育科技有限公司 Data processing and model training method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536953A (en) * 2015-01-22 2015-04-22 苏州大学 Method and device for recognizing textual emotion polarity
CN105468731A (en) * 2015-11-20 2016-04-06 成都科来软件有限公司 Preprocessing method of text sentiment analysis characteristic verification
CN106294466A (en) * 2015-06-02 2017-01-04 富士通株式会社 Disaggregated model construction method, disaggregated model build equipment and sorting technique
US20170017899A1 (en) * 2015-07-16 2017-01-19 SparkBeyond Ltd. Systems and methods for secondary knowledge utilization in machine learning
CN107170453A (en) * 2017-05-18 2017-09-15 百度在线网络技术(北京)有限公司 Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109190768A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 A kind of data enhancing corpus training method in neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536953A (en) * 2015-01-22 2015-04-22 苏州大学 Method and device for recognizing textual emotion polarity
CN106294466A (en) * 2015-06-02 2017-01-04 富士通株式会社 Disaggregated model construction method, disaggregated model build equipment and sorting technique
US20170017899A1 (en) * 2015-07-16 2017-01-19 SparkBeyond Ltd. Systems and methods for secondary knowledge utilization in machine learning
CN105468731A (en) * 2015-11-20 2016-04-06 成都科来软件有限公司 Preprocessing method of text sentiment analysis characteristic verification
CN107170453A (en) * 2017-05-18 2017-09-15 百度在线网络技术(北京)有限公司 Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109190768A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 A kind of data enhancing corpus training method in neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘秋慧; 柴玉梅; 刘箴: "SR-CBOW: a sentiment analysis model for Chinese microblogs" (中文微博情感分析模型SR-CBOW), Journal of Chinese Computer Systems (小型微型计算机系统), no. 08 *
毕秋敏; 李明; 曾志勇: "A semi-supervised microblog sentiment classification method combining active learning and co-training" (一种主动学习和协同训练相结合的半监督微博情感分类方法), New Technology of Library and Information Service (现代图书情报技术), no. 01 *
蔡子龙; 杨明明; 熊德意: "Neural machine translation based on data augmentation" (基于数据增强技术的神经机器翻译), Journal of Chinese Information Processing (中文信息学报), no. 07 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762423A (en) * 2021-11-09 2021-12-07 北京世纪好未来教育科技有限公司 Data processing and model training method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108073568B (en) Keyword extraction method and device
CN102576358B (en) Word pair acquisition device, word pair acquisition method, and program
CN107346433B (en) Text data classification method and server
CN110489449B (en) Chart recommendation method and device and electronic equipment
CN108090211B (en) Hot news pushing method and device
CN109325121B (en) Method and device for determining keywords of text
CN108959453B (en) Information extraction method and device based on text clustering and readable storage medium
CN109933648B (en) Real user comment distinguishing method and device
CN109446393B (en) Network community topic classification method and device
CN108052509A (en) A kind of Text similarity computing method, apparatus and server
CN107704869B (en) Corpus data sampling method and model training method
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN104102662B (en) A kind of user interest preference similarity determines method and device
CN108470065B (en) Method and device for determining abnormal comment text
CN110555093B (en) Text matching method, device and equipment
CN111523951A (en) Data enhancement method and device
CN111400516B (en) Label determining method, electronic device and storage medium
CN107704763A (en) Multi-source heterogeneous leak information De-weight method, stage division and device
CN115687790B (en) Advertisement pushing method and system based on big data and cloud platform
CN109460555B (en) Document judgment method and device and electronic equipment
CN108985379B (en) Method and device for evaluating performance of classifier and computer readable storage medium
CN104615681B (en) Text selection method and device
CN114281983B (en) Hierarchical text classification method, hierarchical text classification system, electronic device and storage medium
CN103279549A (en) Method and device for acquiring target data of target objects
CN104809236A (en) Microblog-based user age classification method and Microblog-based user age classification system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210519

Address after: 100176 room 1004, 10th floor, building 1, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Beijing Huijun Technology Co.,Ltd.

Address before: 8 / F, 76 Zhichun Road, Haidian District, Beijing 100195

Applicant before: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd.

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination