CN114239590A - Data processing method and device - Google Patents

Info

Publication number
CN114239590A
Authority
CN
China
Prior art keywords
data
text
text data
characters
sensitive word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111456715.4A
Other languages
Chinese (zh)
Other versions
CN114239590B (en)
Inventor
李长林
蒋宁
王洪斌
吴海英
权佳成
曹磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd
Priority to CN202111456715.4A
Publication of CN114239590A
Application granted
Publication of CN114239590B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a data processing method and a data processing device, relating to the technical field of data enhancement, which preserve sample quality while enlarging the scale of a data sample, thereby avoiding pollution of the original data set. The main technical scheme of the invention is as follows: acquiring first text data, wherein the first text data comprises sensitive words; inserting characters into the non-sensitive word data of the first text data, or performing synonym replacement on the non-sensitive word data of the first text data, to obtain second text data; and, if the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data, determining the second text data as the enhancement data of the first text data. The method is mainly applied to data enhancement of text data that contains sensitive words and is available only in small quantities.

Description

Data processing method and device
Technical Field
The present invention relates to the field of data enhancement technologies, and in particular, to a data processing method and apparatus.
Background
In some scenarios, data samples are few or very few. For methods that do not rely on a pre-trained language model, such as machine learning methods, a semantic model cannot be trained well on such a small amount of data. A data enhancement method can be used to expand the data sample size: the larger the sample size and the higher its quality, the better the generalization capability of the trained model.
Currently, text data enhancement methods in Natural Language Processing (NLP) fall roughly into two types: one injects noise into the text representation to expand the amount of data; the other modifies the original text before it is represented, by synonym replacement, random insertion, random deletion and the like, to expand the data volume.
However, with these existing data enhancement methods it is difficult to control whether the semantics of the enhanced data samples change. Although the amount of data increases, if the semantics also change, the quality of the enhanced data samples is low, which may have an uncontrollable influence on model training.
Disclosure of Invention
In view of this, the present invention provides a data processing method and apparatus. The main aim is to obtain a larger-scale data sample through enhancement processing while avoiding, to the greatest extent possible, changes to the semantics of the enhanced data, thereby ensuring the quality of the enhancement processing and avoiding pollution of the original data set, which benefits subsequent model training.
To achieve the above purpose, the present invention mainly provides the following technical solutions:
A first aspect of the present application provides a data processing method, including:
acquiring first text data, wherein the first text data comprises sensitive words;
inserting characters into the non-sensitive word data of the first text data or performing synonym replacement on the non-sensitive word data of the first text data to obtain second text data;
and if the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data, determining the second text data as the enhancement data of the first text data.
A second aspect of the present application provides a data processing apparatus comprising:
an acquisition unit, configured to acquire first text data, wherein the first text data comprises sensitive words;
a processing unit, configured to insert characters into the non-sensitive word data of the first text data or perform synonym replacement on the non-sensitive word data of the first text data to obtain second text data;
and a determining unit, configured to determine the second text data as the enhancement data of the first text data if the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data.
A third aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described above.
A fourth aspect of the present application provides an electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method as described above when executing the computer program.
The technical solutions provided by the invention have at least the following advantages:
The invention provides a data processing method and a data processing device. For first text data containing sensitive words, enhancement processing is performed by inserting characters into the non-sensitive word data of the first text data or by replacing it with synonyms, yielding second text data; if the emotion polarities of the second text data and the first text data are judged to be the same, the second text data is determined as the enhancement data of the first text data. Compared with the prior art, the data enhancement takes into account both keeping the sensitive words unchanged and keeping the emotion polarity of the text data unchanged, so that changes to the semantics of the enhanced data are avoided to the greatest extent possible. This addresses the prior-art problem that the quality of enhanced data samples is hard to guarantee because their semantics are hard to control: quality is preserved while the scale of the data sample is enlarged, pollution of the original data set is avoided, and subsequent model training benefits.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another data processing method according to an embodiment of the present invention;
FIG. 3 is a simplified flowchart of an exemplary data enhancement process according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of another data processing apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of the present invention provides a data processing method, shown in FIG. 1, which obtains a larger-scale data sample through enhancement processing while avoiding, to the greatest extent possible, changes to the semantics of the enhanced data. The embodiment provides the following specific steps:
101. Acquiring first text data, wherein the first text data comprises sensitive words.
In the embodiment of the present invention, the first text data refers to text data to be subjected to enhancement processing; it may include one or more texts, each text including at least one sentence. Enhancement processing of data in this embodiment may also be understood as data augmentation, that is, expanding the data. Preferably, the object of data enhancement is text data containing relatively few characters that expresses a single theme or centers on one core meaning, rather than text data carrying varied or complex semantics, so that the later check that the semantics are unchanged can be used to screen the text data produced by enhancement.
The sensitive words are keywords predefined according to the requirements of the actual service scenario. For example, in a customer service application scenario, the first text data may be: "Sir, you have been overdue for two months. Please deal with it as soon as possible; otherwise, we will send your information to your household location", in which "household location" is a preset sensitive word.
It should be noted that the words "first" and "second" are only used to distinguish different text data, that is, text data without enhancement processing is identified as first text data, and text data with enhancement processing is identified as second text data.
For example, if the proportions of positive and negative samples in the first text data differ greatly, that is, the number of positive samples is much greater than the number of negative samples, the model training metrics may be affected, so such data samples need enhancement processing; that is, the negative samples require enhancement.
For example, in the customer service application scenario, text data such as "May I ask whether your household location is ×" carries no negative emotional color and is a positive sample; such text data is generally plentiful. Text data such as "Sir, you have been overdue for two months. Please deal with it as soon as possible; otherwise, we will send your information to your household location" carries a negative emotional color and is a negative sample. Such samples are scarce during text data collection, so data enhancement is needed to enlarge the data size.
102. Inserting characters into the non-sensitive word data of the first text data or performing synonym replacement on the non-sensitive word data of the first text data to obtain second text data.
In the embodiment of the present invention, two data enhancement processing methods may be adopted: one way is to insert characters into non-sensitive word data of the first text data to realize data enhancement processing; another way is to perform data enhancement processing implemented by synonym replacement on the non-sensitive word data of the first text data. The text data obtained through the data enhancement processing is marked as second text data.
With either of the above data enhancement processing manners, whether character insertion or synonym replacement, the ordering of adjacent literal characters within the first text data is unchanged, and the sensitive word data within the first text data is preserved.
The data enhancement processing of the first text data therefore satisfies two constraint rules: one is that the ordering of the literal characters in the first text data is unchanged; the other is that the sensitive words are kept unchanged.
For example, for the text data "Sir, you have been overdue for two months. Please deal with it as soon as possible; otherwise, we will send your information to your household location", in which the sensitive word is defined as "household location", no matter how the data enhancement is performed, the order of any two adjacent characters in the original text data must not change, and the word "household location" must be retained and must not be split apart.
It should be especially noted that the purpose of keeping the sensitive words intact is that the enhanced text data remains centered on the sensitive words as its semantic core to the greatest extent possible, avoiding semantic change.
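The two constraint rules can be checked mechanically for the character-insertion case. The following is a minimal sketch, not the patent's implementation: the function names are hypothetical, and the subsequence test is only one assumed way to express "character ordering is unchanged" when characters are inserted (it does not apply to synonym replacement, which changes characters).

```python
from typing import Sequence


def is_subsequence(original: str, augmented: str) -> bool:
    """Return True if every character of `original` occurs in `augmented` in the same order."""
    remaining = iter(augmented)
    # `ch in remaining` advances the iterator until it finds ch, so order is enforced.
    return all(ch in remaining for ch in original)


def satisfies_insertion_constraints(
    original: str, augmented: str, sensitive_words: Sequence[str]
) -> bool:
    """Check rule one (order preserved) and rule two (sensitive words intact)."""
    if not is_subsequence(original, augmented):
        return False
    # A sensitive word that was in the original must still appear contiguously.
    return all(word in augmented for word in sensitive_words if word in original)
```

For instance, inserting a comma between "send it" and "to the household location" passes both checks, while inserting it inside "household location" fails the second one.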
103. If the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data, determining the second text data as the enhancement data of the first text data.
In the embodiment of the present invention, the two constraint rules given in step 102 constitute one constraint condition. A further constraint condition is also required, namely that of step 103: the emotion polarities of the second text data and the first text data must be the same. The second text data is determined as the enhancement data of the first text data only when both constraint conditions are met.
It should be noted that the two constraint conditions are applied in combination, one after the other. Applying them together can replace judging whether the semantics of the first text data and the second text data are the same by complex semantic analysis from natural language processing, and indirectly measures whether the semantics of the second text data have changed relative to the first text data. Because complex semantic analysis is avoided, the efficiency of obtaining second text data with unchanged semantics is improved.
In the above, an embodiment of the present invention provides a data processing method. For first text data containing sensitive words, enhancement processing is performed by inserting characters into the non-sensitive word data of the first text data or by performing synonym replacement, yielding second text data; if the emotion polarities of the second text data and the first text data are judged to be the same, the second text data is determined as the enhancement data of the first text data. Compared with the prior art, the embodiment takes into account both keeping the sensitive words unchanged and keeping the emotion polarity of the text data unchanged, so that changes to the semantics of the enhanced text are avoided to the greatest extent possible. This addresses the prior-art problem that the quality of enhanced data samples is hard to guarantee because their semantics are hard to control: quality is preserved while the scale of the data sample is enlarged, pollution of the original data set is avoided, and subsequent model training benefits.
To describe the above embodiment in more detail, the embodiment of the present invention provides another data processing method, shown in FIG. 2, which refines and supplements the above embodiment. The following specific steps are provided:
201. acquiring first text data, wherein the first text data comprises sensitive words.
In the embodiment of the present invention, for the explanation of this step, refer to step 101, which is not described herein again.
202. Acquiring the number of literal characters of all texts in the first text data.
203a. If the number of literal characters is greater than a preset threshold, determining the text length of the first text data as a long text.
203b. If the number of literal characters is not greater than the preset threshold, determining the text length of the first text data as a short text.
In the embodiment of the present invention, different data enhancement methods are adopted according to the text length of the first text data. Whether the first text data is a long text or a short text is determined by comparing the number of literal characters with a preset threshold: if the number of literal characters of all texts in the first text data is greater than the preset threshold, the text is determined to be a long text; otherwise, it is determined to be a short text. The preset threshold is set in advance according to the actual requirements of different service scenarios.
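Steps 202 to 203b amount to a threshold comparison. The sketch below is illustrative only: the threshold value, the function names, and the choice to count only alphanumeric characters (so punctuation and spaces are not treated as literal characters) are assumptions, since the source does not fix any of them.

```python
from typing import Iterable

# Hypothetical threshold; the source says it is preset per service scenario.
PRESET_THRESHOLD = 30


def count_literal_characters(texts: Iterable[str]) -> int:
    """Count literal characters across all texts (assumption: letters, digits, CJK characters)."""
    return sum(1 for text in texts for ch in text if ch.isalnum())


def classify_length(texts: Iterable[str], threshold: int = PRESET_THRESHOLD) -> str:
    """Return "long" if the character count exceeds the threshold, otherwise "short"."""
    return "long" if count_literal_characters(texts) > threshold else "short"
```

A "long" result would route the data to character insertion (step 204a) and a "short" result to synonym replacement (step 204b).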
204a, if the text length of the first text data is a long text, inserting characters into the non-sensitive word data of the first text data to obtain second text data.
In the embodiment of the present invention, if the text length of the first text data is a long text, the data enhancement method is: characters are inserted into the non-sensitive word data of the first text data, and no characters are inserted inside sensitive words.
Example method 1: a specific implementation method for inserting characters into the non-sensitive word data of the first text data may include: acquiring the average length of the texts in the first text data; determining the number of first characters corresponding to the average text length according to a preset first mapping relationship; and inserting characters into the non-sensitive word data of the first text data according to the number of first characters to obtain second text data.
The preset first mapping relationship is a mapping between the average length of a text and the number of characters to be inserted, and can be preset according to the requirements of different actual application scenarios. In the embodiments of the present invention, the inserted characters split the original text into multiple parts, and the characters used are preferably punctuation marks such as commas or periods.
Then, regardless of how many texts the first text data includes, the average number of literal characters per text, that is, the average text length, can be calculated by counting the literal characters in each text. Querying the preset first mapping relationship then gives the number of characters to be inserted for texts of this average length (denoted the number of first characters). Characters are then inserted, according to the number of first characters, into the non-sensitive word data of each text in the first text data; that is, no characters are inserted inside sensitive words, and the sensitive words are not split apart.
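Example method 1 can be sketched as follows. This is a hedged illustration under stated assumptions, not the patent's code: `FIRST_MAPPING` and its values are invented for the example, text length is taken as `len(text)`, and insertion positions are chosen at random (one of the options the description mentions).

```python
import random

# Hypothetical first mapping: (upper bound on average text length, characters to insert).
FIRST_MAPPING = [(20, 1), (50, 2), (float("inf"), 3)]


def lookup_insert_count(avg_len, mapping=FIRST_MAPPING):
    """Return the number of characters to insert for a given average text length."""
    for upper, count in mapping:
        if avg_len <= upper:
            return count
    return mapping[-1][1]


def allowed_positions(text, sensitive_words):
    """Gap indices where insertion is allowed, i.e. not strictly inside a sensitive word."""
    blocked = set()
    for word in sensitive_words:
        start = text.find(word)
        while start != -1:
            blocked.update(range(start + 1, start + len(word)))
            start = text.find(word, start + 1)
    return [i for i in range(1, len(text)) if i not in blocked]


def insert_punctuation(text, sensitive_words, count, rng, marks=",."):
    """Insert `count` commas or periods at random allowed positions."""
    positions = allowed_positions(text, sensitive_words)
    chosen = rng.sample(positions, min(count, len(positions)))
    # Insert from right to left so earlier indices stay valid.
    for pos in sorted(chosen, reverse=True):
        text = text[:pos] + rng.choice(marks) + text[pos:]
    return text


def augment_long_texts(texts, sensitive_words, seed=0):
    """Apply the average-length mapping, then insert characters into every text."""
    rng = random.Random(seed)
    avg_len = sum(len(t) for t in texts) / len(texts)
    count = lookup_insert_count(avg_len)
    return [insert_punctuation(t, sensitive_words, count, rng) for t in texts]
```

Because characters are only ever added, removing the inserted punctuation recovers the original text, which is one way to confirm that character ordering was preserved.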
Example method 2: a specific implementation method for inserting characters into the non-sensitive word data of the first text data may include: acquiring the median of the text lengths in the first text data; determining the number of second characters corresponding to the median text length according to a preset second mapping relationship; and inserting characters into the non-sensitive word data of the first text data according to the number of second characters to obtain second text data.
The preset second mapping relationship is a mapping between the median text length and the number of characters to be inserted, and can be preset according to the requirements of different actual application scenarios. In the embodiments of the present invention, the inserted characters split the original text into multiple parts, and the characters used are preferably punctuation marks such as commas or periods.
Then, regardless of how many texts the first text data includes, the literal characters in each text can be counted to determine the length of each text and the median of the text lengths. Querying the preset second mapping relationship then gives the number of characters to be inserted for texts of this median length (denoted the number of second characters). Characters are then inserted, according to the number of second characters, into the non-sensitive word data of each text in the first text data; that is, no characters are inserted inside sensitive words, and the sensitive words are not split apart.
It should be noted that, in example method 1 and example method 2, the characters may be inserted at random positions within the non-sensitive word data of the text, or at intervals of a fixed number of literal characters. For example, commas or periods are inserted randomly into the first text data, but never inside sensitive words.
The character insertion operation still follows the two constraint rules: rule one, the ordering of the literal characters in the first text data is unchanged; rule two, the sensitive words are kept unchanged.
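Example method 2 combined with fixed-interval insertion might look like the sketch below. The mapping values, the function names, and the way the interval is derived from the insertion count (`len(text) // (count + 1)`) are all assumptions made for illustration. Positions that would fall inside a sensitive word are simply skipped, so fewer characters than requested may be inserted.

```python
from statistics import median

# Hypothetical second mapping: (upper bound on median text length, characters to insert).
SECOND_MAPPING = [(20, 1), (50, 2), (float("inf"), 3)]


def lookup_insert_count(median_len, mapping=SECOND_MAPPING):
    """Return the number of characters to insert for a given median text length."""
    for upper, count in mapping:
        if median_len <= upper:
            return count
    return mapping[-1][1]


def sensitive_spans(text, sensitive_words):
    """Return (start, end) index pairs of every sensitive-word occurrence."""
    spans = []
    for word in sensitive_words:
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word)))
            start = text.find(word, start + 1)
    return spans


def insert_at_intervals(text, sensitive_words, count, mark=","):
    """Insert `mark` every `step` characters, skipping gaps inside sensitive words."""
    if count <= 0 or not text:
        return text
    step = max(1, len(text) // (count + 1))
    spans = sensitive_spans(text, sensitive_words)
    positions = [
        k * step
        for k in range(1, count + 1)
        if k * step < len(text) and not any(s < k * step < e for s, e in spans)
    ]
    # Insert from right to left so earlier indices stay valid.
    for pos in sorted(positions, reverse=True):
        text = text[:pos] + mark + text[pos:]
    return text


def augment_by_median(texts, sensitive_words):
    """Apply the median-length mapping, then insert characters into every text."""
    count = lookup_insert_count(median(len(t) for t in texts))
    return [insert_at_intervals(t, sensitive_words, count) for t in texts]
```

Skipping blocked positions is the simplest policy; shifting the insertion to the nearest allowed gap would be an equally plausible variant.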
204b, if the text length of the first text data is a short text, performing synonym replacement on the non-sensitive word data in the first text data to obtain second text data.
In the embodiment of the present invention, if the text length of the first text data is a short text, the data enhancement method is: synonym replacement is performed on words contained in the first text data, but sensitive words are never replaced.
Example method 3: a specific implementation method for performing synonym replacement on the non-sensitive word data in the first text data may include: acquiring the length of each text in the first text data; determining the synonym replacement ratio corresponding to each text according to a preset third mapping relationship; and performing synonym replacement on the non-sensitive word data in the first text data according to the synonym replacement ratio to obtain second text data.
The preset third mapping relationship is a mapping between text length and synonym replacement ratio. The synonym replacement ratio is the percentage of words in a text that need to be replaced by synonyms (that is, the number of words to be replaced by synonyms as a percentage of the total number of words in the text). The preset third mapping relationship can be preset according to the requirements of different actual application scenarios.
Then, regardless of how many texts the first text data includes, the literal characters in each text can be counted to obtain the length of each text, and the synonym replacement ratio corresponding to each text is obtained by querying the preset third mapping relationship. Specifically, synonym replacement may replace randomly chosen words in the text, or may traverse the words from first to last in the order of the literal character sequence and perform the replacement operation.
It should be noted that, in example method 3, synonym replacement still follows the two constraint rules: rule one, the ordering of the literal characters in the first text data is unchanged; rule two, the sensitive words are kept unchanged.
In the embodiment of the present invention, synonym replacement means replacing a word in a text with a synonym, with the synonym's literal characters placed at the position of the replaced word. If the synonym and the replaced word contain different numbers of literal characters, it is only necessary to ensure that the synonym is inserted at the position of the original replaced word. Sensitive words are never subjected to synonym replacement.
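Example method 3 can be sketched as below. The sketch assumes the text has already been segmented into word tokens with each sensitive word kept as a single token; the mapping values, the toy synonym dictionary passed by the caller, and the rounding rule (`math.ceil`) are hypothetical choices, not taken from the source.

```python
import math
import random

# Hypothetical third mapping: (upper bound on text length, synonym replacement ratio).
THIRD_MAPPING = [(10, 0.5), (30, 0.3), (float("inf"), 0.2)]


def lookup_ratio(text_length, mapping=THIRD_MAPPING):
    """Return the synonym replacement ratio for a given text length."""
    for upper, ratio in mapping:
        if text_length <= upper:
            return ratio
    return mapping[-1][1]


def replace_synonyms(tokens, sensitive_words, synonyms, seed=0):
    """Replace a ratio of non-sensitive tokens in place with randomly chosen synonyms."""
    rng = random.Random(seed)
    text_length = sum(len(tok) for tok in tokens)
    ratio = lookup_ratio(text_length)
    # Only tokens that are not sensitive words and have a known synonym are candidates.
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok not in sensitive_words and tok in synonyms
    ]
    k = min(len(candidates), math.ceil(ratio * len(tokens)))
    out = list(tokens)
    for i in rng.sample(candidates, k):
        # The synonym takes the position of the word it replaces.
        out[i] = rng.choice(synonyms[tokens[i]])
    return out
```

Because replacement happens index by index, each synonym lands at the position of the original word even when the two differ in length, and sensitive tokens are never touched even if the dictionary lists a synonym for them.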
205. If the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data, determining the second text data as the enhancement data of the first text data.
The emotion polarities include: positive polarity, neutral polarity, or negative polarity.
In the embodiment of the invention, a text emotion classification model can be trained in advance (the preset text emotion classification model), so that the emotion polarity of text data can be judged conveniently.
For example, the first text data is input into the preset text emotion classification model, which outputs the emotion polarity corresponding to the first text data; the second text data is input into the same model, which outputs the emotion polarity corresponding to the second text data. If the emotion polarities of the second text data and the first text data are judged to be the same, the second text data is saved as enhancement data.
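The polarity check in step 205 reduces to comparing two classifier outputs. In the sketch below, `toy_polarity` is a hypothetical keyword-based stand-in for the preset text emotion classification model (the source does not describe the model's architecture), and the cue words are invented for the example.

```python
from typing import Callable, List


def toy_polarity(text: str) -> str:
    """Hypothetical stand-in classifier: returns "positive", "neutral", or "negative"."""
    lowered = text.lower()
    if any(cue in lowered for cue in ("overdue", "otherwise")):
        return "negative"
    if any(cue in lowered for cue in ("thank", "great")):
        return "positive"
    return "neutral"


def keep_same_polarity(
    first_text: str, candidates: List[str], classify: Callable[[str], str]
) -> List[str]:
    """Keep only candidates whose polarity matches that of the first text."""
    target = classify(first_text)
    return [cand for cand in candidates if classify(cand) == target]
```

Any trained classifier exposing the same `str -> str` interface could be passed as `classify` in place of the toy function.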
It should be noted that, even before the first text data and the second text data are input into the preset text emotion classification model, a comparison shows that the literal characters common to both are in the same order and that both contain the same sensitive words. Keeping the ordering of the literal characters unchanged and retaining the sensitive words therefore already ensures, as far as possible, that the semantics of the second text data are unchanged, so the text data submitted to the model for judgment is of relatively high quality. This avoids using the model to process large amounts of worthless redundant data and improves the efficiency of obtaining second text data whose emotion polarity is unchanged from the first text data.
However, if the emotion polarities of the second text data and the first text data are judged to be different, it can be indirectly determined that the semantics of the second text data have changed relative to the first text data, and such second text data should be discarded to ensure the quality of the enhanced data. Steps 204a, 204b and 205 of the embodiment of the present invention may then be executed again to perform data enhancement on the first text data again, so that multiple rounds of data enhancement can be carried out by repetition.
Furthermore, the second text data (that is, the enhanced samples with negative or positive text emotion) can be labeled after manual checking and added to a text emotion labeled data set for iterative optimization of the preset text emotion classification model. Using the second text data for this iterative optimization improves the recognition accuracy of the text emotion classification model, which further improves the quality of the enhanced data; the improved recognition accuracy also reduces the number of data enhancement cycles required.
In addition, an upper limit on the number of data enhancement rounds, or a target number of data enhancement results, can be set. When the number of rounds reaches its upper limit or the number of results reaches the target, the repeated data enhancement is stopped and the task terminates, avoiding redundant operations and wasted processing resources.
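The repeat-until-limit behavior can be sketched as a bounded loop. The function name, the default limits, and the callback signatures are hypothetical; skipping duplicate candidates is an added assumption that the source does not state.

```python
from typing import Callable, List, Tuple


def run_enhancement_rounds(
    first_text: str,
    augment_once: Callable[[str, int], str],
    classify: Callable[[str], str],
    max_rounds: int = 5,
    target_count: int = 3,
) -> Tuple[List[str], int]:
    """Repeat augmentation until the round limit or the target number of results is reached."""
    target_polarity = classify(first_text)
    kept: List[str] = []
    rounds = 0
    while rounds < max_rounds and len(kept) < target_count:
        rounds += 1
        candidate = augment_once(first_text, rounds)
        # Assumption: identical candidates are not stored twice.
        if candidate not in kept and classify(candidate) == target_polarity:
            kept.append(candidate)
    return kept, rounds
```

Here `augment_once` receives the round number so that a caller can vary the random seed or the insertion position from round to round.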
Illustratively, embodiments of the present invention also provide a simple flow chart of the data enhancement process, as shown in fig. 3. For first text data, a criterion for deciding whether text data is a long text or a short text in a given scene is preset according to the "scene sentence length distribution". For a long text, data enhancement processing is performed on the first text data by inserting characters, and processing condition I must be met (characters are inserted, but the order of the existing characters is not changed and no character is inserted into a sensitive word). For a short text, data enhancement processing is performed by synonym replacement, and processing condition II must be met (each replacement synonym is placed at the position of the original word it replaces, and no sensitive word is replaced). The preset text emotion classification model then assists in judging whether the emotion polarity of the second text data has changed relative to that of the first text data: if it has not changed, the second text data is retained as enhancement data; otherwise, the data enhancement processing of the first text data is executed again.
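For illustration only, processing condition I can be checked mechanically: the original text must remain a subsequence of the enhanced text (character order unchanged), and every sensitive word must still appear intact. The following Python sketch shows one way to express this; the function names are hypothetical, and counting occurrences is a simplifying assumption rather than the check used in the embodiment.

```python
from typing import Iterable


def is_subsequence(original: str, augmented: str) -> bool:
    """Return True if the characters of original appear in augmented in order."""
    remaining = iter(augmented)
    # "ch in remaining" consumes the iterator up to the first match,
    # so each later character is searched for only after the earlier ones.
    return all(ch in remaining for ch in original)


def satisfies_condition_one(
    original: str, augmented: str, sensitive_words: Iterable[str]
) -> bool:
    """Check that insertion kept character order and left sensitive words intact."""
    if not is_subsequence(original, augmented):
        return False
    # A sensitive word that was split by an inserted character no longer
    # occurs as a contiguous substring, so its count drops.
    return all(augmented.count(w) >= original.count(w) for w in sensitive_words)
```

Such a check could run before the classification model is called, so that candidates violating the condition are never submitted to the model.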
In summary, embodiments of the present invention provide a data processing method and apparatus. For first text data including a sensitive word, it is first determined whether the first text data is a long text or a short text. If the first text data is a long text, data enhancement processing is implemented by inserting characters into the non-sensitive word data of the first text data; if the first text data is a short text, data enhancement processing is implemented by performing synonym replacement on the non-sensitive word data of the first text data. More targeted data enhancement processing is thus implemented for long texts and short texts respectively. Then, if the emotion polarity of the second text data is determined to be the same as that of the first text data, the second text data is determined as enhancement data of the first text data. Compared with the prior art, the embodiment of the invention combines targeted data enhancement processing with a check that the emotion polarity of the second text data is unchanged after enhancement. This enlarges the scale of the data samples while also improving the accuracy, and hence the quality, of the data enhancement processing, and it addresses the prior-art problem that the quality of enhanced data samples is difficult to guarantee because their semantics are hard to control.
Further, as an implementation of the methods shown in fig. 1 and fig. 2, an embodiment of the present invention provides a data processing apparatus. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. The device is applied to implement data enhancement processing on text data containing few sensitive words, and specifically as shown in fig. 4, the device comprises:
the acquiring unit 31 is configured to acquire first text data, where the first text data includes a sensitive word;
a processing unit 32, configured to insert characters into the non-sensitive word data of the first text data or perform synonym replacement on the non-sensitive word data of the first text data, to obtain second text data;
a determining unit 33, configured to determine the second text data as the enhancement data of the first text data if the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data.
Further, as shown in fig. 5, the processing unit 32 includes:
a first processing module 321, configured to insert characters into non-sensitive word data of the first text data to obtain second text data if the text length of the first text data is a long text;
the second processing module 322 is configured to, if the text length of the first text data is a short text, perform synonym replacement on non-sensitive word data in the first text data to obtain the second text data.
Further, as shown in fig. 5, the first processing module 321 includes:
the obtaining sub-module 3211 is configured to obtain an average length of a text in the first text data;
the determining submodule 3212 is configured to determine, according to a preset first mapping relationship, the number of first characters corresponding to the average length of the text;
the inserting sub-module 3213 is configured to insert characters into the non-sensitive word data of the first text data according to the number of the first characters to obtain the second text data.
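As a hedged illustration of the inserting sub-module, the following Python sketch inserts a given number of characters only at positions that do not fall inside a sensitive word. The function names, the filler characters, and the use of a seeded random generator are assumptions of this sketch; the embodiment does not specify which characters are inserted or how positions are chosen.

```python
import random
from typing import List, Sequence, Tuple


def sensitive_spans(text: str, sensitive_words: Sequence[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of every sensitive word occurrence."""
    spans = []
    for word in sensitive_words:
        if not word:
            continue
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word)))
            start = text.find(word, start + 1)
    return spans


def allowed_positions(text: str, sensitive_words: Sequence[str]) -> List[int]:
    """Positions p (insert before text[p]) that do not split a sensitive word."""
    spans = sensitive_spans(text, sensitive_words)
    return [
        p for p in range(len(text) + 1)
        if not any(start < p < end for start, end in spans)
    ]


def insert_characters(
    text: str,
    sensitive_words: Sequence[str],
    filler_chars: str,
    count: int,
    rng: random.Random,
) -> str:
    """Insert count filler characters without reordering or splitting anything."""
    positions = allowed_positions(text, sensitive_words)
    # Insert from right to left so that earlier positions stay valid.
    chosen = sorted(rng.choices(positions, k=count), reverse=True)
    chars = list(text)
    for p in chosen:
        chars.insert(p, rng.choice(filler_chars))
    return "".join(chars)
```

Because characters are only added, the original characters keep their relative order, and because positions strictly inside a sensitive word are excluded, each sensitive word stays contiguous.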
Further, as shown in fig. 5, the first processing module 321 includes:
the obtaining submodule 3211 is further configured to obtain a median of a text length in the first text data;
the determining submodule 3212 is further configured to determine, according to a preset second mapping relationship, the number of second characters corresponding to the median of the text length;
the inserting sub-module 3213 is further configured to insert characters into the non-sensitive word data of the first text data according to the number of the second characters, so as to obtain the second text data.
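By way of example, the preset mapping relationship between a length statistic (average or median) and the number of characters to insert could be represented as a set of length buckets. In the Python sketch below, the bucket boundaries and counts in `LENGTH_BOUNDS` and `INSERT_COUNTS` are purely hypothetical values chosen for illustration; the embodiment does not disclose the contents of the first or second mapping relationship.

```python
import bisect
from statistics import mean, median
from typing import Sequence, Tuple

# Hypothetical mapping: upper bounds of length buckets and the number of
# characters to insert for each bucket (the last count covers longer texts).
LENGTH_BOUNDS = [20, 50, 100]
INSERT_COUNTS = [1, 2, 3, 5]


def length_statistics(texts: Sequence[str]) -> Tuple[float, float]:
    """Return the average and the median text length, in characters."""
    lengths = [len(t) for t in texts]
    return mean(lengths), median(lengths)


def insert_count_for(length_statistic: float) -> int:
    """Look up the number of characters to insert for a length statistic."""
    bucket = bisect.bisect_left(LENGTH_BOUNDS, length_statistic)
    return INSERT_COUNTS[bucket]
```

The same lookup serves both variants: passing the average length corresponds to the first mapping relationship, and passing the median corresponds to the second, although in practice the two mappings need not share the same table.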
Further, as shown in fig. 5, the second processing module 322 includes:
the obtaining sub-module 3221 is configured to obtain lengths of each non-sensitive word data in the first text data;
the determining sub-module 3222 is configured to determine, according to a preset third mapping relationship, a synonym replacement ratio corresponding to each piece of non-sensitive word data;
the replacing sub-module 3223 is configured to perform synonym replacement on the non-sensitive word data in the first text data according to the synonym replacement ratio, so as to obtain the second text data.
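The replacing sub-module can be illustrated, under stated assumptions, by the Python sketch below. It replaces a proportion of the non-sensitive words with synonyms, puts each synonym at the position of the word it replaces, and never touches a sensitive word. The function name, the pre-tokenized input, the synonym dictionary, and the use of a single overall ratio (rather than a ratio per piece of non-sensitive word data) are simplifications introduced for this sketch.

```python
import math
import random
from typing import Dict, List, Sequence, Set


def replace_synonyms(
    tokens: Sequence[str],
    sensitive_words: Set[str],
    synonyms: Dict[str, List[str]],
    ratio: float,
    rng: random.Random,
) -> List[str]:
    """Replace about ratio of the replaceable non-sensitive tokens in place."""
    # Only non-sensitive tokens that have at least one synonym are candidates.
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok not in sensitive_words and synonyms.get(tok)
    ]
    k = min(len(candidates), math.ceil(ratio * len(candidates)))
    result = list(tokens)
    for i in rng.sample(candidates, k):
        # The synonym takes the position of the original word.
        result[i] = rng.choice(synonyms[tokens[i]])
    return result
```

Since the output list has the same length as the input and only the selected positions change, the word order of the first text data is preserved.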
Further, as shown in fig. 5, the apparatus further includes:
the acquiring unit 31 is further configured to acquire the number of text characters of all texts in the first text data;
the determining unit 33 is further configured to determine the text length of the first text data as a long text if the number of the text characters is greater than a preset threshold;
the determining unit 33 is further configured to determine the text length of the first text data as a short text if the number of the text characters is not greater than the preset threshold.
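A minimal sketch of this length determination in Python is given below. It assumes a single text whose length is its character count; the function names and the strategy labels returned by `choose_strategy` are hypothetical, and the threshold value would in practice be preset from the sentence length distribution of the scene.

```python
def classify_text_length(text: str, threshold: int) -> str:
    """Return "long" if the character count exceeds the threshold, else "short"."""
    return "long" if len(text) > threshold else "short"


def choose_strategy(text: str, threshold: int) -> str:
    """Pick the enhancement mode that the embodiment associates with each class."""
    if classify_text_length(text, threshold) == "long":
        return "insert_characters"    # long text: character insertion
    return "synonym_replacement"      # short text: synonym replacement
```

Note that a text whose length equals the threshold is treated as a short text, matching the "not greater than" condition above.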
In this embodiment, the emotion polarities include: positive polarity, neutral polarity, or negative polarity.
The data processing device comprises a processor and a memory, the acquisition unit, the processing unit, the determination unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, a larger-scale data sample is obtained through enhancement processing while semantic changes in the enhancement data are avoided as far as possible, which ensures the quality of the enhancement processing, avoids polluting the original data set, and benefits subsequent model training.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the data processing method described above.
An embodiment of the present invention provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the data processing method as described above when executing the computer program.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both persistent and non-persistent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring first text data, wherein the first text data comprises sensitive words;
inserting characters into the non-sensitive word data of the first text data or performing synonym replacement on the non-sensitive word data of the first text data to obtain second text data;
and if the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data, determining the second text data as the enhancement data of the first text data.
2. The method of claim 1, wherein the inserting characters into the non-sensitive word data of the first text data or performing synonym replacement on the non-sensitive word data of the first text data to obtain second text data comprises:
if the text length of the first text data is a long text, inserting characters into the non-sensitive word data of the first text data to obtain second text data;
or if the text length of the first text data is a short text, performing synonym replacement on non-sensitive word data in the first text data to obtain the second text data.
3. The method of claim 2, wherein said inserting characters into said non-sensitive word data of said first text data to obtain said second text data comprises:
acquiring the average length of the text in the first text data;
determining the number of first characters corresponding to the average length of the text according to a preset first mapping relation;
and inserting characters into the non-sensitive word data of the first text data according to the number of the first characters to obtain the second text data.
4. The method of claim 2, wherein said inserting characters into said non-sensitive word data of said first text data to obtain said second text data comprises:
acquiring a median of the text length in the first text data;
determining the number of second characters corresponding to the median of the text length according to a preset second mapping relation;
and inserting characters into the non-sensitive word data of the first text data according to the number of the second characters to obtain the second text data.
5. The method of claim 2, wherein performing synonym replacement on the non-sensitive word data of the first text data to obtain second text data comprises:
acquiring the length of each text in the first text data;
determining a synonym replacement ratio corresponding to the text according to a preset third mapping relation;
and performing synonym replacement on the non-sensitive word data in the first text data according to the synonym replacement proportion to obtain the second text data.
6. The method according to any one of claims 2-5, further comprising:
acquiring the number of text characters of all texts in the first text data;
if the number of the text characters is larger than a preset threshold value, determining the text length of the first text data as a long text;
and if the number of the text characters is not larger than the preset threshold value, determining the text length of the first text data as a short text.
7. The method of any of claims 1-5, wherein the emotion polarities comprise: positive polarity, neutral polarity, or negative polarity.
8. A data processing apparatus, characterized in that the apparatus comprises:
an acquisition unit, used for acquiring first text data, wherein the first text data comprises sensitive words;
the processing unit is used for inserting characters into the non-sensitive word data of the first text data or performing synonym replacement on the non-sensitive word data of the first text data to obtain second text data;
and the determining unit is used for determining the second text data as the enhancement data of the first text data if the emotion polarity corresponding to the second text data is the same as the emotion polarity corresponding to the first text data.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 7.
10. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, which when executed by the processor implements a data processing method as claimed in any one of claims 1 to 7.
CN202111456715.4A 2021-12-01 2021-12-01 Data processing method and device Active CN114239590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111456715.4A CN114239590B (en) 2021-12-01 2021-12-01 Data processing method and device

Publications (2)

Publication Number Publication Date
CN114239590A true CN114239590A (en) 2022-03-25
CN114239590B CN114239590B (en) 2023-09-19

Family

ID=80752650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111456715.4A Active CN114239590B (en) 2021-12-01 2021-12-01 Data processing method and device

Country Status (1)

Country Link
CN (1) CN114239590B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2594073A1 (en) * 1993-03-19 1994-09-20 Nynex Science & Technology, Inc. Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation
US20120259637A1 (en) * 2011-04-11 2012-10-11 Samsung Electronics Co., Ltd. Method and apparatus for receiving audio
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN105574066A (en) * 2015-10-23 2016-05-11 青岛恒波仪器有限公司 Web page text extraction and comparison method and system thereof
CN106528583A (en) * 2015-11-14 2017-03-22 孙燕群 Method for extracting and comparing web page main body
CN107967337A (en) * 2017-12-05 2018-04-27 云南大学 A kind of cross-cutting sentiment analysis method semantic based on feeling polarities enhancing
CN108460015A (en) * 2018-02-08 2018-08-28 合肥工业大学 Text emotion grouped data enhances analysis method
CN111832283A (en) * 2020-06-19 2020-10-27 上海明略人工智能(集团)有限公司 Text generation method, storage medium and electronic device
CN112183074A (en) * 2020-09-27 2021-01-05 中国建设银行股份有限公司 Data enhancement method, device, equipment and medium
CN112580358A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Text information extraction method, device, storage medium and equipment
CN112580337A (en) * 2020-12-29 2021-03-30 南京航空航天大学 Emotion classification model and emotion classification method based on data enhancement
CN112784041A (en) * 2021-01-06 2021-05-11 河海大学 Chinese short text emotion orientation analysis method
CN112860896A (en) * 2021-03-05 2021-05-28 三一重工股份有限公司 Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN113255365A (en) * 2021-05-28 2021-08-13 湖北师范大学 Text data enhancement method, device and equipment and computer readable storage medium
CN113297842A (en) * 2021-05-25 2021-08-24 湖北师范大学 Text data enhancement method
CN113505202A (en) * 2021-07-30 2021-10-15 中关村科学城城市大脑股份有限公司 Data enhancement method and system based on emotion analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
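Wang Qinglin; Li Han; Pang Liangjian; Xu Xinsheng: "Research on a text emotion enhancement method based on global semantic learning" (基于全局语义学习的文本情感增强方法研究), Science Technology and Engineering (科学技术与工程), no. 21, pages 259 - 265 *
Hu Shengwei; Li Bicheng; Lin Kongjie; Xiong Yao: "MaskAE: an unsupervised short-text emotion transfer method" (MaskAE:基于无监督的短文本情感迁移方法), Journal of Chinese Information Processing (中文信息学报), no. 02, pages 108 - 115 *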

Also Published As

Publication number Publication date
CN114239590B (en) 2023-09-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant