CN114036907A - Text data amplification method based on domain features - Google Patents

Text data amplification method based on domain features

Info

Publication number
CN114036907A
Authority
CN
China
Prior art keywords
text
amplified
word
words
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111371729.6A
Other languages
Chinese (zh)
Other versions
CN114036907B (en)
Inventor
祝和明
王德胜
邓涛
李岩松
孙涛
王存超
梅文哲
赵新冬
郭韬
何泽家
唐锦
崔林
张力
戴威
罗珊珊
刘媛
卢茜
于聪聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202111371729.6A priority Critical patent/CN114036907B/en
Publication of CN114036907A publication Critical patent/CN114036907A/en
Application granted granted Critical
Publication of CN114036907B publication Critical patent/CN114036907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text data amplification method based on domain features, which comprises the following steps: acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts; preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics; acquiring, for each text to be amplified, an amplified text according to one of four amplification methods; and obtaining an amplified professional field data set comprising a plurality of amplified texts. The four disclosed methods preserve the domain characteristics of the text while amplifying the text data, improving the quality of the amplified data and thereby the service quality of AI systems built on the texts.

Description

Text data amplification method based on domain features
Technical Field
The application relates to the technical field of text data amplification, in particular to a text data amplification method based on domain features.
Background
With the rapid development of artificial intelligence technology, expectations for the service quality of artificial intelligence have also risen. Artificial intelligence systems in different domains are generally built by training models on large-scale, high-quality text data sets from the corresponding professional fields, so the quality of the text data directly affects the service quality of the resulting system.
To improve the quality of text data, the data must be amplified. A variety of amplification methods have been proposed in the field of text data amplification at home and abroad, such as back-translation, easy data augmentation (EDA), random noise injection, GAN-based amplification and unsupervised data augmentation. These widely applied methods play an important role in reducing data-acquisition cost, suppressing overfitting and improving model generalization. However, most of them operate at the character level on single sentences, essentially deleting, replacing or repositioning words in the text. In text classification tasks, such character-level processing easily disturbs the words that reflect the domain characteristics of the text and the semantic-structure information that carries those characteristics, so the amplified text cannot represent its domain well and its quality is low.
Disclosure of Invention
In order to solve the problem that the prior art cannot preserve domain characteristics while amplifying text data, the application discloses a text data amplification method based on domain features, which comprises the following steps:
acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts;
preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics;
acquiring, for the text to be amplified, an amplified text;
obtaining an amplified professional field data set, wherein the amplified professional field data set comprises a plurality of amplified texts.
Optionally, acquiring the amplified text for the text to be amplified comprises:
acquiring a word set of the text to be amplified, wherein the word set comprises a plurality of words;
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
constructing a term frequency-inverse document frequency (TF-IDF) model from the professional field data set;
acquiring the term frequency and inverse document frequency of each word in the word set according to the TF-IDF model;
acquiring, for each branch of the dependency syntax tree, the sum of the TF-IDF values of its words;
randomly deleting branches of the dependency syntax tree whose TF-IDF sum is lower than a preset value;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree.
Optionally, after acquiring the sum of the TF-IDF values of each branch in the dependency syntax tree, the method further comprises:
sorting the branches by their TF-IDF sums in descending order.
Optionally, the word set includes stop words, digits and special symbols, and the term frequency and inverse document frequency of the stop words, digits and special symbols are set to 0.
Optionally, acquiring the amplified text for the text to be amplified further comprises:
constructing an LDA model of the professional field data set;
acquiring, according to the LDA model, a topic-document table of the professional field data set, wherein the topic-document table comprises different topics;
acquiring the several most probable topics of the text to be amplified;
computing the cosine similarity between the text to be amplified and each of its most probable topics;
acquiring a target text according to the topic with the highest cosine similarity;
constructing dependency syntax trees of the target text and of the text to be amplified, wherein each dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
exchanging branches that have the same dependency relation between the dependency syntax trees of the target text and the text to be amplified;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
Optionally, before constructing the LDA model of the professional field data set, the method further comprises:
acquiring the perplexity of the professional field data set;
and acquiring the optimal number of topics of the professional field data set.
Optionally, acquiring the amplified text for the text to be amplified further comprises:
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
merging, according to their inclusion relationships, branches of the dependency syntax tree whose length is greater than a preset length;
matching, according to their dependency relations, branches of the dependency syntax tree whose length is greater than the preset length, to obtain a set of candidate branch pairs;
randomly exchanging the branches within pairs of the candidate set;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
Optionally, acquiring the amplified text for the text to be amplified further comprises:
acquiring the word-frequency record of the professional field data set;
training a word-vector model on the professional field data set;
performing word segmentation and part-of-speech tagging on the text to be amplified, wherein the part-of-speech tagging includes tagging of proper nouns;
acquiring a set of words to be replaced, wherein the set comprises a plurality of words that are high-frequency words in the word-frequency record and whose part of speech is a proper noun;
acquiring, from the word-vector model, the nearest (most similar) words of the words to be replaced;
randomly selecting words from the set of words to be replaced and replacing them with their nearest words;
and acquiring an amplified text, wherein the amplified text comprises all words of the text to be amplified after replacement.
The application discloses a text data amplification method based on domain features, which comprises the following steps: acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts; preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics; acquiring, for the text to be amplified, an amplified text; and obtaining an amplified professional field data set comprising a plurality of amplified texts. The four disclosed methods for obtaining amplified text preserve the domain characteristics of the text while amplifying the text data, improving the quality of the amplified data and thereby the service quality of AI systems built on the texts.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings used in the embodiments are briefly described below. It will be apparent to those skilled in the art that other drawings can be derived from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text data amplification method based on domain features disclosed in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a first text data amplification method disclosed in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a second text data amplification method disclosed in an embodiment of the present application;
FIG. 4 is a schematic flow chart of a third text data amplification method disclosed in the embodiments of the present application;
fig. 5 is a schematic flow chart of a fourth text data amplification method disclosed in the embodiment of the present application.
Detailed Description
In order to solve the problem that the prior art cannot preserve domain characteristics while amplifying text data, the application discloses a text data amplification method based on domain features. Referring to the flow chart shown in FIG. 1, the method comprises the following steps:
a professional field data set is obtained, the professional field data set including a plurality of texts.
Each text is preprocessed to obtain a text to be amplified. The preprocessing includes unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics. The purpose of the preprocessing is to store the data in a structured form before amplification and to save the preprocessing results of each text (the word-segmentation results and the word-frequency statistics), so that the same text is not processed repeatedly during amplification and computing resources are not wasted. The preprocessing results are stored in JSON format.
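By way of illustration only (the sketch below is not part of the patent text), the preprocessing stage could look roughly as follows in Python, assuming the jieba tokenizer, a caller-supplied stop-word set and a sample sentence chosen here for demonstration; the function and file names are hypothetical.

```python
import json
from collections import Counter

import jieba  # a common Chinese word-segmentation library; any tokenizer could be substituted

def preprocess(texts, stopwords):
    """Unify format, segment into words, drop stop words and count word frequencies."""
    results = []
    for text in texts:
        text = text.strip().replace("\u3000", " ")        # crude example of format unification
        tokens = [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
        results.append({"text": text,
                        "tokens": tokens,
                        "term_freq": dict(Counter(tokens))})
    return results

# save the preprocessing results so later amplification passes can reuse them
corpus = preprocess(["某公司因合同纠纷被判处违约金。"], stopwords={"被", "因"})
with open("preprocessed.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=2)
```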
For the text to be amplified, the amplified text is acquired. Four methods are used; the first is the feature-cutting amplification method, shown in the flow chart of FIG. 2.
The feature-cutting amplification method comprises the following steps:
and performing word segmentation on the text to be amplified to obtain a word set of the text to be amplified. The set of words includes a plurality of words.
And performing dependency syntax analysis on the text to be amplified to obtain a dependency syntax tree of the text to be amplified. The dependency syntax tree includes a parent node and a child node, and the parent node includes the child node. Each father node and all the child nodes contained in the father node form a branch, each father node and each child node respectively represent a word, and the relationship between the father node and the child nodes represents the dependency relationship between the words.
And constructing a word frequency and reverse file frequency model according to the professional field data set.
And acquiring the word frequency and the reverse file frequency of each word in the word set according to the word frequency and reverse file frequency model. The word set comprises stop words, numbers and special symbols, and the word frequency and the reverse file frequency of the stop words, the numbers and the special symbols are 0.
And acquiring the sum of the word frequency and the reverse file frequency of each branch in the dependency syntax tree.
And arranging the sum of the word frequency and the reverse file frequency of each branch in a descending order.
And randomly deleting branches of the dependency syntax tree, wherein the sum of the word frequency and the reverse file frequency is lower than a preset value.
And acquiring an amplified text, wherein the amplified text comprises words corresponding to all father nodes and child nodes in the dependency syntax tree.
The sum of the word frequency and the reverse file frequency of each branch is calculated to evaluate the importance of each branch, and branches with smaller importance are deleted, so that the purpose of amplification is achieved.
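A minimal sketch of the branch-scoring idea follows, under several assumptions that are mine rather than the patent's: the dependency tree has already been produced by a parser of the reader's choice and is given as a list of branches (word lists), the TF-IDF model comes from scikit-learn, and `corpus` is reused from the preprocessing sketch above.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a TF-IDF model on the whole (pre-tokenized) professional field data set.
docs = [" ".join(d["tokens"]) for d in corpus]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")    # keep single-character tokens too
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

def branch_score(branch_words, doc_index):
    """Sum of the TF-IDF weights of the words in one dependency branch.
    Stop words, digits and symbols absent from the vocabulary contribute 0."""
    row = tfidf[doc_index]
    return sum(row[0, vocab[w]] for w in branch_words if w in vocab)

def feature_cut(branches, doc_index, threshold):
    """Sort branches by importance, then randomly delete those below the preset threshold."""
    scored = sorted(branches, key=lambda b: branch_score(b, doc_index), reverse=True)
    kept = [b for b in scored
            if branch_score(b, doc_index) >= threshold or random.random() < 0.5]
    return [w for b in kept for w in b]                    # words of the surviving branches

# `branches` would come from a dependency parser, e.g. [["判处", "违约金"], ["因", "合同", "纠纷"]]
```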
The second method is the feature-fusion amplification method: a target text with high feature similarity to the text to be amplified is selected from the data set, and features are extracted from both texts and exchanged, thereby realizing amplification; see the flow chart of FIG. 3. The key to feature fusion is screening and recommending texts by similarity and extracting text features. Similar texts are screened using an LDA topic model. The LDA (latent Dirichlet allocation) topic model clusters texts in an unsupervised manner; it is a Bayesian probability model with a three-layer structure of words, documents and topics. The model can predict the topic of each text in the data set and also give the characteristic words contained in each topic. Text screening and recommendation with the LDA topic model is a content-based recommendation method: topics are mined from the data set, and texts with high similarity to the text to be amplified are then selected from the topics to which it belongs, yielding higher-quality recommendations. Text feature extraction analyzes the dependency relations within the text using a dependency syntax tree, thereby obtaining the basic features of the text.
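As a hedged illustration (not the patent's own implementation), an LDA model of the data set could be built with gensim, choosing the number of topics by perplexity as the steps below suggest; `corpus` is reused from the preprocessing sketch and all helper names are assumptions.

```python
from gensim import corpora, models

# Every document is a token list produced by the preprocessing stage.
tokenized_docs = [d["tokens"] for d in corpus]
dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized_docs]

def fit_lda(candidate_topic_counts=(5, 10, 15, 20)):
    """Fit LDA models and keep the topic count with the lowest perplexity."""
    best = None
    for k in candidate_topic_counts:
        lda = models.LdaModel(bow_corpus, num_topics=k, id2word=dictionary,
                              passes=5, random_state=0)
        perplexity = 2 ** (-lda.log_perplexity(bow_corpus))   # gensim returns a per-word bound
        if best is None or perplexity < best[0]:
            best = (perplexity, k, lda)
    return best[1], best[2]                                    # optimal topic number, fitted model

num_topics, lda_model = fit_lda()
# the per-document topic distributions form the topic-document table mentioned above
topic_document_table = [lda_model.get_document_topics(bow) for bow in bow_corpus]
```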
The feature-fusion amplification method comprises the following steps:
The perplexity of the professional field data set is computed.
The optimal number of topics for the professional field data set is determined.
An LDA model of the professional field data set is constructed.
A topic-document table of the professional field data set is obtained from the LDA model; the table comprises the different topics.
The several most probable topics of the text to be amplified are obtained.
The cosine similarity between the text to be amplified and each of its most probable topics is computed.
A target text is acquired according to the topic with the highest cosine similarity.
Dependency syntax trees of the target text and of the text to be amplified are constructed. Each dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes. Each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words.
Branches with the same dependency relation are exchanged between the dependency syntax trees of the target text and the text to be amplified.
An amplified text is acquired, comprising the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
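One plausible reading of the target-text selection step, sketched under the assumption that `lda_model`, `bow_corpus` and `num_topics` come from the previous sketch: documents are compared by the cosine similarity of their LDA topic distributions, and the most similar document that shares one of the to-be-amplified text's strongest topics becomes the target text. The branch exchange itself then follows the same pattern as the transformation sketch shown later.

```python
import numpy as np

def topic_vector(bow):
    """Dense topic-probability vector of one document under the fitted LDA model."""
    vec = np.zeros(num_topics)
    for topic_id, prob in lda_model.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def pick_target_text(doc_index, top_k=3):
    """Return the index of the most similar document sharing a strong topic."""
    src = topic_vector(bow_corpus[doc_index])
    src_topics = set(np.argsort(src)[::-1][:top_k])            # most probable topics
    best, best_sim = None, -1.0
    for j, bow in enumerate(bow_corpus):
        if j == doc_index:
            continue
        cand = topic_vector(bow)
        if not src_topics & set(np.argsort(cand)[::-1][:top_k]):
            continue                                           # no shared strong topic
        sim = float(src @ cand) / (np.linalg.norm(src) * np.linalg.norm(cand) + 1e-12)
        if sim > best_sim:
            best, best_sim = j, sim
    return best
```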
The third method is the feature-transformation amplification method; referring to the flow chart of FIG. 4, it comprises the following steps:
The dependency syntax tree of the text to be amplified is acquired. The dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes. Each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words.
Branches of the dependency syntax tree whose length is greater than a preset length are merged according to their inclusion relationships.
Branches whose length is greater than the preset length are matched according to their dependency relations to obtain a set of candidate branch pairs.
The branches within the candidate pairs are exchanged at random.
An amplified text is acquired, comprising the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
Unlike the feature-cutting and feature-fusion methods, the feature-transformation amplification method does not depend on the data set in which the text is located and performs no feature mining at the data-set scale. Instead, it adjusts the word order within the single text without changing the dependency relations of its sentences, preserving the basic features and semantic information of the text.
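The in-text swapping logic might be sketched as follows, assuming a branch is a (dependency relation, word list) pair produced beforehand by a dependency parser of the reader's choice (e.g. LTP, HanLP or spaCy); the merging-by-inclusion step is omitted for brevity and the example input is purely illustrative.

```python
import random

def feature_transform(branches, min_len=2):
    """Swap one pair of long branches that carry the same dependency relation."""
    long_idx = [i for i, (_, words) in enumerate(branches) if len(words) > min_len]
    # candidate pairs: long branches whose dependency relations match
    pairs = [(i, j) for i in long_idx for j in long_idx
             if i < j and branches[i][0] == branches[j][0]]
    if pairs:
        i, j = random.choice(pairs)                 # randomly exchange one candidate pair
        branches[i], branches[j] = branches[j], branches[i]
    return "".join(w for _, words in branches for w in words)

# purely illustrative input: two attributive branches and one verb-object branch
example = [("ATT", ["合同", "纠纷"]), ("VOB", ["判处", "违约金"]), ("ATT", ["某", "公司"])]
print(feature_transform(example))
```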
The fourth method is the feature-replacement amplification method; referring to the flow chart of FIG. 5, it comprises the following steps:
The word-frequency record of the professional field data set is acquired.
A word-vector model is trained on the professional field data set.
Word segmentation and part-of-speech tagging are performed on the text to be amplified; the part-of-speech tagging includes tagging of proper nouns.
A set of words to be replaced is acquired; the set comprises words that are high-frequency words in the word-frequency record and whose part of speech is a proper noun.
The nearest (most similar) words of the words to be replaced are obtained from the word-vector model.
Words in the set are selected at random and replaced by their nearest words.
An amplified text is acquired, comprising all the words of the text to be amplified after replacement.
An amplified professional field data set is obtained, comprising a plurality of amplified texts. The feature-replacement amplification method depends on the data set in which the text is located, because the word frequencies and the word vectors must be computed from that data set. Taking a judicial document data set as an example, the word-frequency statistics obtained in the preprocessing stage (visualized, for instance, as a word cloud) show that words with higher frequency reflect the domain characteristics of the text well, whereas words with lower frequency are less important and reflect those characteristics poorly.
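A hedged sketch of the replacement step, assuming jieba.posseg for part-of-speech tagging (with its tags nr/ns/nt/nz treated here as proper nouns), a gensim Word2Vec model trained on the tokenized data set, and the `corpus` structure from the preprocessing sketch; thresholds and names are illustrative assumptions, not values from the patent.

```python
import random
import jieba.posseg as pseg
from gensim.models import Word2Vec

# Word-vector model trained on the (pre-tokenized) professional field data set.
w2v = Word2Vec(sentences=[d["tokens"] for d in corpus],
               vector_size=100, window=5, min_count=1, workers=4)

PROPER_NOUN_TAGS = {"nr", "ns", "nt", "nz"}         # jieba tags treated here as proper nouns

def feature_replace(text, term_freq, freq_threshold=5, n_replace=2):
    """Swap a few high-frequency proper nouns for their nearest word-vector neighbours."""
    tokens = [(p.word, p.flag) for p in pseg.cut(text)]
    candidates = [w for w, flag in tokens
                  if flag in PROPER_NOUN_TAGS
                  and term_freq.get(w, 0) >= freq_threshold
                  and w in w2v.wv]
    chosen = random.sample(candidates, min(n_replace, len(candidates)))
    replacements = {w: w2v.wv.most_similar(w, topn=1)[0][0] for w in chosen}
    return "".join(replacements.get(w, w) for w, _ in tokens)
```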
The application discloses a text data amplification method based on domain features, which comprises the following steps: acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts; preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics; acquiring, for the text to be amplified, an amplified text; and obtaining an amplified professional field data set comprising a plurality of amplified texts. The four disclosed methods for obtaining amplified text preserve the domain characteristics of the text while amplifying the text data, improving the quality of the amplified data and thereby the service quality of AI systems built on the texts.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the disclosed embodiments and their implementations without departing from the spirit and scope of the present disclosure, and that such changes fall within the scope of protection of the application, which is defined by the appended claims.

Claims (8)

1. A text data amplification method based on domain features, characterized by comprising the following steps:
acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts;
preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics;
acquiring, for the text to be amplified, an amplified text;
obtaining an amplified professional field data set, wherein the amplified professional field data set comprises a plurality of amplified texts.
2. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified comprises:
acquiring a word set of the text to be amplified, wherein the word set comprises a plurality of words;
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
constructing a term frequency-inverse document frequency (TF-IDF) model from the professional field data set;
acquiring the term frequency and inverse document frequency of each word in the word set according to the TF-IDF model;
acquiring, for each branch of the dependency syntax tree, the sum of the TF-IDF values of its words;
randomly deleting branches of the dependency syntax tree whose TF-IDF sum is lower than a preset value;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree.
3. The text data amplification method based on domain features according to claim 2, wherein after acquiring the sum of the TF-IDF values of each branch in the dependency syntax tree, the method further comprises:
sorting the branches by their TF-IDF sums in descending order.
4. The text data amplification method based on domain features according to claim 2, wherein the word set comprises stop words, digits and special symbols, and the term frequency and inverse document frequency of the stop words, digits and special symbols are 0.
5. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified further comprises:
constructing an LDA model of the professional field data set;
acquiring, according to the LDA model, a topic-document table of the professional field data set, wherein the topic-document table comprises different topics;
acquiring the several most probable topics of the text to be amplified;
computing the cosine similarity between the text to be amplified and each of its most probable topics;
acquiring a target text according to the topic with the highest cosine similarity;
constructing dependency syntax trees of the target text and of the text to be amplified, wherein each dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
exchanging branches that have the same dependency relation between the dependency syntax trees of the target text and the text to be amplified;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
6. The text data amplification method based on domain features according to claim 5, wherein before constructing the LDA model of the professional field data set, the method further comprises:
acquiring the perplexity of the professional field data set;
and acquiring the optimal number of topics of the professional field data set.
7. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified further comprises:
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
merging, according to their inclusion relationships, branches of the dependency syntax tree whose length is greater than a preset length;
matching, according to their dependency relations, branches of the dependency syntax tree whose length is greater than the preset length, to obtain a set of candidate branch pairs;
randomly exchanging the branches within pairs of the candidate set;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
8. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified further comprises:
acquiring the word-frequency record of the professional field data set;
training a word-vector model on the professional field data set;
performing word segmentation and part-of-speech tagging on the text to be amplified, wherein the part-of-speech tagging includes tagging of proper nouns;
acquiring a set of words to be replaced, wherein the set comprises a plurality of words that are high-frequency words in the word-frequency record and whose part of speech is a proper noun;
acquiring, from the word-vector model, the nearest (most similar) words of the words to be replaced;
randomly selecting words from the set of words to be replaced and replacing them with their nearest words;
and acquiring an amplified text, wherein the amplified text comprises all words of the text to be amplified after replacement.
CN202111371729.6A 2021-11-18 2021-11-18 Text data amplification method based on field characteristics Active CN114036907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111371729.6A CN114036907B (en) 2021-11-18 2021-11-18 Text data amplification method based on field characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111371729.6A CN114036907B (en) 2021-11-18 2021-11-18 Text data amplification method based on field characteristics

Publications (2)

Publication Number Publication Date
CN114036907A true CN114036907A (en) 2022-02-11
CN114036907B CN114036907B (en) 2024-06-25

Family

ID=80138117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111371729.6A Active CN114036907B (en) 2021-11-18 2021-11-18 Text data amplification method based on field characteristics

Country Status (1)

Country Link
CN (1) CN114036907B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637824A (en) * 2022-03-18 2022-06-17 马上消费金融股份有限公司 Data enhancement processing method and device
CN114724162A (en) * 2022-03-15 2022-07-08 平安科技(深圳)有限公司 Training method and device of text recognition model, computer equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN109298796A (en) * 2018-07-24 2019-02-01 北京捷通华声科技股份有限公司 A kind of Word association method and device
CN110852095A (en) * 2018-08-02 2020-02-28 中国银联股份有限公司 Statement hot spot extraction method and system
CN111783902A (en) * 2020-07-30 2020-10-16 腾讯科技(深圳)有限公司 Data augmentation and service processing method and device, computer equipment and storage medium
WO2020220539A1 (en) * 2019-04-28 2020-11-05 平安科技(深圳)有限公司 Data increment method and device, computer device and storage medium
CN111930792A (en) * 2020-06-23 2020-11-13 北京大米科技有限公司 Data resource labeling method and device, storage medium and electronic equipment
CN111950729A (en) * 2020-07-19 2020-11-17 中国建设银行股份有限公司 Knowledge base construction method and device, electronic equipment and readable storage device
CN112861739A (en) * 2021-02-10 2021-05-28 中国科学技术大学 End-to-end text recognition method, model training method and device
CN112883193A (en) * 2021-02-25 2021-06-01 中国平安人寿保险股份有限公司 Training method, device and equipment of text classification model and readable medium
CN112906392A (en) * 2021-03-23 2021-06-04 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN112989797A (en) * 2021-03-10 2021-06-18 北京百度网讯科技有限公司 Model training method, text extension method, model training device, text extension device, model training equipment and storage medium
CN113407842A (en) * 2021-06-28 2021-09-17 携程旅游信息技术(上海)有限公司 Model training method, method and system for obtaining theme recommendation reason and electronic equipment

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN109298796A (en) * 2018-07-24 2019-02-01 北京捷通华声科技股份有限公司 A kind of Word association method and device
CN110852095A (en) * 2018-08-02 2020-02-28 中国银联股份有限公司 Statement hot spot extraction method and system
WO2020220539A1 (en) * 2019-04-28 2020-11-05 平安科技(深圳)有限公司 Data increment method and device, computer device and storage medium
CN111930792A (en) * 2020-06-23 2020-11-13 北京大米科技有限公司 Data resource labeling method and device, storage medium and electronic equipment
CN111950729A (en) * 2020-07-19 2020-11-17 中国建设银行股份有限公司 Knowledge base construction method and device, electronic equipment and readable storage device
CN111783902A (en) * 2020-07-30 2020-10-16 腾讯科技(深圳)有限公司 Data augmentation and service processing method and device, computer equipment and storage medium
CN112861739A (en) * 2021-02-10 2021-05-28 中国科学技术大学 End-to-end text recognition method, model training method and device
CN112883193A (en) * 2021-02-25 2021-06-01 中国平安人寿保险股份有限公司 Training method, device and equipment of text classification model and readable medium
CN112989797A (en) * 2021-03-10 2021-06-18 北京百度网讯科技有限公司 Model training method, text extension method, model training device, text extension device, model training equipment and storage medium
CN112906392A (en) * 2021-03-23 2021-06-04 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN113407842A (en) * 2021-06-28 2021-09-17 携程旅游信息技术(上海)有限公司 Model training method, method and system for obtaining theme recommendation reason and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
机器之心PRO: "A long-form survey: adding leverage to your data — progress and applications of text augmentation techniques" (万字长文综述：给你的数据加上杠杆—文本增强技术研究进展及应用), Retrieved from the Internet <URL:https://baijiahao.baidu.com/s?id=1662747620821899959&wfr=spider&for=pc> *
LEI Shuo et al.: "Research on Chinese short text classification based on word-vector feature expansion" (基于词向量特征扩展的中文短文本分类研究), Computer Applications and Software (计算机应用与软件), no. 08, 12 August 2018 (2018-08-12) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724162A (en) * 2022-03-15 2022-07-08 平安科技(深圳)有限公司 Training method and device of text recognition model, computer equipment and storage medium
CN114637824A (en) * 2022-03-18 2022-06-17 马上消费金融股份有限公司 Data enhancement processing method and device
CN114637824B (en) * 2022-03-18 2023-12-01 马上消费金融股份有限公司 Data enhancement processing method and device

Also Published As

Publication number Publication date
CN114036907B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN100595760C (en) Method for gaining oral vocabulary entry, device and input method system thereof
CN110750635B (en) French recommendation method based on joint deep learning model
CN111414479A (en) Label extraction method based on short text clustering technology
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN108052630B (en) Method for extracting expansion words based on Chinese education videos
CN114036907B (en) Text data amplification method based on field characteristics
CN109902289A (en) A kind of news video topic division method towards fuzzy text mining
CN111460162B (en) Text classification method and device, terminal equipment and computer readable storage medium
CN108399157B (en) Dynamic extraction method of entity and attribute relationship, server and readable storage medium
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN101556596A (en) Input method system and intelligent word making method
CN110232127A (en) File classification method and device
CN111353045A (en) Method for constructing text classification system
CN113065349A (en) Named entity recognition method based on conditional random field
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN114792092B (en) Text theme extraction method and device based on semantic enhancement
CN110633468A (en) Information processing method and device for object feature extraction
CN115617981A (en) Information level abstract extraction method for short text of social network
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
KR101240330B1 (en) System and method for mutidimensional document classification
CN113076468A (en) Nested event extraction method based on domain pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant