CN114036907A - Text data amplification method based on domain features - Google Patents

Text data amplification method based on domain features

Info

Publication number
CN114036907A
Authority
CN
China
Prior art keywords
text
amplified
word
words
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111371729.6A
Other languages
Chinese (zh)
Other versions
CN114036907B (en)
Inventor
祝和明
王德胜
邓涛
李岩松
孙涛
王存超
梅文哲
赵新冬
郭韬
何泽家
唐锦
崔林
张力
戴威
罗珊珊
刘媛
卢茜
于聪聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202111371729.6A priority Critical patent/CN114036907B/en
Publication of CN114036907A publication Critical patent/CN114036907A/en
Application granted granted Critical
Publication of CN114036907B publication Critical patent/CN114036907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a text data amplification method based on domain features, which comprises the following steps: acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts; preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics; acquiring, for each text to be amplified, an amplified text according to one of four amplification methods; and obtaining an amplified professional field data set comprising a plurality of amplified texts. The four disclosed methods preserve the domain characteristics of the text while amplifying the text data, improving the quality of the amplified data and thereby the service quality of AI systems built on the texts.

Description

Text data amplification method based on domain features
Technical Field
The application relates to the technical field of text data amplification, in particular to a text data amplification method based on domain features.
Background
With the rapid development of artificial intelligence technology, expectations for the service quality of artificial intelligence have also risen. Artificial intelligence systems in different domains are generally built by training models on large-scale, high-quality text data sets from the corresponding professional fields, so the quality of the text data directly affects the service quality of the resulting system.
To improve the quality of text data, the data must be amplified. A variety of amplification methods have been proposed in the field of text data amplification at home and abroad, such as back-translation, easy data augmentation (EDA), random noise injection, GAN-based amplification and unsupervised data augmentation. These widely applied methods play an important role in reducing data-acquisition cost, suppressing overfitting and improving model generalization. However, most of them operate at the character level on single sentences, essentially deleting, replacing or repositioning words in the text. In text classification tasks, such character-level processing easily disturbs the words that reflect the domain characteristics of the text and the semantic-structure information that carries those characteristics, so the amplified text cannot represent its domain well and its quality is low.
Disclosure of Invention
In order to solve the problem that the prior art cannot preserve domain characteristics while amplifying text data, the application discloses a text data amplification method based on domain features, which comprises the following steps:
acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts;
preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics;
acquiring, for the text to be amplified, an amplified text;
obtaining an amplified professional field data set, wherein the amplified professional field data set comprises a plurality of amplified texts.
Optionally, acquiring the amplified text for the text to be amplified comprises:
acquiring a word set of the text to be amplified, wherein the word set comprises a plurality of words;
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
constructing a term frequency-inverse document frequency (TF-IDF) model from the professional field data set;
acquiring the term frequency and inverse document frequency of each word in the word set according to the TF-IDF model;
acquiring, for each branch of the dependency syntax tree, the sum of the TF-IDF values of its words;
randomly deleting branches of the dependency syntax tree whose TF-IDF sum is lower than a preset value;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree.
Optionally, after acquiring the sum of the TF-IDF values of each branch in the dependency syntax tree, the method further comprises:
sorting the branches by their TF-IDF sums in descending order.
Optionally, the word set includes stop words, digits and special symbols, and the term frequency and inverse document frequency of the stop words, digits and special symbols are set to 0.
Optionally, acquiring the amplified text for the text to be amplified further comprises:
constructing an LDA model of the professional field data set;
acquiring, according to the LDA model, a topic-document table of the professional field data set, wherein the topic-document table comprises different topics;
acquiring the several most probable topics of the text to be amplified;
computing the cosine similarity between the text to be amplified and each of its most probable topics;
acquiring a target text according to the topic with the highest cosine similarity;
constructing dependency syntax trees of the target text and of the text to be amplified, wherein each dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
exchanging branches that have the same dependency relation between the dependency syntax trees of the target text and the text to be amplified;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
Optionally, before constructing the LDA model of the professional field data set, the method further comprises:
acquiring the perplexity of the professional field data set;
and acquiring the optimal number of topics of the professional field data set.
Optionally, acquiring the amplified text for the text to be amplified further comprises:
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
merging, according to their inclusion relationships, branches of the dependency syntax tree whose length is greater than a preset length;
matching, according to their dependency relations, branches of the dependency syntax tree whose length is greater than the preset length, to obtain a set of candidate branch pairs;
randomly exchanging the branches within pairs of the candidate set;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
Optionally, acquiring the amplified text for the text to be amplified further comprises:
acquiring the word-frequency record of the professional field data set;
training a word-vector model on the professional field data set;
performing word segmentation and part-of-speech tagging on the text to be amplified, wherein the part-of-speech tagging includes tagging of proper nouns;
acquiring a set of words to be replaced, wherein the set comprises a plurality of words that are high-frequency words in the word-frequency record and whose part of speech is a proper noun;
acquiring, from the word-vector model, the nearest (most similar) words of the words to be replaced;
randomly selecting words from the set of words to be replaced and replacing them with their nearest words;
and acquiring an amplified text, wherein the amplified text comprises all words of the text to be amplified after replacement.
The application discloses a text data amplification method based on domain features, which comprises the following steps: acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts; preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics; acquiring, for the text to be amplified, an amplified text; and obtaining an amplified professional field data set comprising a plurality of amplified texts. The four disclosed methods for obtaining amplified text preserve the domain characteristics of the text while amplifying the text data, improving the quality of the amplified data and thereby the service quality of AI systems built on the texts.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings used in the embodiments are briefly described below. It will be apparent to those skilled in the art that other drawings can be derived from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a text data amplification method based on domain features disclosed in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a first text data amplification method disclosed in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a second text data amplification method disclosed in an embodiment of the present application;
FIG. 4 is a schematic flow chart of a third text data amplification method disclosed in the embodiments of the present application;
fig. 5 is a schematic flow chart of a fourth text data amplification method disclosed in the embodiment of the present application.
Detailed Description
In order to solve the problem that the prior art cannot preserve domain characteristics while amplifying text data, the application discloses a text data amplification method based on domain features. Referring to the flow chart shown in FIG. 1, the method comprises the following steps:
a professional field data set is obtained, the professional field data set including a plurality of texts.
Each text is preprocessed to obtain a text to be amplified. The preprocessing includes unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics. The purpose of the preprocessing is to store the data in a structured form before amplification and to save the preprocessing results of each text (the word-segmentation results and the word-frequency statistics), so that the same text is not processed repeatedly during amplification and computing resources are not wasted. The preprocessing results are stored in JSON format.
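By way of illustration only (the sketch below is not part of the patent text), the preprocessing stage could look roughly as follows in Python, assuming the jieba tokenizer, a caller-supplied stop-word set and a sample sentence chosen here for demonstration; the function and file names are hypothetical.

```python
import json
from collections import Counter

import jieba  # a common Chinese word-segmentation library; any tokenizer could be substituted

def preprocess(texts, stopwords):
    """Unify format, segment into words, drop stop words and count word frequencies."""
    results = []
    for text in texts:
        text = text.strip().replace("\u3000", " ")        # crude example of format unification
        tokens = [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
        results.append({"text": text,
                        "tokens": tokens,
                        "term_freq": dict(Counter(tokens))})
    return results

# save the preprocessing results so later amplification passes can reuse them
corpus = preprocess(["某公司因合同纠纷被判处违约金。"], stopwords={"被", "因"})
with open("preprocessed.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=2)
```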
For the text to be amplified, the amplified text is acquired. Four methods are used; the first is the feature-cutting amplification method, shown in the flow chart of FIG. 2.
The feature-cutting amplification method comprises the following steps:
and performing word segmentation on the text to be amplified to obtain a word set of the text to be amplified. The set of words includes a plurality of words.
And performing dependency syntax analysis on the text to be amplified to obtain a dependency syntax tree of the text to be amplified. The dependency syntax tree includes a parent node and a child node, and the parent node includes the child node. Each father node and all the child nodes contained in the father node form a branch, each father node and each child node respectively represent a word, and the relationship between the father node and the child nodes represents the dependency relationship between the words.
And constructing a word frequency and reverse file frequency model according to the professional field data set.
And acquiring the word frequency and the reverse file frequency of each word in the word set according to the word frequency and reverse file frequency model. The word set comprises stop words, numbers and special symbols, and the word frequency and the reverse file frequency of the stop words, the numbers and the special symbols are 0.
And acquiring the sum of the word frequency and the reverse file frequency of each branch in the dependency syntax tree.
And arranging the sum of the word frequency and the reverse file frequency of each branch in a descending order.
And randomly deleting branches of the dependency syntax tree, wherein the sum of the word frequency and the reverse file frequency is lower than a preset value.
And acquiring an amplified text, wherein the amplified text comprises words corresponding to all father nodes and child nodes in the dependency syntax tree.
The sum of the word frequency and the reverse file frequency of each branch is calculated to evaluate the importance of each branch, and branches with smaller importance are deleted, so that the purpose of amplification is achieved.
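A minimal sketch of the branch-scoring idea follows, under several assumptions that are mine rather than the patent's: the dependency tree has already been produced by a parser of the reader's choice and is given as a list of branches (word lists), the TF-IDF model comes from scikit-learn, and `corpus` is reused from the preprocessing sketch above.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a TF-IDF model on the whole (pre-tokenized) professional field data set.
docs = [" ".join(d["tokens"]) for d in corpus]
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")    # keep single-character tokens too
tfidf = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

def branch_score(branch_words, doc_index):
    """Sum of the TF-IDF weights of the words in one dependency branch.
    Stop words, digits and symbols absent from the vocabulary contribute 0."""
    row = tfidf[doc_index]
    return sum(row[0, vocab[w]] for w in branch_words if w in vocab)

def feature_cut(branches, doc_index, threshold):
    """Sort branches by importance, then randomly delete those below the preset threshold."""
    scored = sorted(branches, key=lambda b: branch_score(b, doc_index), reverse=True)
    kept = [b for b in scored
            if branch_score(b, doc_index) >= threshold or random.random() < 0.5]
    return [w for b in kept for w in b]                    # words of the surviving branches

# `branches` would come from a dependency parser, e.g. [["判处", "违约金"], ["因", "合同", "纠纷"]]
```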
The second method is the feature-fusion amplification method: a target text with high feature similarity to the text to be amplified is selected from the data set, and features are extracted from both texts and exchanged, thereby realizing amplification; see the flow chart of FIG. 3. The key to feature fusion is screening and recommending texts by similarity and extracting text features. Similar texts are screened using an LDA topic model. The LDA (latent Dirichlet allocation) topic model clusters texts in an unsupervised manner; it is a Bayesian probability model with a three-layer structure of words, documents and topics. The model can predict the topic of each text in the data set and also give the characteristic words contained in each topic. Text screening and recommendation with the LDA topic model is a content-based recommendation method: topics are mined from the data set, and texts with high similarity to the text to be amplified are then selected from the topics to which it belongs, yielding higher-quality recommendations. Text feature extraction analyzes the dependency relations within the text using a dependency syntax tree, thereby obtaining the basic features of the text.
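As a hedged illustration (not the patent's own implementation), an LDA model of the data set could be built with gensim, choosing the number of topics by perplexity as the steps below suggest; `corpus` is reused from the preprocessing sketch and all helper names are assumptions.

```python
from gensim import corpora, models

# Every document is a token list produced by the preprocessing stage.
tokenized_docs = [d["tokens"] for d in corpus]
dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenized_docs]

def fit_lda(candidate_topic_counts=(5, 10, 15, 20)):
    """Fit LDA models and keep the topic count with the lowest perplexity."""
    best = None
    for k in candidate_topic_counts:
        lda = models.LdaModel(bow_corpus, num_topics=k, id2word=dictionary,
                              passes=5, random_state=0)
        perplexity = 2 ** (-lda.log_perplexity(bow_corpus))   # gensim returns a per-word bound
        if best is None or perplexity < best[0]:
            best = (perplexity, k, lda)
    return best[1], best[2]                                    # optimal topic number, fitted model

num_topics, lda_model = fit_lda()
# the per-document topic distributions form the topic-document table mentioned above
topic_document_table = [lda_model.get_document_topics(bow) for bow in bow_corpus]
```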
The feature-fusion amplification method comprises the following steps:
The perplexity of the professional field data set is computed.
The optimal number of topics for the professional field data set is determined.
An LDA model of the professional field data set is constructed.
A topic-document table of the professional field data set is obtained from the LDA model; the table comprises the different topics.
The several most probable topics of the text to be amplified are obtained.
The cosine similarity between the text to be amplified and each of its most probable topics is computed.
A target text is acquired according to the topic with the highest cosine similarity.
Dependency syntax trees of the target text and of the text to be amplified are constructed. Each dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes. Each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words.
Branches with the same dependency relation are exchanged between the dependency syntax trees of the target text and the text to be amplified.
An amplified text is acquired, comprising the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
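One plausible reading of the target-text selection step, sketched under the assumption that `lda_model`, `bow_corpus` and `num_topics` come from the previous sketch: documents are compared by the cosine similarity of their LDA topic distributions, and the most similar document that shares one of the to-be-amplified text's strongest topics becomes the target text. The branch exchange itself then follows the same pattern as the transformation sketch shown later.

```python
import numpy as np

def topic_vector(bow):
    """Dense topic-probability vector of one document under the fitted LDA model."""
    vec = np.zeros(num_topics)
    for topic_id, prob in lda_model.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

def pick_target_text(doc_index, top_k=3):
    """Return the index of the most similar document sharing a strong topic."""
    src = topic_vector(bow_corpus[doc_index])
    src_topics = set(np.argsort(src)[::-1][:top_k])            # most probable topics
    best, best_sim = None, -1.0
    for j, bow in enumerate(bow_corpus):
        if j == doc_index:
            continue
        cand = topic_vector(bow)
        if not src_topics & set(np.argsort(cand)[::-1][:top_k]):
            continue                                           # no shared strong topic
        sim = float(src @ cand) / (np.linalg.norm(src) * np.linalg.norm(cand) + 1e-12)
        if sim > best_sim:
            best, best_sim = j, sim
    return best
```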
The third method is the feature-transformation amplification method; referring to the flow chart of FIG. 4, it comprises the following steps:
The dependency syntax tree of the text to be amplified is acquired. The dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes. Each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words.
Branches of the dependency syntax tree whose length is greater than a preset length are merged according to their inclusion relationships.
Branches whose length is greater than the preset length are matched according to their dependency relations to obtain a set of candidate branch pairs.
The branches within the candidate pairs are exchanged at random.
An amplified text is acquired, comprising the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
Unlike the feature-cutting and feature-fusion methods, the feature-transformation amplification method does not depend on the data set in which the text is located and performs no feature mining at the data-set scale. Instead, it adjusts the word order within the single text without changing the dependency relations of its sentences, preserving the basic features and semantic information of the text.
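The in-text swapping logic might be sketched as follows, assuming a branch is a (dependency relation, word list) pair produced beforehand by a dependency parser of the reader's choice (e.g. LTP, HanLP or spaCy); the merging-by-inclusion step is omitted for brevity and the example input is purely illustrative.

```python
import random

def feature_transform(branches, min_len=2):
    """Swap one pair of long branches that carry the same dependency relation."""
    long_idx = [i for i, (_, words) in enumerate(branches) if len(words) > min_len]
    # candidate pairs: long branches whose dependency relations match
    pairs = [(i, j) for i in long_idx for j in long_idx
             if i < j and branches[i][0] == branches[j][0]]
    if pairs:
        i, j = random.choice(pairs)                 # randomly exchange one candidate pair
        branches[i], branches[j] = branches[j], branches[i]
    return "".join(w for _, words in branches for w in words)

# purely illustrative input: two attributive branches and one verb-object branch
example = [("ATT", ["合同", "纠纷"]), ("VOB", ["判处", "违约金"]), ("ATT", ["某", "公司"])]
print(feature_transform(example))
```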
The fourth method is the feature-replacement amplification method; referring to the flow chart of FIG. 5, it comprises the following steps:
The word-frequency record of the professional field data set is acquired.
A word-vector model is trained on the professional field data set.
Word segmentation and part-of-speech tagging are performed on the text to be amplified; the part-of-speech tagging includes tagging of proper nouns.
A set of words to be replaced is acquired; the set comprises words that are high-frequency words in the word-frequency record and whose part of speech is a proper noun.
The nearest (most similar) words of the words to be replaced are obtained from the word-vector model.
Words in the set are selected at random and replaced by their nearest words.
An amplified text is acquired, comprising all the words of the text to be amplified after replacement.
An amplified professional field data set is obtained, comprising a plurality of amplified texts. The feature-replacement amplification method depends on the data set in which the text is located, because the word frequencies and the word vectors must be computed from that data set. Taking a judicial document data set as an example, the word-frequency statistics obtained in the preprocessing stage (visualized, for instance, as a word cloud) show that words with higher frequency reflect the domain characteristics of the text well, whereas words with lower frequency are less important and reflect those characteristics poorly.
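A hedged sketch of the replacement step, assuming jieba.posseg for part-of-speech tagging (with its tags nr/ns/nt/nz treated here as proper nouns), a gensim Word2Vec model trained on the tokenized data set, and the `corpus` structure from the preprocessing sketch; thresholds and names are illustrative assumptions, not values from the patent.

```python
import random
import jieba.posseg as pseg
from gensim.models import Word2Vec

# Word-vector model trained on the (pre-tokenized) professional field data set.
w2v = Word2Vec(sentences=[d["tokens"] for d in corpus],
               vector_size=100, window=5, min_count=1, workers=4)

PROPER_NOUN_TAGS = {"nr", "ns", "nt", "nz"}         # jieba tags treated here as proper nouns

def feature_replace(text, term_freq, freq_threshold=5, n_replace=2):
    """Swap a few high-frequency proper nouns for their nearest word-vector neighbours."""
    tokens = [(p.word, p.flag) for p in pseg.cut(text)]
    candidates = [w for w, flag in tokens
                  if flag in PROPER_NOUN_TAGS
                  and term_freq.get(w, 0) >= freq_threshold
                  and w in w2v.wv]
    chosen = random.sample(candidates, min(n_replace, len(candidates)))
    replacements = {w: w2v.wv.most_similar(w, topn=1)[0][0] for w in chosen}
    return "".join(replacements.get(w, w) for w, _ in tokens)
```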
The application discloses a text data amplification method based on domain features, which comprises the following steps: acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts; preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics; acquiring, for the text to be amplified, an amplified text; and obtaining an amplified professional field data set comprising a plurality of amplified texts. The four disclosed methods for obtaining amplified text preserve the domain characteristics of the text while amplifying the text data, improving the quality of the amplified data and thereby the service quality of AI systems built on the texts.
The present application has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to limit the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the disclosed embodiments and their implementations without departing from the spirit and scope of the present disclosure, and that such changes fall within the scope of protection of the application, which is defined by the appended claims.

Claims (8)

1. A text data amplification method based on domain features, characterized by comprising the following steps:
acquiring a professional field data set, wherein the professional field data set comprises a plurality of texts;
preprocessing each text to obtain a text to be amplified, wherein the preprocessing comprises unifying the text format, segmenting the text into words, removing stop words and computing word-frequency statistics;
acquiring, for the text to be amplified, an amplified text;
obtaining an amplified professional field data set, wherein the amplified professional field data set comprises a plurality of amplified texts.
2. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified comprises:
acquiring a word set of the text to be amplified, wherein the word set comprises a plurality of words;
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
constructing a term frequency-inverse document frequency (TF-IDF) model from the professional field data set;
acquiring the term frequency and inverse document frequency of each word in the word set according to the TF-IDF model;
acquiring, for each branch of the dependency syntax tree, the sum of the TF-IDF values of its words;
randomly deleting branches of the dependency syntax tree whose TF-IDF sum is lower than a preset value;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree.
3. The text data amplification method based on domain features according to claim 2, wherein after acquiring the sum of the TF-IDF values of each branch in the dependency syntax tree, the method further comprises:
sorting the branches by their TF-IDF sums in descending order.
4. The text data amplification method based on domain features according to claim 2, wherein the word set comprises stop words, digits and special symbols, and the term frequency and inverse document frequency of the stop words, digits and special symbols are 0.
5. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified further comprises:
constructing an LDA model of the professional field data set;
acquiring, according to the LDA model, a topic-document table of the professional field data set, wherein the topic-document table comprises different topics;
acquiring the several most probable topics of the text to be amplified;
computing the cosine similarity between the text to be amplified and each of its most probable topics;
acquiring a target text according to the topic with the highest cosine similarity;
constructing dependency syntax trees of the target text and of the text to be amplified, wherein each dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
exchanging branches that have the same dependency relation between the dependency syntax trees of the target text and the text to be amplified;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
6. The text data amplification method based on domain features according to claim 5, wherein before constructing the LDA model of the professional field data set, the method further comprises:
acquiring the perplexity of the professional field data set;
and acquiring the optimal number of topics of the professional field data set.
7. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified further comprises:
acquiring a dependency syntax tree of the text to be amplified, wherein the dependency syntax tree comprises parent nodes and child nodes, each parent node containing its child nodes; each parent node together with all of its child nodes forms a branch, each parent node and each child node represents a word, and the relationship between a parent node and its child nodes represents the dependency relation between the corresponding words;
merging, according to their inclusion relationships, branches of the dependency syntax tree whose length is greater than a preset length;
matching, according to their dependency relations, branches of the dependency syntax tree whose length is greater than the preset length, to obtain a set of candidate branch pairs;
randomly exchanging the branches within pairs of the candidate set;
and acquiring an amplified text, wherein the amplified text comprises the words corresponding to all parent nodes and child nodes in the dependency syntax tree of the text to be amplified.
8. The text data amplification method based on domain features according to claim 1, wherein acquiring the amplified text for the text to be amplified further comprises:
acquiring the word-frequency record of the professional field data set;
training a word-vector model on the professional field data set;
performing word segmentation and part-of-speech tagging on the text to be amplified, wherein the part-of-speech tagging includes tagging of proper nouns;
acquiring a set of words to be replaced, wherein the set comprises a plurality of words that are high-frequency words in the word-frequency record and whose part of speech is a proper noun;
acquiring, from the word-vector model, the nearest (most similar) words of the words to be replaced;
randomly selecting words from the set of words to be replaced and replacing them with their nearest words;
and acquiring an amplified text, wherein the amplified text comprises all words of the text to be amplified after replacement.
CN202111371729.6A 2021-11-18 2021-11-18 Text data amplification method based on field characteristics Active CN114036907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111371729.6A CN114036907B (en) 2021-11-18 2021-11-18 Text data amplification method based on field characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111371729.6A CN114036907B (en) 2021-11-18 2021-11-18 Text data amplification method based on field characteristics

Publications (2)

Publication Number Publication Date
CN114036907A true CN114036907A (en) 2022-02-11
CN114036907B CN114036907B (en) 2024-06-25

Family

ID=80138117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111371729.6A Active CN114036907B (en) 2021-11-18 2021-11-18 Text data amplification method based on field characteristics

Country Status (1)

Country Link
CN (1) CN114036907B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637824A (en) * 2022-03-18 2022-06-17 马上消费金融股份有限公司 Data enhancement processing method and device
CN114724162A (en) * 2022-03-15 2022-07-08 平安科技(深圳)有限公司 Training method and device of text recognition model, computer equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN109298796A (en) * 2018-07-24 2019-02-01 北京捷通华声科技股份有限公司 A kind of Word association method and device
CN110852095A (en) * 2018-08-02 2020-02-28 中国银联股份有限公司 Statement hot spot extraction method and system
CN111783902A (en) * 2020-07-30 2020-10-16 腾讯科技(深圳)有限公司 Data augmentation and service processing method and device, computer equipment and storage medium
WO2020220539A1 (en) * 2019-04-28 2020-11-05 平安科技(深圳)有限公司 Data increment method and device, computer device and storage medium
CN111930792A (en) * 2020-06-23 2020-11-13 北京大米科技有限公司 Data resource labeling method and device, storage medium and electronic equipment
CN111950729A (en) * 2020-07-19 2020-11-17 中国建设银行股份有限公司 Knowledge base construction method and device, electronic equipment and readable storage device
CN112861739A (en) * 2021-02-10 2021-05-28 中国科学技术大学 End-to-end text recognition method, model training method and device
CN112883193A (en) * 2021-02-25 2021-06-01 中国平安人寿保险股份有限公司 Training method, device and equipment of text classification model and readable medium
CN112906392A (en) * 2021-03-23 2021-06-04 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN112989797A (en) * 2021-03-10 2021-06-18 北京百度网讯科技有限公司 Model training method, text extension method, model training device, text extension device, model training equipment and storage medium
CN113407842A (en) * 2021-06-28 2021-09-17 携程旅游信息技术(上海)有限公司 Model training method, method and system for obtaining theme recommendation reason and electronic equipment

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN107704892A (en) * 2017-11-07 2018-02-16 宁波爱信诺航天信息有限公司 A kind of commodity code sorting technique and system based on Bayesian model
CN109298796A (en) * 2018-07-24 2019-02-01 北京捷通华声科技股份有限公司 A kind of Word association method and device
CN110852095A (en) * 2018-08-02 2020-02-28 中国银联股份有限公司 Statement hot spot extraction method and system
WO2020220539A1 (en) * 2019-04-28 2020-11-05 平安科技(深圳)有限公司 Data increment method and device, computer device and storage medium
CN111930792A (en) * 2020-06-23 2020-11-13 北京大米科技有限公司 Data resource labeling method and device, storage medium and electronic equipment
CN111950729A (en) * 2020-07-19 2020-11-17 中国建设银行股份有限公司 Knowledge base construction method and device, electronic equipment and readable storage device
CN111783902A (en) * 2020-07-30 2020-10-16 腾讯科技(深圳)有限公司 Data augmentation and service processing method and device, computer equipment and storage medium
CN112861739A (en) * 2021-02-10 2021-05-28 中国科学技术大学 End-to-end text recognition method, model training method and device
CN112883193A (en) * 2021-02-25 2021-06-01 中国平安人寿保险股份有限公司 Training method, device and equipment of text classification model and readable medium
CN112989797A (en) * 2021-03-10 2021-06-18 北京百度网讯科技有限公司 Model training method, text extension method, model training device, text extension device, model training equipment and storage medium
CN112906392A (en) * 2021-03-23 2021-06-04 北京天融信网络安全技术有限公司 Text enhancement method, text classification method and related device
CN113407842A (en) * 2021-06-28 2021-09-17 携程旅游信息技术(上海)有限公司 Model training method, method and system for obtaining theme recommendation reason and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
机器之心PRO: "A long-form survey: adding leverage to your data — progress and applications of text augmentation techniques" (万字长文综述：给你的数据加上杠杆—文本增强技术研究进展及应用), Retrieved from the Internet <URL:https://baijiahao.baidu.com/s?id=1662747620821899959&wfr=spider&for=pc> *
LEI Shuo et al.: "Research on Chinese short text classification based on word-vector feature expansion" (基于词向量特征扩展的中文短文本分类研究), Computer Applications and Software (计算机应用与软件), no. 08, 12 August 2018 (2018-08-12) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724162A (en) * 2022-03-15 2022-07-08 平安科技(深圳)有限公司 Training method and device of text recognition model, computer equipment and storage medium
CN114637824A (en) * 2022-03-18 2022-06-17 马上消费金融股份有限公司 Data enhancement processing method and device
CN114637824B (en) * 2022-03-18 2023-12-01 马上消费金融股份有限公司 Data enhancement processing method and device

Also Published As

Publication number Publication date
CN114036907B (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN100595760C (en) Method for gaining oral vocabulary entry, device and input method system thereof
CN110750635B (en) French recommendation method based on joint deep learning model
CN111414479A (en) Label extraction method based on short text clustering technology
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN108052630B (en) Method for extracting expansion words based on Chinese education videos
CN114036907B (en) Text data amplification method based on field characteristics
CN109902289A (en) A kind of news video topic division method towards fuzzy text mining
CN111460162B (en) Text classification method and device, terminal equipment and computer readable storage medium
CN108399157B (en) Dynamic extraction method of entity and attribute relationship, server and readable storage medium
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN101556596A (en) Input method system and intelligent word making method
CN110232127A (en) File classification method and device
CN111353045A (en) Method for constructing text classification system
CN113065349A (en) Named entity recognition method based on conditional random field
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN114792092B (en) Text theme extraction method and device based on semantic enhancement
CN110633468A (en) Information processing method and device for object feature extraction
CN115617981A (en) Information level abstract extraction method for short text of social network
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
KR101240330B1 (en) System and method for mutidimensional document classification
CN113076468A (en) Nested event extraction method based on domain pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant