CN114595327A - Data enhancement method and device, electronic equipment and storage medium - Google Patents

Data enhancement method and device, electronic equipment and storage medium

Info

Publication number
CN114595327A
CN114595327A (application CN202210163920.XA)
Authority
CN
China
Prior art keywords
word
topic
replaced
words
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210163920.XA
Other languages
Chinese (zh)
Inventor
陶清
王彦
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210163920.XA
Priority to PCT/CN2022/090666 (WO2023159758A1)
Publication of CN114595327A
Legal status: Pending

Classifications

    • G06F16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F16/3338: Query processing; query translation; query expansion
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/3346: Query execution using a probabilistic model
    • G06F18/214: Pattern recognition; analysing; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N5/02: Computing arrangements using knowledge-based models; knowledge representation; symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a data enhancement method and device, an electronic device and a storage medium, relating to the technical field of artificial intelligence. The data enhancement method comprises the following steps: obtaining an original text sample; inputting the original text sample into a pre-trained topic model; calculating the contribution value of each topic word in each sentence to the text sentence; obtaining a set of words to be replaced according to the contribution values of the topic words to the text sentence; selecting candidate words from a pre-trained word vector set; and finally replacing the words to be replaced with the candidate words to obtain a data-enhanced text sample. The topic model yields the topic distribution probability information corresponding to each sentence in the original text sample, so the contribution of each word to the sentence topic is well measured and data enhancement is completed without disturbing the sentence topic distribution. Meanwhile, the pre-trained word vectors allow a word with a meaning similar to that of the word to be replaced to be selected as the replacement word, preserving the semantic information of the sentence to the maximum extent.

Description

Data enhancement method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular, to a data enhancement method and apparatus, an electronic device, and a storage medium.
Background
Data enhancement is a data processing method generally applied in fields such as image processing and voice processing. In the field of image processing, for example, researchers usually apply operations such as rotation and cropping to picture data in order to enhance the data and enrich the data samples. Unlike the image and speech processing fields, the enhancement of text data cannot simply use text conversion operations such as sequence swapping or discarding partial words, because the word order in text forms strict syntax and semantics, and these simple operations cause loss of text semantic information.
In the related art, the data enhancement methods used in the text classification task mainly include: a data enhancement method based on a synonym table, a data enhancement method based on back-translation, and a data enhancement method based on pre-trained word vectors. However, the synonym-table method cannot effectively adapt to texts of a specific field; the back-translation method requires an additional translation model, and large bilingual corpora are severely lacking in specific fields; and when pre-trained word vectors are used for data enhancement, the problem of how to select suitable words for replacement remains.
Disclosure of Invention
The embodiment of the invention mainly aims to provide a data enhancement method and device, electronic equipment and a storage medium, which can improve the accuracy of text sample data enhancement and expand the application range of the text sample data enhancement.
In order to achieve the above object, a first aspect of an embodiment of the present invention provides a data enhancement method, including:
obtaining an original text sample to be enhanced; wherein the original text sample comprises at least one text sentence and at least one subject word;
inputting the original text sample into a topic model obtained by pre-training to obtain topic distribution probability information corresponding to each text sentence, wherein the topic model is a latent Dirichlet allocation (LDA) topic model;
calculating the contribution value of each topic word to the text sentence according to the topic distribution probability information;
calculating to obtain the replacement probability of the subject word according to the contribution value of the subject word, and selecting a word to be replaced from the text sentence according to the replacement probability to obtain a word set to be replaced;
screening candidate words from a word vector set obtained through pre-training according to the words to be replaced in the word set to be replaced;
and replacing the word to be replaced by the candidate word to obtain a data enhanced text sample.
In some embodiments, before inputting the original text sample to the pre-trained topic model, the method further comprises:
acquiring a training sample set of a preset field, wherein the training sample set comprises unlabeled training text samples and corresponding probability labels;
inputting the training text sample into an initial topic model, and obtaining the predicted topic distribution probability of the training text sample according to the preset number of topics;
calculating to obtain a loss value according to the predicted topic distribution probability and the corresponding probability label;
and adjusting the model weight of the initial topic model by using the loss function according to the loss value until the loss function meets the convergence condition, and training to obtain the topic model.
In some embodiments, the topic distribution probability information comprises the topic distribution probability of the topic word and the topic distribution probability of the text sentence, and the calculating of the contribution value of each topic word to the text sentence according to the topic distribution probability information comprises the following steps:
calculating the topic distribution probability of the text sentence according to a first formula;
calculating the topic distribution probability of the topic words;
multiplying a preset smoothing parameter, the topic distribution probability of the topic word and the topic distribution probability of the text sentence to obtain the contribution value;
wherein the first formula is:

$$p(t \mid s) = \frac{1}{N} \sum_{i=1}^{N} p(t \mid \omega_i)$$

wherein $\omega_i$ denotes a subject term, $s = (\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_N)$ represents a text sentence containing N subject words, $p(t \mid \omega_i)$ denotes the topic distribution probability of the topic word, and $p(t \mid s)$ denotes the topic distribution probability of the text sentence.
In some embodiments, the obtaining, by calculation, a replacement probability of the subject word according to the contribution value of the subject word, and selecting a word to be replaced from the text sentence according to the replacement probability to obtain a set of words to be replaced includes:
calculating the replacement probability of the subject word according to the contribution value of the subject word to the text sentence;
sampling according to the preset number of replacement words and the replacement probability to obtain the words to be replaced;
and forming the word set to be replaced by using the words to be replaced.
In some embodiments, the calculating a replacement probability of the subject word according to the contribution value of the subject word to the text sentence includes:
calculating the maximum contribution value of all subject words in the text sentence;
calculating the difference between the contribution value of each subject term and the maximum contribution value, and summing all the differences to obtain the sum of the contribution values;
and calculating the ratio of each difference value to the sum of the contribution values to obtain the replacement probability of the subject term.
In some embodiments, before the filtering to obtain the candidate word from the pre-trained word vector set according to the word to be replaced in the word set to be replaced, the method further includes:
acquiring a training text sample of a preset field;
training the training text sample by using a Word2vec tool to obtain a pre-training Word vector;
and forming the word vector set by using the pre-training word vectors.
In some embodiments, the screening, according to the word to be replaced in the word set to be replaced, from a word vector set obtained by pre-training to obtain a candidate word includes:
calculating the distance between the word to be replaced in the word set to be replaced and the pre-training word vector in the word vector set in the vector space;
sorting the distances to obtain a distance sorting result;
and selecting a preset number of words from the word vector set as the candidate words according to the distance sorting result, wherein the position distribution of the candidate words in the word vector set obeys geometric distribution.
In order to achieve the above object, a second aspect of the present invention provides a text sample data enhancement apparatus, including:
the sample acquisition module is used for acquiring an original text sample to be enhanced; wherein the original text sample comprises at least one text sentence, the original text sample comprises at least one subject word;
the topic distribution probability calculation module is used for inputting the original text sample into a topic model obtained by pre-training to obtain topic distribution probability information corresponding to each text sentence, and the topic model is a latent Dirichlet allocation topic model;
the contribution value calculating module is used for calculating the contribution value of each topic word to the text sentence according to the topic distribution probability information;
the to-be-replaced word selection module is used for calculating the replacement probability of the subject word according to the contribution value of the subject word, and selecting the to-be-replaced word from the text sentence according to the replacement probability to obtain a to-be-replaced word set;
the candidate word selection module is used for screening a word vector set obtained by pre-training according to the words to be replaced in the word set to be replaced to obtain candidate words;
and the data enhancement module is used for replacing the word to be replaced by the candidate word to obtain a data enhancement text sample.
To achieve the above object, a third aspect of the present invention provides an electronic apparatus comprising:
at least one memory;
at least one processor;
at least one program;
the programs are stored in a memory and a processor executes the at least one program to implement the method of the invention as described in the above first aspect.
To achieve the above object, a fourth aspect of the present invention proposes a storage medium which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute:
a method as described in the first aspect above.
The data enhancement method and device, the electronic device and the storage medium provided by the embodiments of the invention obtain an original text sample and input it into a pre-trained topic model to obtain the topic distribution probability information corresponding to each sentence in the original text sample, the topic model being a latent Dirichlet allocation (LDA) topic model. The contribution value of each topic word in each sentence to the text sentence is calculated according to the topic distribution probability information; the replacement probability of each subject word is then calculated according to its contribution value to the text sentence, and the words to be replaced in the text sentence are selected according to the replacement probability to obtain a set of words to be replaced. Words similar to the words to be replaced are then selected from the pre-trained word vector set as candidate words, and finally the words to be replaced are replaced by the candidate words to obtain a data-enhanced text sample. In this embodiment, the topic distribution probability information corresponding to each sentence in the original text sample is obtained using the topic model, so the contribution of each word in a sentence to the sentence topic is well measured and data enhancement is completed without affecting the sentence topic distribution. Meanwhile, with the help of the pre-trained word vectors, a word with a meaning similar to that of the word to be replaced can be selected as the replacement word, preserving the semantic information of the sentence to the maximum extent.
Drawings
Fig. 1 is a flowchart of a data enhancement method according to an embodiment of the present invention.
Fig. 2 is a partial flowchart of a data enhancement method according to another embodiment of the present invention.
Fig. 3 is a partial flowchart of a data enhancement method according to another embodiment of the present invention.
Fig. 4 is a partial flowchart of a data enhancement method according to another embodiment of the present invention.
Fig. 5 is a partial flowchart of a data enhancement method according to another embodiment of the present invention.
Fig. 6 is a flowchart of a data enhancement method according to another embodiment of the present invention.
Fig. 7 is a block diagram illustrating a structure of a text sample data enhancement apparatus according to yet another embodiment of the present invention.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
First, several terms related to the present invention are explained:
Latent Dirichlet Allocation (LDA) model: the LDA model is an unsupervised machine learning technique belonging to the models of text semantic analysis, and is used to infer the topic distribution of documents. It gives the topics of each document in a document set in the form of a probability distribution, so that after topic distributions are extracted by analyzing some documents, topic clustering or text classification can be carried out according to those distributions. The LDA model is a bag-of-words model: it only considers whether a word appears in a document, not the order in which it appears; for example, "I like you" and "you like me" are equivalent in the bag-of-words model. It assumes independence between documents and independence between the words in a document.
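A minimal Python illustration of this bag-of-words property; the tokenized sentences below are toy examples, not drawn from the patent corpus:

```python
from collections import Counter

# Under the bag-of-words assumption, only word counts matter, not order:
# the tokenizations of "I like you" and "you like me" below share the
# same multiset of tokens and are therefore indistinguishable.
sent_a = ["我", "喜欢", "你"]  # "I like you"
sent_b = ["你", "喜欢", "我"]  # "you like me"

print(Counter(sent_a) == Counter(sent_b))  # True: identical count vectors
```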
Natural Language Processing (NLP): the method is an important direction in the fields of computer science and artificial intelligence, various theories and methods for realizing effective communication between people and computers by natural language are researched, and natural language processing is a science integrating linguistics, computer science and mathematics into a whole. In short, the computer accepts the input of the user in the form of natural language, and internally processes a series of operations such as processing, calculation and the like through an algorithm defined by human beings so as to simulate the understanding of the natural language by the human beings and return the result expected by the user. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question and answer, knowledge mapping, and the like.
Gensim tool: an open-source third-party Python toolkit used for unsupervised learning of topic vector representations of a text's hidden layers from raw unstructured text. It supports multiple topic model algorithms, including TF-IDF, LSA, LDA and word2vec, supports streaming training, and provides API interfaces for common tasks such as similarity calculation and information retrieval.
Word2vec: Word2Vec is one of the word embedding approaches. It belongs to the NLP field, is a kind of language model, and is a tool for generating word vectors; it learns semantic knowledge in an unsupervised manner from large text corpora and is widely applied in natural language processing. Word embedding transforms an uncomputable, unstructured word into a computable, structured vector.
Gibbs sampling: a Markov chain Monte Carlo algorithm, often used to solve a series of problems including matrix decomposition and tensor decomposition, also called alternating conditional sampling. "Alternating" means that Gibbs sampling is an iterative algorithm that alternates among the variables during iteration; "conditional" is used because the core of Gibbs sampling is Bayesian theory: taking the observed values as conditions, it infers the posterior distribution around prior knowledge and observed data.
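As a sketch of this "alternating conditional sampling" idea, the following toy Gibbs sampler draws from a bivariate normal by alternating the two exact conditional distributions; the target distribution and all parameter values are illustrative assumptions, not part of the patent:

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, steps=10000):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each iteration alternately draws x | y and then y | x from the exact
    conditionals N(rho * other, 1 - rho**2) -- the "alternating" and
    "conditional" aspects described above.
    """
    x = y = 0.0
    sd = math.sqrt(1.0 - rho ** 2)
    samples = []
    for _ in range(steps):
        x = random.gauss(rho * y, sd)  # condition on the current y
        y = random.gauss(rho * x, sd)  # condition on the freshly drawn x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal()
print(sum(s[0] for s in samples) / len(samples))  # empirical mean, near 0
```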
Data enhancement is a commonly used data processing method, widely used in the image and voice processing fields. For example, in the image field, researchers often process image data with simple operations such as rotation and cropping to enhance the data and enrich the data samples, and such data enhancement has been shown to effectively improve the generalization capability of a model on test data. Unlike the image and speech fields, text data cannot be enhanced with simple text conversions such as sequence exchange or discarding partial words, because the word order in text forms strict syntax and semantics, and these simple operations cause loss of text semantic information.
The best data enhancement method in the text domain is to rewrite sentences manually, but this is impractical and costly given the magnitude of the data sets. In the related art, the data enhancement methods used in text classification tasks mainly include: data enhancement based on a synonym table, data enhancement based on back-translation (for example, translating Chinese into English and then translating the English back into Chinese), and data enhancement based on pre-trained word vectors. However, the synonym-table method generally performs word replacement with an existing public, general-purpose word table and cannot effectively adapt to text of a specific field, such as the financial or medical field, while building a domain synonym dictionary from scratch is very costly. The back-translation method requires an additional translation model, and large bilingual corpora are severely lacking in specific fields. Data enhancement with pre-trained word vectors is a compromise between the first two methods, because it can exploit unlabeled text of the specific field; but it raises the question of which words in the text to replace, since the selected words must minimally affect the semantics of the sentence, otherwise the effect of the classification model cannot be effectively improved.
Based on this, embodiments of the present invention provide a data enhancement method and apparatus, an electronic device, and a storage medium, where a topic model is used to obtain topic distribution probability information corresponding to each sentence in an original text sample, so as to well measure a contribution value of each word in a sentence to a text sentence topic, and ensure that data enhancement is completed without affecting sentence topic distribution, and meanwhile, with the help of a pre-training word vector, a word with a meaning similar to that of a word to be replaced can be selected as a replacement word, thereby ensuring semantic information of the sentence to the maximum extent.
Embodiments of the present invention provide a data enhancement method and apparatus, an electronic device, and a storage medium, which are specifically described with reference to the following embodiments, and first describe a data enhancement method in an embodiment of the present invention.
The embodiment of the invention can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the invention provides a data enhancement method, which relates to the technical field of artificial intelligence, in particular to the technical field of data mining. The data enhancement method provided by the embodiment of the invention can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, smart watch, or the like; the server can be an independent server, and can also be a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and artificial intelligence platform and the like; the software may be an application or the like implementing the data enhancement method, but is not limited to the above form.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an alternative flowchart of a data enhancement method provided in an embodiment of the present invention, and the method in fig. 1 may include, but is not limited to, steps S110 to S160.
Step S110, an original text sample to be enhanced is obtained.
The text classification task is a basic task in natural language processing, and some deep learning models (such as CNN, RNN, Transformer and the like) currently perform well on classification tasks such as news classification and sentiment analysis. However, these deep models require large-scale, high-quality annotated data, which is often difficult to obtain or expensive to annotate in real business scenarios.
In one embodiment, the original text sample includes at least one text sentence and at least one subject word; a sentence may or may not contain different subject words. The original text sample is a labeled sample in a preset field, where the preset field can be selected according to actual requirements, and the labeled text samples are enhanced to obtain large-scale enhanced labeled texts related to the preset field. The labeling information in the text samples of the preset field can be obtained by automatic labeling based on a semi-supervised classification model or by manual labeling; since the accuracy of manually labeled category labels is higher, manual labeling can improve the labeling precision of the labeled text.
In an embodiment, after the original text sample is obtained, text preprocessing is further performed on the original text sample to obtain a preprocessed text corpus, where the text preprocessing includes, but is not limited to: removing illegal characters, removing stop words, removing redundant words, word segmentation, etc.
In one embodiment, the original text sample may be segmented using a preset dictionary, and stop words in each sample text in the sample set may be removed. The preset dictionary may specifically be a custom dictionary corresponding to the preset field and may include a plurality of pre-defined segmented words. When performing word segmentation on the original text sample, the custom dictionary is used for matching: segments matching entries in the custom dictionary are obtained from the original text sample, so that the original text sample is decomposed into a plurality of segments matching the custom dictionary, improving the accuracy of segmenting the original text sample.
In addition, in order to improve search efficiency in information retrieval, some characters or words are automatically filtered before or after processing natural language data (or text); these characters or words are called stop words and can be roughly divided into two categories. One is words that are used widely, even too frequently, such as "I" and "you"; the other is words with little practical meaning in the text, including auxiliary words, adverbs, prepositions, conjunctions and the like, which generally have no definite meaning on their own and only serve a function inside a complete sentence, such as the common words "in", "and" and "next". Therefore, in this embodiment, after word segmentation of the original text sample, nonsense tokens such as modal particles, interjections or illegal characters can be removed by removing stop words. By performing feature analysis on a large number of text samples in the preset field, importance values of different candidate words are calculated, for example by tf-idf or information gain; a stop word set for the preset field is generated according to the importance values and merged with stop words of the general field to obtain the final stop word set. Stop word removal in this embodiment deletes the words belonging to the stop word set from the word sequence after word segmentation.
In one embodiment, redundant words can be removed after the original text sample is segmented. Redundant word filtering removes semantically repeated words: each word obtained by segmenting the Chinese text is matched against a preset semantic template, hypernym and hyponym words appearing in the same sentence are identified through the matched semantic template, and the hypernym is identified as semantically redundant and filtered out. The semantic template for redundant words is not specifically limited here and may be computed using any conventional or well-known method in the field, as long as it can be applied to the invention. A minimal sketch of this preprocessing pipeline follows.
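The sketch below assumes Python with the jieba segmenter; the dictionary and stop-word file names are hypothetical placeholders, and redundant-word filtering is omitted because the semantic template is not specified:

```python
import re

import jieba

# Hypothetical domain resources; the file names are placeholders.
jieba.load_userdict("domain_userdict.txt")           # custom domain dictionary
with open("stopwords.txt", encoding="utf-8") as f:   # merged domain + general stop words
    STOP_WORDS = {line.strip() for line in f}

def preprocess(text):
    """Remove illegal characters, segment the text, and drop stop words."""
    # Keep only CJK characters, letters and digits; everything else is
    # treated as an "illegal character" here (a simplifying assumption).
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)
    tokens = jieba.lcut(text)
    return [t for t in tokens if t and t not in STOP_WORDS]

print(preprocess("数据增强是一种常用的数据处理方法！"))
```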
Step S120, inputting the original text sample into a pre-trained topic model to obtain topic distribution probability information corresponding to each text sentence in the original text sample.
In one embodiment, the topic model is a latent Dirichlet allocation topic model, and the tool used is the Gensim tool. In this embodiment, a latent Dirichlet allocation topic model is used to obtain the probability distribution from each basic word (for example, each topic word obtained by word segmentation) to a topic in the preprocessed original text sample, and the probability distribution from the original text sample to the topics. It is understood that the latent Dirichlet allocation topic model may be computed using calculation methods existing or well known in the art, as long as they can be applied to the present invention. The computation of the latent Dirichlet allocation topic model may be carried out in various ways, for example by a single training thread of one processor, by multiple training threads of multiple processors, or even in a distributed manner.
In an embodiment, the latent Dirichlet allocation topic model is first trained. Referring to fig. 2, the process of training the latent Dirichlet allocation topic model includes, but is not limited to, steps S210 to S240:
step S210, a training sample set of a preset field is obtained.
In an embodiment, the training sample set includes training text samples that are not labeled in a preset field and corresponding probability labels. It is understood that the above text preprocessing operations may be performed on the training text samples as well, wherein the text preprocessing includes but is not limited to: and removing illegal characters and word segmentation and the like.
Step S220, inputting the training text sample into the initial theme model, and obtaining the predicted theme distribution probability of the training text sample according to the preset theme number.
And step S230, calculating to obtain a loss value according to the distribution probability of the predicted theme and the corresponding probability label.
In an embodiment, a preset number of topics is first set, and the preset number of topics may be set according to prior knowledge or actual requirements, which is not specifically limited herein. Inputting the training text sample into an initial topic model for iterative processing, calculating the initial topic model, and obtaining the predicted topic distribution probability of the training text sample, wherein the predicted topic distribution probability comprises: a base word to topic probability distribution of the training text sample and a training text sample to topic probability distribution.
In one embodiment, the probability distribution of the base words to the topics is a word-to-topic matrix whose rows are the words and whose columns are the implicitly calculated topics. The probability distribution of the training text samples to the topics is a text-to-topic matrix whose rows are the training text samples and whose columns are the implicitly calculated topics. The word vector of a topic is the corresponding column vector of the word-to-topic matrix. These matrices are initialized with random values, and each value is gradually optimized through Gibbs sampling iterations to obtain the predicted topic distribution probability; finally, the clustering of words can be obtained from the word-to-topic matrix, thereby deriving the keywords. In this embodiment, the loss value is calculated from the predicted topic distribution probability and the corresponding probability label.
And S240, adjusting the model weight of the initial topic model according to the loss value by using the loss function until the loss function meets the convergence condition, and training to obtain the topic model.
In an embodiment, after iteration, the loss function is used for judging whether the basic LDA model reaches the convergence condition according to the loss value, so that the model weight of the initial topic model is adjusted until the loss function meets the convergence condition, and the topic model is obtained through training. For example, the convergence condition may be the number of iterations, and if the convergence condition is not reached, the initial topic model of each training text sample is continuously calculated; and if the convergence condition is reached, training to obtain a topic model.
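A minimal training sketch with the Gensim toolkit named above. Note that Gensim's LdaModel trains by online variational Bayes rather than an explicit loss-threshold loop, so the passes parameter stands in for the convergence condition here; the topic number and the toy corpus are assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel

# `train_docs` stands in for the preprocessed, tokenized training samples.
train_docs = [["数据", "增强", "文本"], ["主题", "模型", "训练"]]

dictionary = corpora.Dictionary(train_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in train_docs]

NUM_TOPICS = 20  # the "preset number of topics"; the value is an assumption

lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=NUM_TOPICS,
    passes=10,        # multiple passes stand in for "iterate until convergence"
    random_state=42,
)

# Document-to-topic distribution p(t|s) for the first sample:
print(lda.get_document_topics(bow_corpus[0], minimum_probability=0.0))
```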
In an embodiment, according to the pre-trained topic model, topic distribution probability information corresponding to each sentence in an original text sample is obtained, where the topic distribution probability information includes: the topic distribution probability of the topic word and the topic distribution probability of the sentence.
Step S130, calculating the contribution value of each topic word in each sentence to the text sentence according to the topic distribution probability information.
In one embodiment, the topic distribution probability of the text sentence is calculated according to a first formula, the topic distribution probability of each topic word is obtained, and finally the preset smoothing parameter, the topic distribution probability of the topic word and the topic distribution probability of the text sentence are multiplied according to a second formula to obtain the contribution value;
specifically, the first formula is expressed as:

$$p(t \mid s) = \frac{1}{N} \sum_{i=1}^{N} p(t \mid \omega_i)$$

and the second formula is expressed as:

$$C_{\omega_i}(s) = \tau \sum_{t} p(t \mid \omega_i)\, p(t \mid s)$$

wherein $\omega_i$ denotes a subject term, $s = (\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_N)$ represents a text sentence containing N subject words, $C_{\omega_i}(s)$ represents the contribution of the topic word to the text sentence, $p(t \mid \omega_i)$ denotes the topic distribution probability of the topic word, $p(t \mid s)$ denotes the topic distribution probability of the text sentence, and $\tau$ denotes the smoothing parameter.
In one embodiment, the smoothing parameter τ is used to control the smoothness of the word replacement probability, which may typically be 0.75.
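The following sketch turns these two formulas into code on top of the Gensim model trained above. Since the formula images in the original filing are lost, the reading used here (p(t|s) as the average of the word-level topic distributions, and the contribution as τ times the topic-wise product of the two distributions) is an assumption recovered from the surrounding prose; all helper names are illustrative:

```python
import numpy as np

TAU = 0.75  # smoothing parameter from the description

def topic_vector(pairs, num_topics):
    """Expand Gensim's sparse (topic_id, prob) pairs into a dense vector."""
    vec = np.zeros(num_topics)
    for t, p in pairs:
        vec[t] = p
    return vec

def contribution(lda, dictionary, sentence_tokens, word, num_topics):
    """Contribution of `word` to the sentence topic, under the assumed
    reading of the first and second formulas above."""
    if word not in dictionary.token2id:
        return 0.0  # a word outside the topic vocabulary contributes zero
    word_vecs = [
        topic_vector(lda.get_term_topics(dictionary.token2id[w],
                                         minimum_probability=0.0), num_topics)
        for w in sentence_tokens if w in dictionary.token2id
    ]
    if not word_vecs:
        return 0.0
    p_t_s = np.mean(word_vecs, axis=0)  # p(t|s): first formula
    p_t_w = topic_vector(lda.get_term_topics(dictionary.token2id[word],
                                             minimum_probability=0.0), num_topics)
    return TAU * float(np.dot(p_t_w, p_t_s))  # second formula
```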
Step S140, calculating to obtain the replacement probability of the subject word according to the contribution value of the subject word, and selecting a word to be replaced from the text sentence according to the replacement probability to obtain a word set to be replaced.
In one embodiment, a contribution value of each topic word in the sentence to the topic of the text sentence is obtained, a replacement probability of the topic word is further calculated according to the contribution value of the topic word to the text sentence, and a word to be replaced in the text sentence is selected according to the replacement probability to obtain a word set to be replaced. In addition, if a text sentence does not contain a subject word, the contribution value of the subject word to the text sentence is zero. Referring to fig. 3, step S140 includes, but is not limited to, step S141 to step S143:
step S141, calculating a replacement probability of the subject word according to the contribution value of the subject word in the text sentence.
In one embodiment, the maximum contribution value of all the subject words in the text sentence is first calculated according to a third formula, the sum of the differences between the maximum contribution value and the contribution value of each subject word is then calculated according to a fourth formula, and finally the replacement probability of each subject word is calculated according to a fifth formula from its contribution value, the maximum contribution value and the sum of the differences.

The maximum contribution value of all subject words in the text sentence is expressed as:

$$M = \max_{1 \le j \le N} C_{\omega_j}(s)$$

Then the difference between the maximum contribution value and the contribution value of each subject term is calculated, and all the differences are summed to obtain the contribution value sum, expressed as:

$$Z = \sum_{j=1}^{N} \bigl( M - C_{\omega_j}(s) \bigr)$$

Finally, the ratio of each difference to the contribution value sum gives the replacement probability of the subject term, expressed as:

$$p(\omega_i) = \frac{M - C_{\omega_i}(s)}{Z}$$

wherein $C_{\omega_i}(s)$ represents the contribution of the subject word to the text sentence, $p(\omega_i)$ represents the replacement probability of the subject word, $M$ represents the maximum contribution value of all subject words in the text sentence, and $Z$ represents the sum, over all subject words, of the maximum contribution value minus the contribution value.
In one embodiment, the higher the contribution value $C_{\omega_i}(s)$ of a topic word to the topic of the text sentence, the lower its replacement probability $p(\omega_i)$; a topic word whose contribution value is high enough is not considered for replacement at all. This ensures that the topic meaning of the sentence in the preset field is preserved.
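A sketch of the third to fifth formulas, mapping contribution values to replacement probabilities; function and variable names are illustrative:

```python
import numpy as np

def replacement_probabilities(contribs):
    """Per the third-fifth formulas: the word with the highest contribution
    gets replacement probability zero, and lower contributions get higher
    probabilities, normalized to sum to one."""
    c = np.asarray(contribs, dtype=float)
    m = c.max()            # maximum contribution M (third formula)
    diffs = m - c          # difference to the maximum, per word
    z = diffs.sum()        # normalizer Z (fourth formula)
    if z == 0.0:           # degenerate case: all words contribute equally
        return np.full_like(c, 1.0 / len(c))
    return diffs / z       # fifth formula

print(replacement_probabilities([0.9, 0.2, 0.5, 0.1]))
```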
And S142, sampling according to the number of preset replacement words and the replacement probability to obtain words to be replaced.
And step S143, forming a word set to be replaced by using the words to be replaced.
In one embodiment, a preset number r of replacement words is randomly drawn, where r obeys a geometric distribution:

$$P(X = r) = p(1-p)^{r-1}$$

In one embodiment, the geometric distribution is a discrete probability distribution defined as follows: in a sequence of Bernoulli trials, the first success occurs only at the k-th trial, i.e. the first k−1 trials fail and the k-th trial succeeds. In the Bernoulli trial, the probability of success is p, and the value of p in this embodiment may be 0.5, which is not specifically limited herein.
After the preset number r of replacement words is obtained, r words are randomly sampled from the text sentence according to the replacement probabilities derived from the contribution values of the topic words to the text sentence topic, the sampled words serve as the words to be replaced, and combining them yields the word set to be replaced.
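A sampling sketch under the stated p = 0.5 assumption; the cap on r (so that sampling without replacement stays feasible) is an added safeguard, not part of the original description:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_words_to_replace(tokens, probs, p=0.5):
    """Draw r ~ Geometric(p) and sample r distinct words to replace,
    weighted by their replacement probabilities."""
    r = int(rng.geometric(p))
    r = min(r, int(np.count_nonzero(probs)))  # safeguard: keep sampling feasible
    idx = rng.choice(len(tokens), size=r, replace=False, p=probs)
    return [tokens[i] for i in idx]
```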
And S150, screening candidate words from a word vector set obtained by pre-training according to the words to be replaced in the word set to be replaced.
In an embodiment, the pre-trained word vector set is obtained by training with the Word2Vec tool. Word2Vec is a tool open-sourced by Google in 2013 that represents words as real-valued vectors. Using ideas from deep learning, training simplifies the processing of the text content of the training text samples into vector operations in a K-dimensional vector space, and similarity in the vector space can represent the similarity of the training text samples in text semantics. Referring to fig. 4, the step of pre-training the word vector set includes, but is not limited to, steps S410 to S430:
and step S410, acquiring training text samples in a preset field.
In one embodiment, the training text samples may be the training text samples of the preset field used in training the latent Dirichlet allocation topic model.
And step S420, training the training text sample by using a Word2vec tool to obtain a pre-training Word vector.
Step S430, a word vector set is formed by using the pre-training word vectors.
In an embodiment, a pre-training word vector related to the preset field is obtained after training, and the word vector dimension setting during training can be determined according to the specific corpus magnitude, which is not specifically limited herein.
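A minimal Word2vec training sketch with Gensim; the vector dimension and other hyperparameters are assumptions, since the description leaves them to the corpus magnitude:

```python
from gensim.models import Word2Vec

# `train_docs` is the same tokenized domain corpus used for the topic model.
w2v = Word2Vec(
    sentences=train_docs,
    vector_size=200,  # "determined by the corpus magnitude"; value assumed
    window=5,
    min_count=2,
    workers=4,
    sg=1,             # skip-gram; CBOW (sg=0) would also fit the description
)
w2v.save("domain_word2vec.model")  # hypothetical output path
```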
In an embodiment, after the word vector set is obtained, a word similar to the word to be replaced in the word set to be replaced is selected as a candidate word in the word vector set. Referring to fig. 5, step S150 includes, but is not limited to, step S151 to step S153:
step S151, calculating the distance between the word to be replaced in the word set to be replaced and the pre-training word vector in the word vector set in the vector space.
In an embodiment, the distance matrix may be obtained by calculating a distance between a word to be replaced in the word set to be replaced and a pre-training word vector in the word vector set, and the vector distance may be determined by using a method for determining a distance between two vectors in the prior art or a future-developed technology, which is not limited in this application. For example, the distance is calculated by using an euclidean distance formula or a cosine distance formula.
In an embodiment, the distance between a word to be replaced in the word set to be replaced and a pre-training word vector in the word vector set is represented in the distance matrix as $d_{ij}$, where i and j indicate the row and column of the distance matrix, respectively. In this embodiment, synonyms of the word to be replaced are selected from the word vector set by distance: whether two words are synonyms depends on the distance between their word vectors in the matrix, and if that distance is within a preset distance, the two words can be determined to be synonyms of each other. Synonyms of the word to be replaced can therefore be found conveniently and quickly through the distance matrix, improving the reliability of the replacement.
Step S152, the distances are sorted to obtain a distance sorting result.
In one embodiment, the distances in the distance matrix are sorted to obtain a distance sorting result, and then a preset distance is used to determine which pre-training word vectors in the word vector set and the words to be replaced in the word set to be replaced belong to synonyms.
Step S153, selecting a preset number of words from the word vector set as candidate words according to the distance sorting result.
In an embodiment, the pre-training word vectors in the word vector set whose distance between two word vectors in the distance matrix is within the preset distance are selected as candidate words of the corresponding word to be replaced according to the distance sorting result, for example, k pre-training word vectors meet the criteria of the candidate words.
In an embodiment, if there are many candidate words in the word vector set, the required number of candidate words may be selected according to the position distribution of the candidate words. For example, the position s of the candidate word in the ranked word vector set may be set to follow a geometric distribution, expressed as:

$$P(S = s) = q(1-q)^{s-1}$$

wherein the value of q may be 0.5, which is not specifically limited herein.
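A candidate-selection sketch: the k nearest neighbours (by cosine similarity, Gensim's default) are ranked, and the final candidate's rank follows the geometric distribution above with q = 0.5; the value of k and the helper name are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_candidate(w2v, word, k=10, q=0.5):
    """Rank the k nearest neighbours of `word` and choose one at a rank
    s ~ Geometric(q), so nearer neighbours are preferred but not certain."""
    if word not in w2v.wv:
        return None
    neighbours = w2v.wv.most_similar(word, topn=k)   # [(word, cosine_sim), ...]
    s = min(int(rng.geometric(q)), len(neighbours))  # geometric rank, 1-based
    return neighbours[s - 1][0]
```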
And step S160, replacing the word to be replaced with the candidate word to obtain a data enhanced text sample.
In an embodiment, one or more updated sentences can be obtained by changing the words to be replaced in the sentences into candidate words, and one or more data-enhanced text samples can be obtained by correspondingly updating each sentence in the original text samples. It can be understood that the same sentence may be enhanced for a plurality of times, and the specific times are different in different data sets, for example, the same sentence may be enhanced for 2 to 4 times, and after each sentence in the original text sample is enhanced, one or more data-enhanced labeled corpora, that is, data-enhanced text samples, are obtained.
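A replacement sketch tying the previous helpers together for one sentence; the names come from the earlier sketches and are all hypothetical:

```python
def augment_sentence(tokens, words_to_replace, w2v):
    """Replace each selected word with a candidate from the word-vector set;
    words without an in-vocabulary candidate are left unchanged."""
    replacements = {}
    for w in words_to_replace:
        cand = pick_candidate(w2v, w)  # from the sketch above
        if cand is not None:
            replacements[w] = cand
    return [replacements.get(t, t) for t in tokens]
```

Running this two to four times per sentence, as suggested above, yields several enhanced variants of the same labeled sample.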
In addition, fig. 6 is a schematic flow chart of a data enhancement method according to an embodiment of the present application.
And S600, acquiring a large number of unlabeled text samples in a preset field as training text samples.
Step S610, performing text preprocessing on the training text sample, where the text preprocessing includes, but is not limited to: removing illegal characters, removing stop words, removing redundant words and participles, etc.
And S620, training the initial topic model by using the training text sample to obtain a topic model.
Step S630, training the training text sample by using a Word2vec tool to obtain a pre-training Word vector set.
And step S640, taking the marked text sample in the preset field as an original text sample to enhance the text sample data.
Step S650, performing text preprocessing on the original text sample, wherein the text preprocessing includes but is not limited to: removing illegal characters, removing stop words, removing redundant words and participles, etc.
Step S660, calculating a contribution value of each topic word in the original text sample to the text sentence, specifically: obtaining topic distribution probability information corresponding to each sentence in the original text sample by using the topic model obtained in step S620, and calculating the contribution value of each topic word in each sentence to the text sentence according to the topic distribution probability information.
Step S670, obtaining a set of words to be replaced according to the contribution value of the topic word to the text sentence, specifically: and calculating to obtain the replacement probability of the subject word according to the contribution value of the subject word to the text sentence, and selecting the words to be replaced in the text sentence according to the replacement probability to obtain a set of words to be replaced.
Step S680, selecting a candidate word from the word vector set obtained in step S630, specifically: and selecting words similar to the words to be replaced in the word set to be replaced as candidate words in the word vector set.
Step S690, obtaining a data enhanced text sample, specifically: and replacing the word to be replaced by the candidate word.
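The flow of fig. 6, assembled from the hypothetical helpers sketched in the preceding sections; this is an illustrative composition under the assumptions stated there, not the patent's reference implementation:

```python
def enhance_corpus(labeled_samples, lda, dictionary, w2v, num_topics, n_aug=2):
    """End-to-end data enhancement over (text, label) pairs, following
    steps S650-S690 with the helpers defined in the earlier sketches."""
    enhanced = []
    for text, label in labeled_samples:
        tokens = preprocess(text)                                   # S650
        if not tokens:
            continue
        contribs = [contribution(lda, dictionary, tokens, w, num_topics)
                    for w in tokens]                                # S660
        probs = replacement_probabilities(contribs)                 # S670
        for _ in range(n_aug):                                      # S680
            to_replace = sample_words_to_replace(tokens, probs)
            enhanced.append((augment_sentence(tokens, to_replace, w2v),
                             label))                                # S690
    return enhanced
```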
The data enhancement method provided by the embodiment of the invention obtains an original text sample and inputs it into a pre-trained topic model to obtain the topic distribution probability information corresponding to each sentence in the original text sample; calculates the contribution value of each topic word in each sentence to the text sentence according to the topic distribution probability information; calculates the replacement probability of each subject word according to its contribution value to the text sentence; selects the words to be replaced in the text sentence according to the replacement probability to obtain a set of words to be replaced; selects words similar to the words to be replaced from the pre-trained word vector set as candidate words; and finally replaces the words to be replaced with the candidate words to obtain the data-enhanced text sample.
In the embodiment, the topic distribution probability information corresponding to each sentence in the original text sample is obtained by using the topic model, so that the contribution value of each word in the sentence to the text sentence topic is well measured, the data enhancement can be ensured to be completed under the condition that the sentence topic distribution is not influenced, and meanwhile, by means of the pre-training word vector, a word with a similar meaning to the word to be replaced can be selected as a replacement word, so that the semantic information of the sentence is ensured to the maximum extent.
In addition, an embodiment of the present invention further provides a text sample data enhancement device, which can implement the data enhancement method described above, and with reference to fig. 7, the device includes:
a sample obtaining module 710, configured to obtain an original text sample to be enhanced; the original text sample comprises at least one text sentence, and the original text sample comprises at least one subject word;
the topic distribution probability calculation module 720 is configured to input the original text sample into a topic model obtained through pre-training to obtain topic distribution probability information corresponding to each text sentence, where the topic model is a latent Dirichlet allocation topic model;
a contribution value calculating module 730, configured to calculate a contribution value of each topic word to the text sentence according to the topic distribution probability information;
the to-be-replaced word selection module 740 is configured to calculate a replacement probability of the subject word according to the contribution value of the subject word, and select a to-be-replaced word from the text sentence according to the replacement probability to obtain a to-be-replaced word set;
the candidate word selecting module 750 is configured to filter a word vector set obtained through pre-training according to a word to be replaced in the word set to be replaced to obtain a candidate word;
and the data enhancement module 760 is configured to replace the word to be replaced with the candidate word to obtain a data enhanced text sample.
In one embodiment, the topic distribution probability information includes the topic distribution probability of the topic word and the topic distribution probability of the sentence, and the contribution value of each topic word to the text sentence in the contribution value calculation module 730 is expressed as:

$$p(t \mid s) = \frac{1}{N} \sum_{i=1}^{N} p(t \mid \omega_i)$$

$$C_{\omega_i}(s) = \tau \sum_{t} p(t \mid \omega_i)\, p(t \mid s)$$

wherein $\omega_i$ denotes a subject term, $s = (\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_N)$ represents a text sentence containing N subject words, $C_{\omega_i}(s)$ represents the contribution value of the topic word to the text sentence, $p(t \mid \omega_i)$ denotes the topic distribution probability of the topic word, $p(t \mid s)$ denotes the topic distribution probability of the text sentence, and $\tau$ denotes the smoothing parameter.
In an embodiment, the to-be-replaced word selection module 740 is further configured to calculate the replacement probability of each topic word according to its contribution value to the text sentence, then sample the words to be replaced according to the preset number of replacement words and the replacement probabilities, and finally form the set of words to be replaced from the sampled words.
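A short sketch of this module's sampling step, assuming the max-difference scheme spelled out in claim 5 below; the function names are illustrative.

```python
import numpy as np

def replacement_probabilities(contributions):
    """p_i = (C_max - C_i) / sum_j (C_max - C_j), per claim 5: words that
    contribute least to the sentence topic are replaced most often."""
    c = np.asarray(contributions, dtype=float)
    diff = c.max() - c
    total = diff.sum()
    if total == 0:                        # all contributions equal: fall back to uniform
        return np.full(len(c), 1.0 / len(c))
    return diff / total

def sample_words_to_replace(words, contributions, n_replace, seed=0):
    """Sample the preset number of positions without repetition."""
    rng = np.random.default_rng(seed)
    p = replacement_probabilities(contributions)
    n = min(n_replace, int(np.count_nonzero(p)))
    idx = rng.choice(len(words), size=n, replace=False, p=p)
    return sorted((int(i), words[i]) for i in idx)

print(sample_words_to_replace(["bank", "loan", "river"], [0.40, 0.38, 0.30], 2))
```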
In an embodiment, the candidate word selection module 750 is further configured to calculate the distances in vector space between the words to be replaced in the set and the pre-trained word vectors in the word vector set obtained through pre-training, then sort the distances to obtain a distance ranking result, and finally select a preset number of words from the word vector set as candidate words according to the ranking result, where the position distribution of the candidate words in the word vector set obeys a geometric distribution.
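An illustrative sketch of the distance ranking and the geometric position draw just described; the vocabulary and vectors are made up, and cosine distance is assumed as the vector-space distance (the patent does not fix a metric).

```python
import numpy as np

def candidate_words(target_vec, vocab, vectors, k=2, p_geom=0.5, seed=0):
    """Rank the vocabulary by cosine distance to the word being replaced,
    then draw k positions from a geometric distribution so that nearer
    words are selected more often."""
    rng = np.random.default_rng(seed)
    v = np.asarray(vectors, dtype=float)
    t = np.asarray(target_vec, dtype=float)
    cosine = v @ t / (np.linalg.norm(v, axis=1) * np.linalg.norm(t) + 1e-12)
    order = np.argsort(1.0 - cosine)                      # most similar word first
    picks = []
    while len(picks) < min(k, len(vocab)):
        rank = (rng.geometric(p_geom) - 1) % len(vocab)   # 0-based geometric rank
        word = vocab[int(order[rank])]
        if word not in picks:                             # keep candidates distinct
            picks.append(word)
    return picks

vocab = ["credit", "mortgage", "stream", "money"]
vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]
print(candidate_words([0.85, 0.15], vocab, vecs, k=2))
```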
The specific implementation of the text sample data enhancement apparatus of this embodiment is substantially the same as the specific implementation of the data enhancement method, and is not described herein again.
An embodiment of the present invention further provides an electronic device, including:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement the data enhancement method described above. The electronic device can be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer, and the like.
Referring to fig. 8, fig. 8 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 801 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solution provided by the embodiment of the present invention;
the memory 802 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 802 may store an operating system and other application programs; when the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program code is stored in the memory 802 and called by the processor 801 to execute the data enhancement method of the embodiments of the present disclosure;
an input/output interface 803 for realizing input and output of information;
the communication interface 804 is used for communication interaction between the device and other devices, which may be wired (such as USB or a network cable) or wireless (such as a mobile network, Wi-Fi, or Bluetooth); and
a bus 805 that transfers information between the various components of the device (e.g., the processor 801, memory 802, input/output interfaces 803, and communication interface 804);
wherein the processor 801, the memory 802, the input/output interface 803 and the communication interface 804 are communicatively connected to each other within the device via a bus 805.
The embodiment of the present invention also provides a storage medium, namely a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the data enhancement method described above.
According to the data enhancement method, the text sample data enhancement device, the electronic equipment, and the storage medium described above, the topic distribution probability information corresponding to each sentence in the original text sample is obtained through the topic model, so the contribution of each word to the topic of its text sentence is measured accurately and data enhancement can be completed without disturbing the topic distribution of the sentence. Meanwhile, by means of the pre-trained word vectors, a word with a meaning similar to that of the word to be replaced can be selected as the replacement word, preserving the semantic information of the sentence to the maximum extent.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software program and the non-transitory computer-executable program provided by this embodiment can be used to perform the following steps: obtaining an original text sample; inputting the original text sample into a pre-trained topic model to obtain topic distribution probability information corresponding to each sentence in the original text sample; calculating the contribution value of each topic word in each sentence to the text sentence according to the topic distribution probability information; calculating the replacement probability of each topic word according to its contribution value to the text sentence; selecting words to be replaced from the text sentence according to the replacement probability to obtain a set of words to be replaced; selecting, from a word vector set obtained by pre-training, words similar to the words to be replaced as candidate words; and finally replacing the words to be replaced with the candidate words to obtain a data-enhanced text sample.
The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present invention more clearly and do not constitute a limitation on those solutions. Those skilled in the art will appreciate that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided in the embodiments of the present invention are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in figs. 1-5 do not limit the embodiments of the present invention, which may include more or fewer steps than those shown, combine some steps, or use different steps.
The above-described embodiments of the apparatus are merely illustrative: units described as separate components may or may not be physically separate, and may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the invention and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is to be understood that, in the present invention, "at least one" means one or more and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a; b; c; "a and b"; "a and c"; "b and c"; or "a and b and c", where a, b, and c may each be singular or plural.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a plurality of instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not intended to limit the scope of the embodiments of the invention. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present invention are intended to be within the scope of the claims of the embodiments of the present invention.

Claims (10)

1. A method of data enhancement, the method comprising:
obtaining an original text sample to be enhanced, wherein the original text sample comprises at least one text sentence and at least one topic word;
inputting the original text sample into a topic model obtained by pre-training to obtain topic distribution probability information corresponding to each text sentence, wherein the topic model is a latent Dirichlet allocation (LDA) topic model;
calculating the contribution value of each topic word to the text sentence according to the topic distribution probability information;
calculating the replacement probability of the topic word according to the contribution value of the topic word, and selecting words to be replaced from the text sentence according to the replacement probability to obtain a set of words to be replaced;
screening candidate words from a word vector set obtained through pre-training according to the words to be replaced in the word set to be replaced;
and replacing the word to be replaced by the candidate word to obtain a data enhanced text sample.
2. The data enhancement method of claim 1, wherein before inputting the original text sample into the pre-trained topic model, the method further comprises:
acquiring a training sample set of a preset field, wherein the training sample set comprises unlabeled training text samples and corresponding probability labels;
inputting the training text sample into an initial topic model, and obtaining the predicted topic distribution probability of the training text sample according to a preset number of topics;
calculating to obtain a loss value according to the predicted topic distribution probability and the corresponding probability label;
and adjusting the model weights of the initial topic model according to the loss value until the loss function meets a convergence condition, so as to obtain the trained topic model.
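By way of non-limiting illustration of the pre-training described in claim 2, the sketch below trains a standard LDA topic model with the gensim library (an assumed toolkit; the claim names none). gensim optimises a variational bound internally until convergence, standing in for the explicit loss/convergence loop of claim 2; the corpus, topic count, and all names are illustrative.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Unlabelled training texts from the target field (made-up examples).
train_texts = [["bank", "loan", "credit", "rate"],
               ["river", "stream", "water", "flood"],
               ["bank", "deposit", "interest", "loan"]]

dictionary = Dictionary(train_texts)
corpus = [dictionary.doc2bow(t) for t in train_texts]

# Preset number of topics; gensim iterates until its variational bound
# converges, playing the role of the loss/convergence loop in claim 2.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=100, random_state=0)

# Topic distribution probability of a new text sentence.
bow = dictionary.doc2bow(["bank", "loan"])
print(lda.get_document_topics(bow, minimum_probability=0.0))
```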
3. The data enhancement method of claim 1, wherein the topic distribution probability information comprises the topic distribution probability of the topic word and the topic distribution probability of the text sentence, and wherein calculating the contribution value of each topic word to the text sentence according to the topic distribution probability information comprises:
calculating the topic distribution probability of the text sentence according to a first formula;
calculating the topic distribution probability of the topic words;
multiplying a preset smoothing parameter, the topic distribution probability of the topic word, and the topic distribution probability of the text sentence to obtain the contribution value;
wherein the first formula is:
$$p(t \mid s) = \frac{1}{N} \sum_{i=1}^{N} p(t \mid \omega_i)$$

where $\omega_i$ denotes a topic word, $s = (\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_N)$ denotes a text sentence containing N topic words, $p(t \mid \omega_i)$ denotes the topic distribution probability of a topic word, and $p(t \mid s)$ denotes the topic distribution probability of the text sentence.
4. The method according to claim 3, wherein calculating the replacement probability of the topic word according to the contribution value of the topic word, and selecting words to be replaced from the text sentence according to the replacement probability to obtain the set of words to be replaced, comprises:
calculating the replacement probability of the topic word according to the contribution value of the topic word to the text sentence;
sampling according to the preset number of replacement words and the replacement probability to obtain the words to be replaced;
and forming the set of words to be replaced from the words to be replaced.
5. The method of claim 4, wherein calculating the replacement probability of the topic word according to the contribution value of the topic word to the text sentence comprises:
calculating the maximum contribution value among all topic words in the text sentence;
calculating the difference between the maximum contribution value and the contribution value of each topic word, and summing all the differences to obtain a contribution-value sum;
and calculating the ratio of each difference to the contribution-value sum to obtain the replacement probability of the topic word.
6. The data enhancement method according to claim 1, wherein before the candidate words are obtained by screening from a pre-trained word vector set according to the words to be replaced in the word set to be replaced, the method further comprises:
acquiring a training text sample of a preset field;
training the training text sample by using a Word2vec tool to obtain a pre-training Word vector;
and forming the word vector set by using the pre-training word vectors.
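By way of non-limiting illustration of claim 6, a minimal sketch assuming gensim's implementation of the Word2vec tool (the claim names the tool but not a specific library); the training texts are made up.

```python
from gensim.models import Word2Vec

# Made-up training text samples from the preset field.
train_texts = [["bank", "loan", "credit", "rate"],
               ["river", "stream", "water", "flood"],
               ["bank", "deposit", "interest", "loan"]]

# Train pre-training word vectors with the Word2vec tool.
model = Word2Vec(sentences=train_texts, vector_size=50, window=3,
                 min_count=1, seed=0)

# The word vector set: one vector per vocabulary word.
word_vectors = model.wv
print(word_vectors.most_similar("loan", topn=2))   # nearest words in vector space
```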
7. The data enhancement method according to any one of claims 1 to 6, wherein the obtaining of candidate words by screening from a word vector set obtained by pre-training according to the words to be replaced in the word set to be replaced comprises:
calculating the distance between the word to be replaced in the word set to be replaced and the pre-training word vector in the word vector set in the vector space;
sorting the distances to obtain a distance sorting result;
and selecting a preset number of words from the word vector set as the candidate words according to the distance sorting result, wherein the position distribution of the candidate words in the word vector set obeys a geometric distribution.
8. A text sample data enhancement apparatus, comprising:
the sample acquisition module is used for acquiring an original text sample to be enhanced, wherein the original text sample comprises at least one text sentence and at least one topic word;
the topic distribution probability calculation module is used for inputting the original text sample into a topic model obtained by pre-training to obtain topic distribution probability information corresponding to each text sentence, the topic model being a latent Dirichlet allocation (LDA) topic model;
the contribution value calculating module is used for calculating the contribution value of each topic word to the text sentence according to the topic distribution probability information;
the to-be-replaced word selection module is used for calculating the replacement probability of the topic word according to the contribution value of the topic word, and selecting words to be replaced from the text sentence according to the replacement probability to obtain a set of words to be replaced;
the candidate word selection module is used for screening a word vector set obtained by pre-training according to the words to be replaced in the word set to be replaced to obtain candidate words;
and the data enhancement module is used for replacing the word to be replaced by the candidate word to obtain a data enhancement text sample.
9. An electronic device, comprising:
at least one memory;
at least one processor;
the memory stores a computer program which, when executed by the processor, performs:
the method of any one of claims 1 to 7.
10. A storage medium that is a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for causing a computer to execute:
the method of any one of claims 1 to 7.
CN202210163920.XA 2022-02-22 2022-02-22 Data enhancement method and device, electronic equipment and storage medium Pending CN114595327A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210163920.XA CN114595327A (en) 2022-02-22 2022-02-22 Data enhancement method and device, electronic equipment and storage medium
PCT/CN2022/090666 WO2023159758A1 (en) 2022-02-22 2022-04-29 Data enhancement method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210163920.XA CN114595327A (en) 2022-02-22 2022-02-22 Data enhancement method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114595327A (en) 2022-06-07

Family

ID=81806358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210163920.XA Pending CN114595327A (en) 2022-02-22 2022-02-22 Data enhancement method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114595327A (en)
WO (1) WO2023159758A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473095B (en) * 2023-12-27 2024-03-29 合肥工业大学 Short text classification method and system based on theme enhancement word representation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8510257B2 (en) * 2010-10-19 2013-08-13 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment
US11636338B2 (en) * 2020-03-20 2023-04-25 International Business Machines Corporation Data augmentation by dynamic word replacement
CN113553806B (en) * 2021-09-22 2021-11-19 中国人民解放军国防科技大学 Text data enhancement method, device, equipment and medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992830A (en) * 2022-06-17 2023-11-03 北京聆心智能科技有限公司 Text data processing method, related device and computing equipment
CN116992830B (en) * 2022-06-17 2024-03-26 北京聆心智能科技有限公司 Text data processing method, related device and computing equipment
CN115146623A (en) * 2022-07-26 2022-10-04 北京有竹居网络技术有限公司 Text word replacing method and device, storage medium and electronic equipment
CN116306602A (en) * 2023-05-23 2023-06-23 中债金科信息技术有限公司 Text data enhancement method and device, electronic equipment and storage medium
CN116414965A (en) * 2023-05-25 2023-07-11 北京聆心智能科技有限公司 Initial dialogue content generation method, device, medium and computing equipment
CN116414965B (en) * 2023-05-25 2023-08-22 北京聆心智能科技有限公司 Initial dialogue content generation method, device, medium and computing equipment

Also Published As

Publication number Publication date
WO2023159758A1 (en) 2023-08-31

Similar Documents

Publication Publication Date Title
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
CN114595327A (en) Data enhancement method and device, electronic equipment and storage medium
CN108182177A (en) A kind of mathematics knowledge-ID automation mask method and device
CN110705206A (en) Text information processing method and related device
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN110929038A (en) Entity linking method, device, equipment and storage medium based on knowledge graph
CN114722069A (en) Language conversion method and device, electronic equipment and storage medium
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
Banik et al. Gru based named entity recognition system for bangla online newspapers
CN113704428A (en) Intelligent inquiry method, device, electronic equipment and storage medium
CN116561538A (en) Question-answer scoring method, question-answer scoring device, electronic equipment and storage medium
CN114358201A (en) Text-based emotion classification method and device, computer equipment and storage medium
CN114255096A (en) Data requirement matching method and device, electronic equipment and storage medium
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN115238039A (en) Text generation method, electronic device and computer-readable storage medium
AU2018226420B2 (en) Voice assisted intelligent searching in mobile documents
CN114490949A (en) Document retrieval method, device, equipment and medium based on BM25 algorithm
CN110969005A (en) Method and device for determining similarity between entity corpora
Setya et al. Semi-supervised textual entailment on indonesian wikipedia data
CN116757195A (en) Implicit emotion recognition method based on prompt learning
CN114492437B (en) Keyword recognition method and device, electronic equipment and storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN114936274A (en) Model training method, dialogue generating device, dialogue training equipment and storage medium
CN115391534A (en) Text emotion reason identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination