CN116383644A - Data enhancement method, device, equipment and storage medium - Google Patents

Data enhancement method, device, equipment and storage medium

Info

Publication number
CN116383644A
CN116383644A
Authority
CN
China
Prior art keywords
data set
texts
enhancement
text
preset
Prior art date
Legal status
Pending
Application number
CN202310265899.9A
Other languages
Chinese (zh)
Inventor
赵天棋
谭瑞
罗川
Current Assignee
Chongqing Changan Automobile Co Ltd
Original Assignee
Chongqing Changan Automobile Co Ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Changan Automobile Co Ltd filed Critical Chongqing Changan Automobile Co Ltd
Priority to CN202310265899.9A priority Critical patent/CN116383644A/en
Publication of CN116383644A publication Critical patent/CN116383644A/en
Pending legal-status Critical Current

Classifications

    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application relates to a data enhancement method, device, equipment, and storage medium in the technical field of data processing. The method comprises the following steps: performing data preprocessing on an original data set to obtain a processed data set, wherein the data preprocessing removes preset texts included in the original data set; respectively inputting a plurality of first texts included in the processed data set into a constructed target model to generate an associated text corresponding to each first text, thereby obtaining an enhanced data set comprising a plurality of second texts, wherein the plurality of second texts comprise the plurality of first texts and the associated texts corresponding to the first texts; performing data cleaning processing on the plurality of second texts included in the enhanced data set to obtain a cleaned enhanced data set, wherein the data cleaning processing screens the plurality of second texts based on a preset threshold; and combining the original data set with the cleaned enhanced data set to obtain a data-enhanced training data set.

Description

Data enhancement method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data enhancement method, device, apparatus, and storage medium.
Background
In the field of natural language processing, the quality and diversity of training data are critical to model performance. Back-translated enhancement text can be generated by training two sequence-to-sequence (Seq2Seq) models, with a text convolutional neural network (TextCNN) used as a discriminator to optimize the generated results; alternatively, labeled enhancement text can be obtained through operations such as word replacement, insertion, and deletion, in combination with knowledge distillation.
Among these methods, text generated by back-translation often has disordered sentence structure and may have a high repetition rate with the original text, while sentences generated by word replacement and similar operations often carry semantics inconsistent with the original sentence, so the generated enhancement corpus cannot effectively expand the data set in semantic space. Conventional data enhancement methods, such as back-translation, data rotation, substitution, deletion, or insertion, therefore often sacrifice the authenticity or diversity of the data, thereby degrading model performance. As a result, current approaches to expanding the data included in a data set are of limited effect and efficiency.
Disclosure of Invention
The present application provides a data enhancement method, device, equipment, and storage medium, which at least solve the technical problems in the related art of poor effect and low efficiency when expanding the data included in a data set. The technical scheme of the application is as follows:
according to a first aspect to which the present application relates, there is provided a data enhancement method comprising: performing data preprocessing on the original data set to obtain a processed data set, wherein the data preprocessing is used for eliminating preset texts included in the original data set, and the preset texts comprise: preset stop words and repeated texts; respectively inputting a plurality of first texts included in the processed data set into a constructed target model, generating associated texts corresponding to each first text, and obtaining an enhanced data set including a plurality of second texts, wherein the plurality of second texts include a plurality of first texts and associated texts corresponding to each first text; performing data cleaning processing on a plurality of second texts included in the enhancement data set to obtain a cleaned enhancement data set, wherein the data cleaning processing is used for screening the plurality of second texts based on a preset threshold value; and combining the original data set and the cleaned enhanced data set to obtain a training data set with enhanced data.
According to the technical means, the method can be used for removing preset texts included in an original data set by carrying out data preprocessing on the original data set to obtain a processed data set, and respectively inputting a plurality of first texts included in the processed data set into a constructed target model to generate associated texts corresponding to each first text to obtain an enhanced data set including a plurality of second texts; further, data cleaning processing can be performed on a plurality of second texts included in the enhancement data set, so that the plurality of second texts are screened based on a preset threshold value, and a cleaned enhancement data set is obtained; and combining the original data set and the cleaned enhanced data set to obtain a training data set with enhanced data. Thereby improving the effect and efficiency of expanding the data included in the data set through the model.
In a possible implementation manner, the data preprocessing is performed on the original data set in the method to obtain a processed data set, which includes: acquiring a preset stop word list, and performing word segmentation on texts included in an original data set based on the preset stop word list to obtain a segmented data set; and judging the similarity of any two texts in the segmented data set, and removing repeated texts in the segmented data set based on a preset similarity threshold value to obtain a processed data set, wherein the repeated texts are texts with similarity larger than the preset similarity threshold value.
According to the technical means, the text included in the original data set can be subjected to word segmentation processing based on the obtained preset stop word list, so that a segmented data set is obtained, and further similarity judgment is performed on any two texts in the text included in the segmented data set, so that repeated texts in the segmented data set are removed based on a preset similarity threshold value, and the processed data set is obtained. By performing word segmentation processing on texts included in the original data set and eliminating repeated texts, the workload of subsequent steps can be reduced, and the data processing efficiency can be improved.
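As an illustrative sketch only (not part of the claimed scheme), the similarity-based deduplication described above might be implemented as follows, with `SequenceMatcher` standing in for whatever similarity measure a concrete implementation uses:

```python
# Pairwise-similarity deduplication: drop any text whose similarity to an
# already-kept text exceeds the preset similarity threshold.
from difflib import SequenceMatcher

def deduplicate(texts, threshold=0.9):
    kept = []
    for text in texts:
        # Keep the text only if it is not too similar to any text kept so far.
        if all(SequenceMatcher(None, text, prev).ratio() <= threshold
               for prev in kept):
            kept.append(text)
    return kept
```

Note the quadratic cost of all-pairs comparison; a real system at scale would likely use hashing or embedding-based near-duplicate detection instead.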
In one possible embodiment, the method further comprises: determining a target text from the processed data set, and generating a target prompt based on the target text and a preset model, wherein the target text is a document topic and/or a high-frequency word whose frequency in the processed data set is greater than a preset frequency; and constructing the target model based on the preset model and the target prompt.
According to the technical means, the target prompt can be generated from the target text determined in the processed data set together with the preset model, so that the target model is constructed based on the preset model and the target prompt; text is then processed by the constructed target model to obtain its corresponding associated text, thereby improving the effect and efficiency of expanding the data.
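As an illustrative sketch only (not part of the claimed scheme), the target prompt might be built from high-frequency words as follows; the template wording, the frequency cutoff, and the whitespace tokenization are all assumptions made for the example, not values specified by the application:

```python
# Hypothetical sketch: select high-frequency words from the processed data set
# and embed them in a prompt template that steers the preset (generative) model
# toward the data set's topic.
from collections import Counter

def build_prompt(segmented_texts, min_freq=2):
    # Count word frequencies across the whole processed data set.
    counts = Counter(word for text in segmented_texts for word in text.split())
    # Keep only words whose frequency exceeds the preset frequency.
    high_freq = [w for w, c in counts.items() if c > min_freq]
    # Assumed template wording; a real system would tune this.
    return "Generate a sentence about: " + ", ".join(high_freq)
```

For instance, over texts dominated by the words "car" and "engine", the returned prompt would name exactly those high-frequency words.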
In a possible implementation manner, the data cleaning process is performed on a plurality of second texts included in the enhancement data set in the method to obtain a cleaned enhancement data set, and the method includes: performing similarity judgment on any two second texts in a plurality of second texts included in the enhancement data set, and removing repeated texts in the plurality of second texts based on a preset similarity threshold value to obtain an enhancement data set after duplication removal, wherein the repeated texts are texts with similarity larger than the preset similarity threshold value; and performing data cleaning processing on the text included in the enhanced data set after the duplication removal to obtain a cleaned enhanced data set.
According to the technical means, similarity judgment can be performed on any two second texts in the plurality of second texts included in the enhancement data set, repeated texts in the plurality of second texts are removed based on a preset similarity threshold value, and the enhancement data set after duplication removal is obtained, so that data cleaning processing can be performed on the texts included in the enhancement data set after duplication removal, and the enhancement data set after cleaning is obtained. By eliminating repeated texts in the plurality of second texts, the workload of subsequent steps can be reduced, and the data processing efficiency can be improved.
In a possible implementation manner, performing the data cleaning processing on the text included in the deduplicated enhancement data set to obtain the cleaned enhancement data set includes: inputting the text included in the deduplicated enhancement data set into a preset discrimination model, and determining the confidence of each text included in the deduplicated enhancement data set; and deleting, based on the confidence of each text and a preset confidence threshold, the texts whose confidence is smaller than the preset confidence threshold from the deduplicated enhancement data set, to obtain the cleaned enhancement data set.
According to the technical means, the text included in the deduplicated enhancement data set can be input into a preset discrimination model to determine the confidence of each text; the texts whose confidence is smaller than the preset confidence threshold are then deleted from the deduplicated enhancement data set, so as to obtain the cleaned enhancement data set. Determining the confidence of each text and cleaning the deduplicated enhancement data set on that basis can improve both the efficiency and the accuracy of data cleaning.
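The confidence-based cleaning step can be sketched as follows; `score_fn` is a placeholder for the preset discrimination model, which the application does not specify further:

```python
# Illustrative confidence filter: score each text with a discrimination model
# and delete texts scoring below a preset confidence threshold.
def filter_by_confidence(texts, score_fn, threshold=0.5):
    # Keep only texts whose confidence meets or exceeds the threshold.
    return [t for t in texts if score_fn(t) >= threshold]
```

In use, `score_fn` would be the forward pass of the preset discrimination model; here any callable returning a confidence in [0, 1] works.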
According to a second aspect provided herein, there is provided a data enhancement device comprising a processing unit; the processing unit is used for carrying out data preprocessing on the original data set to obtain a processed data set, wherein the data preprocessing is used for eliminating preset texts included in the original data set, and the preset texts comprise: preset stop words and repeated texts; the processing unit is used for respectively inputting a plurality of first texts included in the processed data set into the constructed target model, generating associated texts corresponding to each first text, and obtaining an enhanced data set including a plurality of second texts, wherein the plurality of second texts include a plurality of first texts and associated texts corresponding to each first text; the processing unit is used for carrying out data cleaning processing on a plurality of second texts included in the enhancement data set to obtain a cleaned enhancement data set, and the data cleaning processing is used for screening the plurality of second texts based on a preset threshold value; and the processing unit is used for combining the original data set and the cleaned enhanced data set to obtain a training data set with enhanced data.
In one possible implementation, the data enhancement device further includes an acquisition unit; the acquisition unit is used for acquiring a preset stop word list; the processing unit is used for performing word segmentation on texts included in the original data set based on a preset stop word list to obtain a segmented data set; the processing unit is used for judging the similarity of any two texts in the segmented data set, eliminating repeated texts in the segmented data set based on a preset similarity threshold value, and obtaining a processed data set, wherein the repeated texts are texts with similarity larger than the preset similarity threshold value.
In a possible implementation manner, the processing unit is configured to determine a target text from the processed data set and generate a target prompt based on the target text and a preset model, wherein the target text is a document topic and/or a high-frequency word whose frequency in the processed data set is greater than a preset frequency; and the processing unit is configured to construct the target model based on the preset model and the target prompt.
In a possible implementation manner, the processing unit is configured to perform similarity judgment on any two second texts in the plurality of second texts included in the enhancement data set, reject repeated texts in the plurality of second texts based on a preset similarity threshold, and obtain a de-duplicated enhancement data set, where the repeated texts are texts with similarity greater than the preset similarity threshold; and the processing unit is used for carrying out data cleaning processing on the text included in the duplicate-removed enhancement data set to obtain a cleaned enhancement data set.
In a possible implementation manner, the processing unit is configured to input the text included in the deduplicated enhancement data set into a preset discrimination model and determine the confidence of each text included in the deduplicated enhancement data set; and the processing unit is configured to delete, based on the confidence of each text and a preset confidence threshold, the texts whose confidence is smaller than the preset confidence threshold from the deduplicated enhancement data set, to obtain the cleaned enhancement data set.
According to a third aspect provided by the present application, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the method of the first aspect and any of its possible embodiments described above.
According to a fourth aspect provided herein, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of the first aspect and any one of its possible embodiments.
According to a fifth aspect provided herein, there is provided a computer program product comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of the first aspect and any of its possible embodiments.
Therefore, the technical characteristics of the application have the following beneficial effects:
(1) The method comprises the steps that data preprocessing is conducted on an original data set to remove preset texts contained in the original data set, a processed data set is obtained, a plurality of first texts contained in the processed data set are respectively input into a constructed target model, so that associated texts corresponding to the first texts are generated, and an enhanced data set containing a plurality of second texts is obtained; further, data cleaning processing can be performed on a plurality of second texts included in the enhancement data set, so that the plurality of second texts are screened based on a preset threshold value, and a cleaned enhancement data set is obtained; and combining the original data set and the cleaned enhanced data set to obtain a training data set with enhanced data. Thereby improving the effect and efficiency of expanding the data included in the data set through the model.
(2) The method comprises the steps of performing word segmentation on texts included in an original data set based on an acquired preset stop word list to obtain a segmented data set, and further performing similarity judgment on any two texts in the texts included in the segmented data set to reject repeated texts in the segmented data set based on a preset similarity threshold value to obtain the processed data set. By performing word segmentation processing on texts included in the original data set and eliminating repeated texts, the workload of subsequent steps can be reduced, and the data processing efficiency can be improved.
(3) The target prompt can be generated from the target text determined in the processed data set together with the preset model, so that the target model is constructed based on the preset model and the target prompt; text is then processed by the constructed target model to obtain its corresponding associated text, so that the effect and efficiency of expanding the data can be improved.
(4) And judging the similarity of any two second texts in the plurality of second texts included in the enhancement data set, so as to eliminate repeated texts in the plurality of second texts based on a preset similarity threshold value, and obtain the enhancement data set after the duplication removal, so that the data cleaning treatment can be performed on the texts included in the enhancement data set after the duplication removal, and the enhancement data set after the cleaning is obtained. By eliminating repeated texts in the plurality of second texts, the workload of subsequent steps can be reduced, and the data processing efficiency can be improved.
(5) The text included in the deduplicated enhancement data set can be input into a preset discrimination model, and the confidence of each text included in the deduplicated enhancement data set is determined; the texts whose confidence is smaller than the preset confidence threshold are then deleted from the deduplicated enhancement data set, so as to obtain the cleaned enhancement data set. Determining the confidence of each text and cleaning the deduplicated enhancement data set on that basis can improve both the efficiency and the accuracy of data cleaning.
It should be noted that, for the technical effects of any implementation manner of the second to fifth aspects, reference may be made to the technical effects of the corresponding implementation in the first aspect, which are not repeated here.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application and do not constitute an undue limitation on the application.
FIG. 1 is a schematic diagram of a data enhancement system, shown in accordance with an exemplary embodiment;
FIG. 2 is a flow chart illustrating a method of data enhancement according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating yet another data enhancement method according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating yet another data enhancement method according to an exemplary embodiment;
FIG. 5 is a flowchart illustrating yet another data enhancement method according to an exemplary embodiment;
FIG. 6 is a flowchart illustrating yet another data enhancement method according to an exemplary embodiment;
FIG. 7 is a flowchart illustrating yet another data enhancement method according to an exemplary embodiment;
FIG. 8 is a flowchart illustrating yet another data enhancement method according to an exemplary embodiment;
FIG. 9 is a block diagram of a data enhancement device, according to an example embodiment;
fig. 10 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The data enhancement method provided by the embodiment of the application can be applied to a data enhancement system. Fig. 1 shows a schematic diagram of a structure of the data enhancement system. As shown in fig. 1, the data enhancement system 10 includes: an electronic device 11 and a server 12.
The data enhancement system 10 may be used in the internet of things, and may include a plurality of central processing units (CPUs), a plurality of memories, a storage device storing a plurality of operating systems, and other hardware.
The electronic device 11 can be used for the internet of things and can be used for executing the data enhancement method provided by the application, and the electronic device 11 realizes enhancement of a data set through data interaction with the server 12.
The server 12 may be used for the internet of things, and is used for storing data, for example, the server 12 is a server corresponding to a data enhancement method, and is used for storing a data set to be processed, and performing data interaction with the electronic device 11, so as to implement the data enhancement method.
For ease of understanding, the data enhancement method provided in the present application is specifically described below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a data enhancement method according to an exemplary embodiment, including the following S201-S204, as shown in fig. 2:
s201, data preprocessing is carried out on the original data set, and a processed data set is obtained.
The data preprocessing is used for eliminating preset texts included in the original data set, wherein the preset texts comprise: preset stop words and repeated text.
It should be noted that, the above-mentioned original data set is a training data set for training a certain model, and by using the method in the embodiment of the present application, the original data set may be expanded and enhanced, so as to improve accuracy of a training result obtained by training a certain model.
Optionally, the method in the embodiment of the application mainly includes two parts, data preparation and data enhancement: the data preparation mainly covers work such as data deduplication and data cleaning on the original data set, and the data enhancement mainly covers generating enhancement data from the original data set through prompts.
In one possible implementation manner, after a training data set (i.e., an original data set) for training a certain model is obtained, data preprocessing may be performed on the original data set to remove preset stop words and repeated text included in the original data set, so as to obtain a processed data set.
S202, respectively inputting a plurality of first texts included in the processed data set into a constructed target model, and generating associated texts corresponding to each first text to obtain an enhanced data set including a plurality of second texts.
The plurality of second texts comprise a plurality of first texts and associated texts corresponding to the first texts.
Optionally, in the process of data enhancement, a target model needs to be built in advance, and for a specific step of building the target model, reference may be made to descriptions in steps S401 to S402 below, which are not described herein.
It should be noted that, the plurality of first texts included in the processed data set may be modified by the target model, the first texts in the processed data set may be input into the target model, and the text data may be modified by using the target model, so as to generate the associated text corresponding to each first text.
Alternatively, the associated text corresponding to the first text may be understood as: text having the same meaning as the first text, text having a similarity to the first text greater than a preset similarity, text having the same format as the first text, and the like.
For example, assuming that the first text is "Xiao Ming is running", the associated texts generated for it by the target model may be "Xiao Ming is jogging", "Xiao Ming is exercising", and the like.
It will be appreciated that after obtaining the associated text corresponding to each first text, the plurality of first texts and the associated text corresponding to each first text need to be combined to obtain an enhanced data set comprising a plurality of second texts.
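The combination of first texts and their associated texts into the set of second texts can be sketched as follows; `target_model` is a placeholder for the constructed model described above, assumed here to return a list of associated texts per input:

```python
# Sketch of assembling the enhanced data set: each first text is fed to the
# target model, and the second texts are the union of the first texts and
# their generated associated texts.
def build_enhanced_dataset(first_texts, target_model):
    second_texts = []
    for text in first_texts:
        second_texts.append(text)                # keep the original first text
        second_texts.extend(target_model(text))  # add its associated texts
    return second_texts
```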
And S203, performing data cleaning processing on a plurality of second texts included in the enhancement data set to obtain a cleaned enhancement data set.
The data cleaning process is used for screening the plurality of second texts based on a preset threshold value.
Optionally, in the process of enhancing the data, the data is specifically required to be cleaned so as to remove repeated texts in the plurality of second texts, and according to the confidence level of the texts, the texts with lower confidence level are removed.
S204, combining the original data set and the cleaned enhancement data set to obtain a training data set with enhanced data.
It can be appreciated that after the cleaned enhancement data set is obtained, the cleaned enhancement data set and the original data set are combined, so that the original data set can be expanded, and the training data set after data enhancement can be obtained, so that a larger and richer training data set can be formed, and a certain model can be trained through the training data set after data enhancement.
In the embodiment of the application, data preprocessing is performed on the original data set to remove the preset texts it includes, yielding a processed data set; the plurality of first texts included in the processed data set are then respectively input into the constructed target model to generate the associated text corresponding to each first text, yielding an enhancement data set comprising a plurality of second texts. Further, data cleaning processing is performed on the plurality of second texts included in the enhancement data set, screening them based on a preset threshold value to obtain a cleaned enhancement data set. Finally, the original data set and the cleaned enhancement data set are combined to obtain a training data set after data enhancement. This improves the effect and efficiency of expanding the data included in a data set through a model.
In some embodiments, in order to perform data preprocessing on an original data set to obtain a processed data set, as shown in fig. 3, in a data enhancement method provided in the embodiments of the present application, the method in S201 may specifically include S301 to S302:
S301, acquiring a preset stop word list, and performing word segmentation on texts included in an original data set based on the preset stop word list to obtain a segmented data set.
Optionally, as shown in fig. 4, in the data preparation stage, a preset stop word list needs to be obtained when preprocessing the original data set. When natural language data (or text) is processed, certain words or characters need to be automatically filtered out; these are called stop words.
It should be noted that stop words are entered manually rather than generated automatically, and the entered stop words form a stop word list. For a given purpose, any type of word may be selected as a stop word. Broadly speaking, stop words fall into two categories. One category consists of the functional words of human language, which are extremely common and carry little concrete meaning compared with other words; for search engines, such words cause problems when the phrase to be searched contains functional words or compound nouns. The other category consists of content words that are used so widely that a search engine cannot guarantee truly relevant results for them; they do little to help narrow a search and also reduce its efficiency, so these words are typically removed from the query to improve search performance.
Optionally, word segmentation of the text can be performed based on the stop word list, the similarity between texts can be judged, and repeated samples in the original data can be removed based on a threshold value. Specifically, each text is segmented to obtain a word sequence, the words in the word sequence are compared with the stop word list, and the stop words included in the word sequence are removed, so as to obtain the segmented data set.
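The segmentation-and-filtering of S301 can be sketched as below. Two simplifications are assumed for illustration: whitespace tokenization stands in for a real word-segmentation step (a Chinese text would need a dedicated segmenter), and the stop word list is a tiny hypothetical one, not the patent's preset list.

```python
# Minimal sketch of S301, assuming whitespace tokenization in place of true
# word segmentation and a small hypothetical stop word list.
STOP_WORDS = {"is", "the", "a", "an"}

def segment_and_filter(text, stop_words=STOP_WORDS):
    word_sequence = text.lower().split()  # word segmentation (simplified)
    return [w for w in word_sequence if w not in stop_words]

tokens = segment_and_filter("Xiao Ming is running in the park")
```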
S302, similarity judgment is carried out on any two texts in the segmented data set, repeated texts in the segmented data set are removed based on a preset similarity threshold value, and the processed data set is obtained.
The repeated text is text with similarity larger than a preset similarity threshold value.
Optionally, a distance measurement mode (such as cosine distance, edit distance, etc.) may be selected, and based on a preset similarity threshold, repeated text in the segmented data set is deleted.
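The de-duplication of S302 can be sketched with one of the distance measures mentioned above, cosine similarity over bag-of-words vectors; the 0.9 threshold below is an illustrative value for the preset similarity threshold, which the patent leaves unspecified.

```python
import math
from collections import Counter

# Sketch of S302: texts whose cosine similarity to an already kept text
# exceeds the preset threshold are treated as duplicates and discarded.
def cosine_similarity(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def remove_duplicates(texts, threshold=0.9):
    kept = []
    for text in texts:
        if all(cosine_similarity(text, k) <= threshold for k in kept):
            kept.append(text)
    return kept

deduped = remove_duplicates(["a b c", "a b c", "x y z"])
```

The same routine applies unchanged to the de-duplication of the enhancement data set in S501.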
It should be noted that, for the plurality of texts included in the segmented data set, the similarity between each text and the texts already retained in the de-duplicated data set needs to be determined one by one.
In this embodiment of the present application, word segmentation may be performed on the texts included in the original data set based on the acquired preset stop word list to obtain a segmented data set; similarity judgment is then performed on any two texts in the segmented data set so that repeated texts can be rejected based on a preset similarity threshold, yielding the processed data set. By segmenting the texts included in the original data set and eliminating repeated texts, the workload of the subsequent steps can be reduced and the data processing efficiency can be improved.
In some embodiments, in order to construct the target model, as shown in fig. 5, the data enhancement method provided in the embodiments of the present application may further include S401 to S402:
S401, determining a target text from the processed data set, and generating a target prompt based on the target text and a preset model.
The target text is a document theme and/or a high-frequency word with frequency greater than a preset frequency included in the processed data set.
S402, constructing a target model based on a preset model and a target template.
Alternatively, for the processed dataset, text such as high frequency words or document topics may be extracted from the processed dataset.
Optionally, in the data enhancement stage, a suitable preset model (i.e. a pre-training language model) can be selected according to the requirement of the task, and a corresponding target template is generated by combining the high-frequency word and the document theme, so that the preset model is guided by using the target template to obtain the target model.
It should be noted that the specific template should be determined according to the specific task type and data. For example, a classification task may use the template "generate text consistent with the current text topic A", where the current text topic A is determined from the predetermined high-frequency words and document topics and substituted into the template.
Alternatively, the text topic may be determined manually by an algorithm engineer familiar with the data, or by natural language processing (Natural Language Processing, NLP) algorithms, for example by extracting topics with Latent Dirichlet Allocation (LDA) and then performing a certain amount of screening on the results.
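Prompt construction in S401 can be sketched as below. Both the template wording and the `min_freq` cutoff are illustrative assumptions: the patent specifies only that high-frequency words and document topics are combined into a prompt that guides the preset model.

```python
from collections import Counter

# Hypothetical sketch of S401: extract high-frequency words from the
# processed data set and substitute a topic into a prompt template.
def build_prompt(texts, min_freq=2):
    counts = Counter(w for t in texts for w in t.split())
    high_freq = [w for w, c in counts.items() if c >= min_freq]
    topic = high_freq[0] if high_freq else "general"
    return f"Generate text consistent with the current text topic '{topic}'."

prompt = build_prompt(["sports news today", "sports results today"])
```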
In the embodiment of the application, a target template can be generated from the target text determined from the processed data set and the preset model, so that the target model is constructed based on the preset model and the target template. Processing text through the constructed target model to obtain the associated text corresponding to each text can thus improve the effect and efficiency of expanding the data.
In some embodiments, in order to perform data cleaning processing on a plurality of second texts included in the enhancement data set to obtain a cleaned enhancement data set, as shown in fig. 6, in a data enhancement method provided in the embodiment of the present application, the method in S203 may specifically include S501-S502:
S501, similarity judgment is carried out on any two second texts in the plurality of second texts included in the enhancement data set, and repeated texts among the plurality of second texts are removed based on a preset similarity threshold value, so as to obtain a de-duplicated enhancement data set.
The repeated text is text with similarity larger than a preset similarity threshold value.
Optionally, in the data enhancement stage, the enhancement data set first needs to be de-duplicated to obtain a de-duplicated enhancement data set, after which the enhancement data set is further subjected to data cleaning.
S502, performing data cleaning processing on texts included in the enhancement data set after the duplication removal to obtain a cleaned enhancement data set.
Optionally, the enhancement data set is subjected to data cleaning because the enhanced data may contain some noise data; the data can be screened by an artificial intelligence algorithm in combination with the current task.
In the embodiment of the present application, similarity determination may be performed on any two second texts among the plurality of second texts included in the enhancement data set, so that repeated texts can be rejected based on a preset similarity threshold to obtain a de-duplicated enhancement data set, on whose texts data cleaning processing is then performed to obtain the cleaned enhancement data set. By eliminating repeated texts among the plurality of second texts, the workload of the subsequent steps can be reduced and the data processing efficiency can be improved.
In some embodiments, in order to perform data cleaning processing on text included in the enhancement data set after de-duplication, to obtain a cleaned enhancement data set, as shown in fig. 7, in a data enhancement method provided in the embodiments of the present application, the method in S502 may specifically include S601-S602:
S601, inputting the texts included in the de-duplicated enhancement data set into a preset discrimination model, and determining the confidence level of each text included in the de-duplicated enhancement data set.
Alternatively, as shown in FIG. 8, a discriminant model may be trained by combining the original data set with an existing task, such as a text classification task.
Further, the text included in the enhanced data set after the duplication removal is input into a preset discrimination model, so as to obtain a prediction result (i.e. the confidence level of each text).
S602, deleting the texts whose confidence level is smaller than a preset confidence threshold value from the de-duplicated enhancement data set, based on the confidence level of each text included in the de-duplicated enhancement data set and the preset confidence threshold value, so as to obtain the cleaned enhancement data set.
Optionally, the confidence level of each text is compared with the preset confidence threshold value, so that texts whose confidence level is smaller than the preset confidence threshold value are deleted from the de-duplicated enhancement data set, yielding the cleaned enhancement data set.
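The threshold comparison of S601-S602 can be sketched as below. The scoring lambda is a trivial hypothetical stand-in; in the method itself, the confidence of each text would come from the preset discrimination model's predictions.

```python
# Sketch of S601-S602: keep only texts whose confidence, as assigned by a
# scoring function standing in for the preset discrimination model, meets
# the preset confidence threshold.
def filter_by_confidence(texts, score_fn, threshold=0.5):
    return [t for t in texts if score_fn(t) >= threshold]

toy_score = lambda t: 0.9 if "valid" in t else 0.1  # hypothetical stand-in
cleaned = filter_by_confidence(["valid sample", "noisy sample"], toy_score)
```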
It should be noted that, the preset confidence threshold should be determined according to different tasks to obtain the final enhanced data set corresponding to the different tasks.
In the embodiment of the application, the texts included in the de-duplicated enhancement data set can be input into a preset discrimination model to determine the confidence level of each text included in the de-duplicated enhancement data set; texts whose confidence level is smaller than a preset confidence threshold value are then deleted from the de-duplicated enhancement data set based on the confidence level of each text and the preset confidence threshold value, yielding the cleaned enhancement data set. By determining the confidence level of each text and cleaning the texts included in the de-duplicated enhancement data set based on it, the efficiency and accuracy of data cleaning can be improved.
The embodiment of the invention provides a method for enhancing text data by using a large-scale pre-trained language model. Exploiting the text-generation capability of such a model, the pre-trained language model is guided by a specific prompt and then used to generate an enhancement data set from the data in the original data set; the enhancement data set is then algorithmically cleaned and de-duplicated to obtain the final enhanced data set. Because a pre-trained language model is used, the diversity and quality of the data can be improved and the performance of the downstream model raised, while the cleaned enhancement data can be obtained semi-automatically, saving labor cost. In practical applications, a suitable pre-trained language model should be selected and the prompt adjusted according to the requirements of the task and the characteristics of the data set, and the enhancement-data cleaning scheme should be determined per task to obtain a better effect.
The foregoing description of the solution provided in the embodiments of the present application has mainly been presented in terms of the method. To achieve the above functions, the data enhancement device or the electronic apparatus includes hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer-software-driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, according to the above method, the data enhancement device or the electronic device may be exemplarily divided into functional modules, for example, the data enhancement device or the electronic device may include each functional module corresponding to each functional division, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
Fig. 9 is a block diagram illustrating a data enhancement device according to an exemplary embodiment. Referring to fig. 9, the data enhancement device 900 includes: a processing unit 901.
The processing unit 901 is configured to perform data preprocessing on an original data set to obtain a processed data set, where the data preprocessing is used to reject a preset text included in the original data set, and the preset text includes: preset stop words and repeated texts;
the processing unit 901 is configured to input a plurality of first texts included in the processed dataset into a constructed target model, generate an associated text corresponding to each first text, and obtain an enhanced dataset including a plurality of second texts, where the plurality of second texts include a plurality of first texts and associated texts corresponding to each first text;
a processing unit 901, configured to perform data cleaning processing on a plurality of second texts included in the enhancement data set, to obtain a cleaned enhancement data set, where the data cleaning processing is configured to screen the plurality of second texts based on a preset threshold;
and the processing unit 901 is configured to combine the original data set and the cleaned enhanced data set to obtain a training data set with enhanced data.
In a possible implementation manner, as shown in fig. 9, the data enhancement device 900 further includes an obtaining unit 902; an obtaining unit 902, configured to obtain a preset stop word list; the processing unit 901 is configured to perform word segmentation on a text included in an original data set based on a preset stop word list, so as to obtain a segmented data set; the processing unit 901 is configured to perform similarity judgment on any two texts in the text included in the segmented data set, reject repeated texts in the segmented data set based on a preset similarity threshold, and obtain a processed data set, where the repeated texts are texts with similarity greater than the preset similarity threshold.
In a possible implementation manner, the processing unit 901 is configured to determine a target text from the processed dataset, and generate a target prompt based on the target text and a preset model, where the target text is a document theme and/or a high-frequency word with a frequency greater than a preset frequency included in the processed dataset; the processing unit 901 is configured to construct a target model based on the preset model and the target template.
In a possible implementation manner, the processing unit 901 is configured to perform similarity judgment on any two second texts in the plurality of second texts included in the enhancement data set, and reject repeated texts in the plurality of second texts based on a preset similarity threshold, so as to obtain an enhancement data set after deduplication, where the repeated texts are texts with similarity greater than the preset similarity threshold; and the processing unit 901 is configured to perform data cleaning processing on the text included in the enhancement data set after the duplication removal, so as to obtain a cleaned enhancement data set.
In a possible implementation manner, the processing unit 901 is configured to input the text included in the enhanced data set after the duplication removal into a preset discrimination model, and determine a confidence level of each text included in the enhanced data set after the duplication removal; the processing unit 901 is configured to delete, from the deduplicated enhanced data set, a text with a confidence level smaller than a preset confidence level threshold value, based on the confidence level of each text included in the deduplicated enhanced data set and the preset confidence level threshold value, and obtain a cleaned enhanced data set.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 10 is a block diagram of an electronic device, according to an example embodiment. As shown in fig. 10, the electronic device 1100 includes, but is not limited to: a processor 1101 and a memory 1102.
The memory 1102 is used for storing executable instructions of the processor 1101. It will be appreciated that the processor 1101 described above is configured to execute instructions to implement the data enhancement method of the above embodiments.
It should be noted that the electronic device structure shown in fig. 10 is not limited to the electronic device, and the electronic device may include more or less components than those shown in fig. 10, or may combine some components, or may have different arrangements of components, as will be appreciated by those skilled in the art.
The processor 1101 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 1102, and invoking data stored in the memory 1102, thereby performing overall monitoring of the electronic device. The processor 1101 may include one or more processing units. Alternatively, the processor 1101 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1101.
Memory 1102 may be used to store software programs as well as various data. The memory 1102 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs (such as an acquisition unit, a processing unit, etc.) required for at least one functional module, and the like. In addition, memory 1102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
In an exemplary embodiment, a computer readable storage medium is also provided, such as a memory 1102, comprising instructions executable by the processor 1101 of the electronic device 1100 to implement the data enhancement method of the above-described embodiments.
In actual implementation, the functions of the processing unit 901 and the acquiring unit 902 in fig. 9 may be implemented by the processor 1101 in fig. 10 calling a computer program stored in the memory 1102. For specific implementation, reference may be made to the description of the data enhancement method in the above embodiment, and details are not repeated here.
Alternatively, the computer readable storage medium may be a non-transitory computer readable storage medium, for example, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, the present application also provides a computer program product comprising one or more instructions executable by the processor 1101 of the electronic device 1100 to perform the data enhancement method of the above-described embodiment.
It should be noted that, when the instructions in the computer readable storage medium or one or more instructions in the computer program product are executed by the processor of the electronic device, the respective processes of the foregoing data enhancement method embodiment are implemented, and the technical effects that are the same as those of the foregoing data enhancement method can be achieved, so that repetition is avoided, and no further description is provided herein.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of functional modules is illustrated as an example; in practical application, the above functions may be allocated to different functional modules as needed, i.e. the internal structure of the apparatus may be divided into different functional modules, so as to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed across a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment scheme.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method of data enhancement, comprising:
performing data preprocessing on an original data set to obtain a processed data set, wherein the data preprocessing is used for eliminating preset texts included in the original data set, and the preset texts comprise: preset stop words and repeated texts;
respectively inputting a plurality of first texts included in the processed data set into a constructed target model, and generating associated texts corresponding to each first text to obtain an enhanced data set including a plurality of second texts, wherein the plurality of second texts include the plurality of first texts and associated texts corresponding to each first text;
performing data cleaning processing on the plurality of second texts included in the enhancement data set to obtain a cleaned enhancement data set, wherein the data cleaning processing is used for screening the plurality of second texts based on a preset threshold value;
And merging the original data set and the cleaned enhanced data set to obtain a training data set with enhanced data.
2. The method of claim 1, wherein the performing data preprocessing on the original data set to obtain a processed data set comprises:
acquiring a preset stop word list, and performing word segmentation on texts included in the original data set based on the preset stop word list to obtain a segmented data set;
and judging the similarity of any two texts in the segmented data set, and removing repeated texts in the segmented data set based on a preset similarity threshold value to obtain the processed data set, wherein the repeated texts are texts with similarity larger than the preset similarity threshold value.
3. The method according to claim 2, wherein the method further comprises:
determining a target text from the processed data set, and generating a target prompt based on the target text and a preset model, wherein the target text is a document theme and/or a high-frequency word with frequency greater than a preset frequency included in the processed data set;
and constructing and obtaining the target model based on the preset model and the target template.
4. A method according to any one of claims 1 to 3, wherein said performing a data cleansing process on said plurality of second texts included in said enhancement data set to obtain a cleansed enhancement data set comprises:
performing similarity judgment on any two second texts in the plurality of second texts included in the enhancement data set, and removing repeated texts in the plurality of second texts based on a preset similarity threshold value to obtain an enhancement data set after duplication removal, wherein the repeated texts are texts with similarity larger than the preset similarity threshold value;
and carrying out data cleaning treatment on the text included in the duplicate-removed enhancement data set to obtain the cleaned enhancement data set.
5. The method of claim 4, wherein performing a data cleaning process on text included in the de-duplicated enhancement data set to obtain the cleaned enhancement data set, comprises:
inputting the text included in the enhanced data set after the duplication removal into a preset judging model, and determining the confidence level of each text included in the enhanced data set after the duplication removal;
and deleting the texts whose confidence level is smaller than a preset confidence threshold value from the de-duplicated enhancement data set based on the confidence level of each text included in the de-duplicated enhancement data set and the preset confidence threshold value, so as to obtain the cleaned enhancement data set.
6. A data enhancement device, the data enhancement device comprising: a processing unit;
the processing unit is configured to perform data preprocessing on an original data set to obtain a processed data set, where the data preprocessing is used to reject a preset text included in the original data set, and the preset text includes: preset stop words and repeated texts;
the processing unit is used for respectively inputting a plurality of first texts included in the processed data set into the constructed target model, generating associated texts corresponding to each first text, and obtaining an enhanced data set including a plurality of second texts, wherein the plurality of second texts include the plurality of first texts and the associated texts corresponding to each first text;
the processing unit is used for performing data cleaning processing on the plurality of second texts included in the enhancement data set to obtain a cleaned enhancement data set, and the data cleaning processing is used for screening the plurality of second texts based on a preset threshold value;
and the processing unit is used for merging the original data set and the cleaned enhancement data set to obtain a training data set with enhanced data.
7. An electronic device, comprising: a processor and a memory; wherein the memory is configured to store one or more programs, the one or more programs comprising computer-executable instructions that, when executed by the electronic device, cause the electronic device to perform the method of any of claims 1-5.
8. A computer readable storage medium, characterized in that, when computer-executable instructions stored in the computer readable storage medium are executed by a processor of an electronic device, the electronic device is capable of performing the method of any one of claims 1 to 5.
CN202310265899.9A 2023-03-17 2023-03-17 Data enhancement method, device, equipment and storage medium Pending CN116383644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310265899.9A CN116383644A (en) 2023-03-17 2023-03-17 Data enhancement method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310265899.9A CN116383644A (en) 2023-03-17 2023-03-17 Data enhancement method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116383644A true CN116383644A (en) 2023-07-04

Family

ID=86964884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310265899.9A Pending CN116383644A (en) 2023-03-17 2023-03-17 Data enhancement method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116383644A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117041168A (en) * 2023-10-09 2023-11-10 常州楠菲微电子有限公司 QoS queue scheduling realization method and device, storage medium and processor

Similar Documents

Publication Publication Date Title
CN107729392B (en) Text structuring method, device and system and non-volatile storage medium
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN107463548B (en) Phrase mining method and device
US11907659B2 (en) Item recall method and system, electronic device and readable storage medium
JP2007531948A (en) Search method for content, especially extracted parts common to two computer files
CN110334343B (en) Method and system for extracting personal privacy information in contract
CN110851714A (en) Text recommendation method and system based on heterogeneous topic model and word embedding model
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN111198946A (en) Network news hotspot mining method and device
CN116383644A (en) Data enhancement method, device, equipment and storage medium
Ju et al. Leveraging information bottleneck for scientific document summarization
CN111950261B (en) Method, device and computer readable storage medium for extracting text keywords
CN110929022A (en) Text abstract generation method and system
CN116580283B (en) Image prompt word generation method and device, electronic equipment and storage medium
CN108475265B (en) Method and device for acquiring unknown words
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
CN110597982A (en) Short text topic clustering algorithm based on word co-occurrence network
CN114610576A (en) Log generation monitoring method and device
CN111339287B (en) Abstract generation method and device
CN108256055B (en) Topic modeling method based on data enhancement
CN111079448A (en) Intention identification method and device
CN111159996A (en) Short text set similarity comparison method and system based on improved text fingerprint algorithm
CN111488432A (en) Sentiment analysis method, equipment and storage medium based on user comments
CN117473983B (en) Unknown word collection method and device based on fuzzy matching and mutual information
CN115577069A (en) Text processing method, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination