CN111859953B - Training data mining method and device, electronic equipment and storage medium - Google Patents
Training data mining method and device, electronic equipment and storage medium
- Publication number
- CN111859953B (application CN202010576205.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- training data
- training
- original
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a training data mining method and device, an electronic device, and a storage medium, relating to the field of natural language processing based on artificial intelligence. The specific implementation scheme is as follows: collect a plurality of pieces of unsupervised text as original data to form an original data set; acquire a preconfigured data screening rule set comprising a plurality of preconfigured data screening rules; and mine a plurality of pieces of training data from the original data set according to each data screening rule in the set to form a training data set. Compared with manually labeled training data in the prior art, the method and device can mine training data automatically and intelligently without manual labeling, which effectively saves the acquisition cost of training data and improves the acquisition efficiency.
Description
Technical Field
The present application relates to the field of computer technologies, in particular to natural language processing based on artificial intelligence, and more particularly to a training data mining method and device, an electronic device, and a storage medium.
Background
In recent years, pre-trained models represented by the Bidirectional Encoder Representations from Transformers (BERT) model have provided a two-stage training paradigm of Pre-training + Fine-tuning, greatly improving the effect of various natural language processing (NLP) tasks. The BERT model adopts a deep Transformer model structure, uses massive unsupervised text to learn context-dependent representations, and solves various NLP tasks such as text matching, text generation, emotion classification, text summarization, question answering, and retrieval in a general, unified manner.
Wherein, pre-training refers to constructing self-supervised learning tasks, such as shape filling, sentence sorting, etc., by using massive unlabeled text as training data. The Fine-tuning refers to performing task adaptation by using a small amount of task text with manual labels as training data, and obtaining a specific natural language processing task model.
In the existing training process of the Fine-tuning stage, manually labeled training data are used. However, manual labeling is expensive and often requires experienced technical experts, so the manually labeled training data of the existing Fine-tuning stage are costly to acquire and the acquisition efficiency is very low.
Disclosure of Invention
In order to solve the above technical problems, the application provides a training data mining method and device, an electronic device, and a storage medium.
According to an aspect of the present application, there is provided a training data mining method, wherein the method includes:
collecting a plurality of pieces of unsupervised text serving as original data to form an original data set;
acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
and mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
According to another aspect of the present application, there is provided an apparatus for mining training data, wherein the apparatus includes:
the acquisition module is used for acquiring a plurality of pieces of unsupervised text serving as original data to form an original data set;
the acquisition module is used for acquiring a preset data screening rule set, wherein the data screening rule set comprises a plurality of preset data screening rules;
and the mining module is used for mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
According to still another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to yet another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the technology of the application, compared with manually labeled training data in the prior art, training data can be mined automatically and intelligently without manual labeling, which effectively saves the acquisition cost of training data and improves the acquisition efficiency.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a schematic diagram of a first embodiment according to the present application;
FIG. 2 is a schematic diagram of a second embodiment according to the present application;
FIG. 3 is an exemplary diagram of the embodiment shown in FIG. 2;
FIG. 4 is a schematic semantic representation of the semantic representation model of the present application;
FIG. 5 is a training schematic of the semantic representation model of the present application;
FIG. 6 is a schematic diagram of a third embodiment according to the present application;
FIG. 7 is a schematic diagram of a fourth embodiment according to the application;
FIG. 8 is a block diagram of an electronic device for implementing a training data mining method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of a first embodiment according to the present application; as shown in fig. 1, the present application provides a training data mining method, which specifically includes the following steps:
s101, collecting a plurality of pieces of unsupervised text serving as original data to form an original data set;
s102, acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
s103, mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
The execution subject of the training data mining method of this embodiment may be a training data mining device, which may be an electronic entity or an application implemented in software; in use, the device runs on a computer device to mine training data.
In this embodiment, a large number of unsupervised texts can be collected from the network, and each unsupervised text corresponds to one piece of original data, so that several pieces of original data can be obtained and added into one data set to form the original data set. Alternatively, a large amount of unlabeled text in a certain field provided by the model user may be collected and added as raw data to the raw data set. The model user of the present embodiment may be a user of a model to be trained by the mined training data.
The data screening rules of the data screening rule set in this embodiment may be configured by the model user, who summarizes the patterns of the training data to be mined based on his own experience. That is, each preconfigured data screening rule may also be called an artificial prior rule.
Further, in this embodiment, according to each data filtering rule in the data filtering rule set, a plurality of pieces of training data may be mined from the original data set to form a training data set. Wherein each piece of training data mined is in accordance with a data screening rule. In this way, by adopting each data screening rule in the data screening rule set, a plurality of pieces of training data can be mined from the original data set to form a training data set.
According to the training data mining method of this embodiment, an original data set is formed by collecting a plurality of pieces of unsupervised text as original data; a preconfigured data screening rule set comprising a plurality of data screening rules is acquired; and a plurality of pieces of training data are mined from the original data set according to each data screening rule to form a training data set. Compared with manually labeled training data in the prior art, training data can be mined automatically and intelligently without manual labeling, which effectively saves the acquisition cost of training data and improves the acquisition efficiency.
FIG. 2 is a schematic diagram of a second embodiment according to the present application; as shown in fig. 2, the training data mining method of the present embodiment further describes the technical solution of the present application in more detail on the basis of the technical solution of the embodiment shown in fig. 1. As shown in fig. 2, the training data mining method of the present embodiment may specifically include the following steps:
s201, collecting a plurality of pieces of unsupervised text serving as original data to form an original data set U;
that is, in this embodiment, each piece of original data corresponds to one piece of unsupervised text. And the original data and the training data in this embodiment are text data.
S202, acquiring a preconfigured data screening rule set P, wherein the data screening rule set P comprises a plurality of preconfigured data screening rules;
optionally, in this embodiment, each data screening rule may be expressed as a regular expression, and/or each data screening rule carries a corresponding label. The label is used to annotate the screened training data; for example, in an emotion analysis task it may serve as the label of emotional tendency, such as positive or negative evaluation. That is, each data screening rule in the data screening rule set is used to screen supervised, labeled data from the original data set as training data.
For example, in an emotion analysis task, when mining training data for user evaluations of lodging, the data screening rules for positive evaluation may include rules such as "quite quiet", "quite affordable", "will come again", "breakfast/lunch/dinner was very good", or "very near to [place], very convenient", and so on. The data screening rules for the corresponding negative evaluation may include rules such as "will not order again", "bad accommodation", "too noisy", or "price [number] yuan, too expensive". In practical application, data screening rules for the corresponding training data of various NLP tasks such as semantic matching, machine translation, and dialogue understanding can be configured in a similar manner, and are not described in detail herein.
S203, mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set P to form a training data set A;
specifically, when mining training data from the original data set with a data screening rule, it is judged whether a piece of original data hits the rule; if so, the original data is extracted as training data, and the label of the data screening rule is used as the label of that training data. For example, the original data "I liked the hotel very much, quite quiet" hits the positive-evaluation data screening rule containing "quite quiet"; the original data and the corresponding label "positive evaluation" together form one piece of training data. In a similar manner, each data screening rule in the data screening rule set may be used to mine a plurality of pieces of training data from the original data set to form the training data set A.
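The screening in steps S201-S203 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the rules, phrases, and sample texts are hypothetical, with each rule expressed as a regular expression paired with its label.

```python
import re

# Hypothetical screening rules: a compiled regular expression and the label
# it carries (here, emotional tendency for an emotion analysis task).
screening_rules = [
    (re.compile(r"quite quiet|quite affordable|come again"), "positive"),
    (re.compile(r"no longer ordered|too noisy|too expensive"), "negative"),
]

def mine_training_data(raw_dataset, rules):
    """Return (text, label) pairs for every raw text that hits a rule."""
    mined = []
    for text in raw_dataset:
        for pattern, label in rules:
            if pattern.search(text):
                mined.append((text, label))
                break  # one rule hit is enough to keep the sample
    return mined

raw = [
    "I liked the hotel very much, quite quiet",
    "The room was too noisy at night",
    "Check-in took five minutes",   # hits no rule; stays in the remaining set
]
training_set = mine_training_data(raw, screening_rules)
```

Texts hitting no rule remain in the original data set and become the "remaining data set" used later for expansion.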
Optionally, in order to ensure the accuracy of the mined training data, a manual verification manner may be used herein to verify the accuracy of the mined training data, so as to ensure the accuracy of the training data in the training data set a.
S204, acquiring similar data most similar to each training data in a plurality of pieces of training data from an original data set U by adopting a pre-trained semantic representation model and an approximate nearest neighbor (Approximate Nearest Neighbors; ANN) retrieval algorithm, and adding the similar data as extended training data into a training data set A;
the Pre-trained semantic representation model adopted in the embodiment may be a semantic representation model obtained through Pre-training, and the semantic representation model may be trained by a large amount of non-labeling data, so that semantic representation can be accurately performed. For example, the semantic representation in this embodiment may take the form of a vector.
For each training data in the training data set a obtained, a semantic representation model is used to obtain semantic representations of each training data, and it should be noted that, when the semantic representations are obtained here, only the data portions other than the labels in the training data are subjected to semantic representation. Then, semantic representation of each piece of original data in the original data set U is obtained by adopting a semantic representation model. At this time, each piece of training data may be represented as a vector, each piece of original data in the original data set U may also be represented as a vector, and then, based on a calculation manner of similarity of the vectors, an ANN search algorithm may be adopted to calculate similar data in the original data set U that is most similar to each piece of training data. For each piece of training data in the pieces of training data, a piece of similar data which is the closest can be obtained in this way, and a plurality of pieces of similar data can be obtained altogether. The most similar data are used as extended training data and added into the training data set A, so that the extension and enrichment of the training data set A are realized.
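The expansion step above can be sketched with a brute-force cosine search standing in for a real ANN index (a production system would use an approximate index such as Faiss or HNSW). The function name, threshold value, and toy vectors are assumptions for illustration; the semantic vectors would come from the pre-trained semantic representation model.

```python
import numpy as np

def expand_with_nearest(train_vecs, raw_vecs, raw_texts, sim_threshold=0.8):
    """For each training vector, find the most similar raw text by cosine
    similarity; keep it as extended training data only if the similarity
    exceeds the threshold (cf. the check described in the next paragraph)."""
    def normalize(m):
        m = np.asarray(m, dtype=float)
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    # Cosine similarity of every training vector against every raw vector.
    sims = normalize(train_vecs) @ normalize(raw_vecs).T
    expanded = []
    for row in sims:
        best = int(np.argmax(row))          # most similar raw datum
        if row[best] > sim_threshold:       # discard weakly similar matches
            expanded.append(raw_texts[best])
    return expanded
```

With toy 2-D vectors, `expand_with_nearest([[1, 0], [0, 1]], [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]], ["a", "b", "c"])` keeps "a" and "b" and never returns "c", whose best similarity is negative.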
Further alternatively, it is considered that the most similar data of some training data is not very similar to the training data, and adding such most similar data to the training data set a may result in a decrease in accuracy of the added training data. Therefore, before the most similar data of each training data is used as the expanded training data and added into the training data set A, whether the similarity between the most similar data and the corresponding training data is larger than a preset similarity threshold value or not can be judged first, if so, the most similar data is used as the expanded training data and added into the training data set A; otherwise, the most similar data is discarded and not added to training data set A.
Further alternatively, in this embodiment, a manual verification manner may be also adopted to verify each piece of extended training data added into the training data set a, so as to ensure accuracy of the newly added extended training data.
For example, FIG. 3 is an exemplary diagram of the embodiment shown in FIG. 2. As shown in fig. 3, taking training data required for mining emotion analysis tasks as an example, a flowchart of mining training data is adopted by the mining method of training data in steps S201 to S204 described above in this embodiment. As shown in fig. 3, a plurality of pieces of raw data are acquired, constituting a raw data set U. Each data screening rule in the set of data screening rules P is configured by a person skilled in the art based on his experience. And then, screening a plurality of pieces of training data from the original data set U by adopting each data screening rule to form a training data set A. As shown in fig. 3, the training data set a obtained at this time includes training data for positive evaluation and training data for negative evaluation. And then searching the most similar data of each training data in the training data set A from the original data set U by adopting a semantic representation model and an ANN retrieval algorithm, and adding the most similar data with similarity larger than a preset similarity threshold value into the training data set A as expanded training data to obtain a final expanded training data set A.
The step S204 is an expansion method of the training data set a, which can add more accurate training data into the training data set a, and compared with the training data obtained in the step S203, the method can further enrich the number of training data in the training data set a, and can effectively ensure the accuracy of the added training data.
S205, training a target model M by adopting each training data in the training data set A;
s206, predicting labels and prediction probabilities of all original data in the rest data sets except the training data set A in the original data set U by adopting a target model M;
s207, according to each original data, the labels of each original data and the corresponding prediction probability and the preset probability threshold value in the residual data set, the original data with the prediction probability larger than the preset probability threshold value is mined from the residual data set, and the original data and the labels of the original data are used as extended training data to be added into the training data set A.
Steps S205-S207 of the present embodiment are one way of optionally expanding the training data set a. Alternatively, the steps S205 to S207 may be performed after the step S204. Steps S205-S207 may also be combined directly with steps S201-S203 without step S204 described above, constituting an alternative embodiment of the present application.
Since the training data set a obtained in the step S204 already includes more accurate training data, the training data set a may be used to train a target model M of a corresponding task. At this time, the target model M may be used to predict each original data in the remaining data sets except the training data set a in the original data set U, and predict the label and the prediction probability of each original data, where the prediction probability indicates the probability that the original data belongs to the label. For example, in the emotion analysis task described above, the trained target model M can predict whether each original data in the remaining data set tends to be positively evaluated or negatively evaluated, and what the corresponding prediction probability is. And then, the original data with the predicted probability larger than the preset probability threshold and the corresponding labels can be further combined with the preset probability threshold to serve as extended training data, and the extended training data are added into the training data set A. In a similar manner, a plurality of pieces of extended training data can be obtained and added into the training data set A together, so that the extension of the training data set A is realized. In this embodiment, the original data with the prediction probability greater than the preset probability threshold is considered as the predicted high confidence data, and the corresponding original data and the corresponding tag can be used together as one piece of extended training data. The accuracy of the extended training data acquired by the method is very high, and the quality of the training data in the training data set A can be effectively ensured.
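Steps S205-S207 amount to confidence-thresholded pseudo-labeling, which can be sketched as below. The `predict` interface returning a (label, probability) pair and the dummy model are assumptions; the patent does not fix a model API.

```python
def expand_by_self_training(model, remaining_data, prob_threshold=0.95):
    """Pseudo-label the remaining raw data with the trained target model M;
    keep only predictions whose probability clears the threshold."""
    extended = []
    for text in remaining_data:
        label, prob = model.predict(text)   # assumed interface
        if prob > prob_threshold:           # high-confidence data only
            extended.append((text, label))
    return extended

class DummyModel:
    """Stand-in for the trained target model M (illustration only)."""
    def predict(self, text):
        if "good" in text:
            return ("positive evaluation", 0.99)
        return ("negative evaluation", 0.50)   # low confidence, discarded

extended = expand_by_self_training(DummyModel(), ["the food was good", "average"])
```

Only the confidently labeled text is added back to training data set A; the low-confidence one stays in the remaining set for a later round.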
Further optionally, in this embodiment, new extended training data may be further added by using step S207, and the extended training data set a trains the target model M again until the accuracy of the target model M reaches the preset accuracy threshold, so that the target model M may be further trained by further using the extended and richer and comprehensive training data set a, and the accuracy of the target model M may be further improved.
For example, in this embodiment, optionally, the accuracy of the extended training data may also be checked manually, so as to ensure the accuracy of the training data added to the training data set a.
Fig. 4 is a schematic semantic representation of the semantic representation model of the present application. As shown in fig. 4, the semantic representation model employed in step S204 of this embodiment serves as a vector generator for producing the vector representation of a piece of text data. When performing semantic representation, the text of the text data is first segmented into a plurality of tokens, such as T1, ..., TN in fig. 4; when inputting to the semantic representation model, the start token [CLS], the tokens T1, ..., TN, and the separator [SEP] are input in sequence. The Transformer layers of the semantic representation model then jointly encode all the input information. When performing semantic representation, the average representation of the top N layers of the model may be used as the semantic representation of the corresponding text data, where N may be set to 1, 2, 3, or another positive integer according to actual requirements.
For example, when N is 1, the semantic representation of the topmost [CLS] position, the element-wise maximum over the representations of T1, ..., TN, and the semantic representation of the [SEP] position may be combined as the final semantic representation of the text data, i.e., its vector representation. The weight of each part can be preset according to actual requirements.
For other values of N, such as N = 3, the semantic representation of the text data in each layer may be obtained in the same way as for N = 1, and the per-layer representations of the top N layers are then averaged to obtain the final semantic representation.
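The pooling described above can be sketched numerically. The weights (0.4, 0.4, 0.2) are an arbitrary illustration of "preset according to actual requirements", and the vectors stand in for the model's layer outputs.

```python
import numpy as np

def layer_representation(cls_vec, token_vecs, sep_vec, weights=(0.4, 0.4, 0.2)):
    """One layer's text vector: weighted sum of the [CLS] vector, the
    element-wise max over token vectors, and the [SEP] vector.
    The weights are assumed values for illustration."""
    w_cls, w_max, w_sep = weights
    return w_cls * cls_vec + w_max * np.max(token_vecs, axis=0) + w_sep * sep_vec

def text_representation(layers, n=3):
    """Average the per-layer representations of the top n layers.
    `layers` is a list of (cls_vec, token_vecs, sep_vec) tuples, bottom-up."""
    per_layer = [layer_representation(c, t, s) for c, t, s in layers[-n:]]
    return np.mean(per_layer, axis=0)
```

For instance, with `cls = [1, 0]`, token vectors `[[0, 2], [2, 0]]` (element-wise max `[2, 2]`), and `sep = [0, 0]`, the layer representation is `0.4*[1,0] + 0.4*[2,2] + 0.2*[0,0] = [1.2, 0.8]`.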
Furthermore, in order for the semantic representation model to perform better in ANN retrieval, the application may also train the semantic representation model with weakly supervised aggregation information from articles. FIG. 5 is a training schematic of the semantic representation model of the present application. As shown in fig. 5, two articles may be selected during training; suppose article A includes paragraphs D1, D2, D3 and article B includes paragraphs D4, D5, D6, D7. Specifically, the semantic representation of each paragraph, i.e., a semantic vector, is generated by the method in fig. 4, and the cosine similarity of every pair of paragraphs is computed during training. Let Loss+ denote the cosine similarity of paragraph pairs from the same article and Loss- the cosine similarity of pairs from different articles; a paragraph's similarity with itself is 1 and is not counted in Loss+. The training objective is to make the difference between the mean of Loss+ and the mean of Loss- as large as possible: paragraphs in the same article should be as similar as possible, and paragraphs in different articles as dissimilar as possible. Although fig. 5 uses two articles as a group of training samples, in practical application N articles may be selected as a group, where N may be a positive integer greater than 2; the objective is the same. Training the semantic representation model in this way makes its semantic representations more accurate.
Alternatively, while the above scheme uses paragraphs as the granularity, sentences may be used as the granularity in practical applications. Similarly, the cosine similarity of sentences from the same article is Loss+, the cosine similarity of sentences from different articles is Loss-, and the optimization goal during training is still to make the difference between the average of Loss+ and the average of Loss- as large as possible; the training principle is the same.
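The Loss+/Loss- objective described above can be sketched as follows. The pairing logic mirrors the description (intra-article pairs contribute to Loss+, inter-article pairs to Loss-, self-pairs excluded), while the function names and toy vectors are illustrative assumptions; a real trainer would backpropagate through the encoder to maximize the gap, and it assumes at least two articles each contributing at least two segments.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_gap(vectors, article_ids):
    """Mean intra-article cosine similarity (Loss+) minus mean
    inter-article cosine similarity (Loss-), over all distinct pairs.
    Self-pairs (similarity 1) never appear, matching the text.
    Training would maximize the returned gap."""
    pos, neg = [], []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            s = cosine(vectors[i], vectors[j])
            (pos if article_ids[i] == article_ids[j] else neg).append(s)
    return float(np.mean(pos) - np.mean(neg))

# Toy paragraphs: two from article A, two from article B.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ids = ["A", "A", "B", "B"]
gap = similarity_gap(vecs, ids)  # training drives this gap upward
```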
Furthermore, by adopting the training data mining method described above, training data for various corresponding NLP tasks can be mined. Further task processing can then be performed based on the semantic representation model, converting it into task models for emotion analysis, semantic matching, question-answer matching, entity labeling, and the like, thereby reducing the use cost of the semantic representation model.
In the above embodiments, the emotion analysis task in the NLP field is taken as an example in many places to describe the technical scheme of the present application. In practical applications, the technical solution of this embodiment is also suitable for mining training data for tasks such as semantic matching, question-answer matching, and entity labeling in the NLP field; for details, reference may be made to the descriptions of the foregoing embodiments, which are not repeated here.
According to the training data mining method of this embodiment, no technical expert is required to manually label training data: a training data set can be mined through preconfigured data screening rules, and the training data set can then be expanded using the two training data mining methods described above. Moreover, the expanded training data obtained in this way is highly accurate, so the quality of the training data can be effectively guaranteed. Therefore, the technical scheme of this embodiment can mine training data automatically and intelligently, effectively reducing the acquisition cost of training data and improving its acquisition efficiency.
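As an illustration of the rule-based mining step, the sketch below applies a preconfigured rule set to unsupervised text to produce labeled training data. The concrete screening rules, regex patterns, labels, and corpus are hypothetical examples for a sentiment task, not rules disclosed by the application.

```python
import re

# Hypothetical data screening rules: each maps a regex over
# unsupervised text to a pseudo label (illustrative only).
SCREENING_RULES = [
    (re.compile(r"excellent|love|great"), "positive"),
    (re.compile(r"terrible|hate|awful"), "negative"),
]

def mine_training_data(original_dataset):
    """Apply every screening rule to every piece of unsupervised text;
    each matched text becomes one (text, label) piece of training data."""
    training_set = []
    for text in original_dataset:
        for pattern, label in SCREENING_RULES:
            if pattern.search(text.lower()):
                training_set.append((text, label))
                break  # first matching rule wins
    return training_set

corpus = ["I love this phone", "The battery is awful", "It arrived Monday"]
mined = mine_training_data(corpus)
# -> [("I love this phone", "positive"), ("The battery is awful", "negative")]
```

Texts matched by no rule (like the third sentence) stay in the remaining data set, which the expansion steps below draw from.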
FIG. 6 is a schematic diagram of a third embodiment according to the present application. As shown in fig. 6, the training data mining apparatus 600 provided in this embodiment includes:
the collection module 601 is configured to collect a plurality of pieces of unsupervised text as original data to form an original data set;
the acquisition module 602 is configured to acquire a preconfigured data screening rule set, where the data screening rule set includes a plurality of preconfigured data screening rules;
the mining module 603 is configured to mine a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set, so as to form a training data set.
The training data mining apparatus 600 of this embodiment uses the above modules to implement the mining of training data; the implementation principle and technical effect are the same as those of the related method embodiments, and reference may be made to the descriptions of the related method embodiments for details, which are not repeated here.
FIG. 7 is a schematic diagram of a fourth embodiment according to the present application. As shown in fig. 7, the training data mining apparatus 600 of this embodiment further details the technical scheme of the present application on the basis of the embodiment shown in fig. 6.
As shown in fig. 7, the training data mining apparatus 600 of the present embodiment further includes:
the retrieval module 604 is configured to acquire, from the original data set, the similar data closest to each of the plurality of pieces of training data by using a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
the expansion module 605 is configured to add each piece of closest similar data into the training data set as expanded training data.
Further alternatively, the training data mining apparatus 600 of the present embodiment further includes:
the determining module 606 is configured to determine and determine that the similarity between each of the most similar data and the corresponding training data is greater than a preset similarity threshold.
Further alternatively, the training data mining apparatus 600 of the present embodiment further includes:
a training module 607 for training the target model using each training data in the training data set.
Further optionally, the training data mining apparatus 600 of this embodiment further includes a prediction module 608;
the prediction module 608 is configured to predict, by using the target model, the label and prediction probability of each piece of original data in the remaining data set of the original data set other than the training data set;
the mining module 603 is further configured to mine, from the remaining data set, the original data whose prediction probability is greater than a preset probability threshold according to each piece of original data, its label, the corresponding prediction probability, and the preset probability threshold, and to add the original data together with its label into the training data set as expanded training data.
Further optionally, the training module 607 is further configured to train the target model again using the extended training data set until the accuracy of the target model reaches a preset accuracy threshold.
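The predict-and-expand loop carried out by the training, prediction, and mining modules can be sketched as below. The `predict` callable and the probability threshold are placeholders for the trained target model and the preset probability threshold; in practice the loop would repeat, retraining on the expanded set, until the model's accuracy reaches the preset accuracy threshold.

```python
def self_train(train_set, remaining, predict, prob_threshold=0.95):
    """One round of self-training expansion.

    train_set: list of (text, label) mined so far (model assumed
    already trained on it).
    remaining: unlabeled texts outside the training set.
    predict:   callable text -> (label, probability), standing in for
    the target model.
    Returns (expanded training set, leftover unlabeled pool)."""
    expanded, leftover = list(train_set), []
    for text in remaining:
        label, prob = predict(text)
        if prob > prob_threshold:
            expanded.append((text, label))   # high-confidence pseudo label
        else:
            leftover.append(text)            # stays unlabeled for now
    return expanded, leftover

# Toy stand-in for the target model's predictions.
toy_predict = lambda text: ("positive", 0.99) if "great" in text else ("negative", 0.60)
expanded, leftover = self_train(
    [("seed example", "positive")],
    ["a great phone", "an ordinary phone"],
    toy_predict)
# expanded -> [("seed example", "positive"), ("a great phone", "positive")]
# leftover -> ["an ordinary phone"]
```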
The training data mining apparatus 600 of this embodiment uses the above modules to implement the mining of training data; the implementation principle and technical effect are the same as those of the related method embodiments, and reference may be made to the descriptions of the related method embodiments for details, which are not repeated here.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
Fig. 8 is a block diagram of an electronic device implementing the training data mining method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 8, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 801 is taken as an example in fig. 8.
The memory 802 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the training data mining method provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the training data mining method provided by the present application.
The memory 802, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the training data mining method in the embodiments of the present application (e.g., the related modules shown in fig. 6 and fig. 7). The processor 801 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the training data mining method of the above method embodiments.
The memory 802 may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the electronic device implementing the training data mining method, and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 may optionally include memory remotely located with respect to the processor 801, and such remote memory may be connected via a network to the electronic device implementing the training data mining method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device implementing the training data mining method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or in other manners; in fig. 8, connection by a bus is taken as an example.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device implementing the training data mining method; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, and other input devices. The output device 804 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiments of the present application, an original data set is formed by collecting a plurality of pieces of unsupervised text as original data; a preconfigured data screening rule set is acquired, the set including a plurality of preconfigured data screening rules; and a plurality of pieces of training data are mined from the original data set according to each data screening rule in the set to form a training data set. Compared with manually labeled training data in the prior art, this scheme can mine training data automatically and intelligently without manual labeling, effectively reducing the acquisition cost of training data and improving its acquisition efficiency.
According to the technical scheme provided by the embodiments of the present application, no technical expert is required to manually label training data: a training data set can be mined through preconfigured data screening rules, and further expanded using the two training data mining methods described above. Moreover, the expanded training data obtained in this way is highly accurate, so the quality of the training data can be effectively guaranteed. Therefore, this technical scheme can mine training data automatically and intelligently, effectively reducing the acquisition cost of training data and improving its acquisition efficiency.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.
Claims (8)
1. A method of mining training data, wherein the method comprises:
collecting a plurality of pieces of unsupervised text serving as original data to form an original data set;
acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
according to each data screening rule in the data screening rule set, mining a plurality of pieces of training data from the original data set to form a training data set;
the method further comprises the steps of:
training a target model by adopting each training data in the training data set;
predicting, by using the target model, the label and prediction probability of each piece of original data in the remaining data set of the original data set other than the training data set;
mining, from the remaining data set, the original data whose prediction probability is greater than a preset probability threshold according to each piece of original data, the label of each piece of original data, the corresponding prediction probability, and the preset probability threshold, and adding the original data together with its label into the training data set as expanded training data;
and training the target model again by adopting the expanded training data set until the accuracy of the target model reaches a preset accuracy threshold.
2. The method of claim 1, wherein, after mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set, the method further comprises:
acquiring similar data closest to each training data in the plurality of training data from the original data set by adopting a pre-trained semantic representation model and an approximate nearest neighbor search algorithm;
and adding each nearest similar data as extended training data into the training data set.
3. The method of claim 2, wherein, before adding each piece of closest similar data into the training data set as expanded training data, the method further comprises:
and judging and determining that the similarity between each nearest similar data and the corresponding training data is larger than a preset similarity threshold value.
4. A training data mining apparatus, wherein the apparatus comprises:
the collection module is used for collecting a plurality of pieces of unsupervised text as original data to form an original data set;
the acquisition module is used for acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
the mining module is used for mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set;
the apparatus further comprises:
the training module is used for training a target model by adopting each training data in the training data set;
the prediction module is used for predicting, by using the target model, the label and prediction probability of each piece of original data in the remaining data set of the original data set other than the training data set;
the mining module is further configured to mine, from the remaining data set, the original data whose prediction probability is greater than a preset probability threshold according to each piece of original data, its label, the corresponding prediction probability, and the preset probability threshold, and to add the original data together with its label into the training data set as expanded training data;
the training module is further configured to retrain the target model using the extended training data set until the accuracy of the target model reaches a preset accuracy threshold.
5. The apparatus of claim 4, wherein the apparatus further comprises:
the retrieval module is used for acquiring similar data closest to each training data in the plurality of pieces of training data from the original data set by adopting a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
and the expansion module is used for taking each nearest similar data as expanded training data and adding the expanded training data into the training data set.
6. The apparatus of claim 5, wherein the apparatus further comprises:
the judging module is used for judging and determining that the similarity between each nearest similar data and the corresponding training data is larger than a preset similarity threshold value.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3.
8. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010576205.XA CN111859953B (en) | 2020-06-22 | 2020-06-22 | Training data mining method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111859953A CN111859953A (en) | 2020-10-30 |
CN111859953B true CN111859953B (en) | 2023-08-22 |
Family
ID=72988022
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112465023B (en) * | 2020-11-27 | 2021-06-18 | 自然资源部第一海洋研究所 | Method for expanding training data of geological direction artificial neural network |
CN112685536B (en) * | 2020-12-25 | 2024-06-07 | 中国平安人寿保险股份有限公司 | Text dialogue method, text dialogue device, electronic equipment and storage medium |
CN112784050A (en) * | 2021-01-29 | 2021-05-11 | 北京百度网讯科技有限公司 | Method, device, equipment and medium for generating theme classification data set |
CN113656575B (en) * | 2021-07-13 | 2024-02-02 | 北京搜狗科技发展有限公司 | Training data generation method and device, electronic equipment and readable medium |
CN115640808B (en) * | 2022-12-05 | 2023-03-21 | 苏州浪潮智能科技有限公司 | Text labeling method and device, electronic equipment and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108764372A (en) * | 2018-06-08 | 2018-11-06 | Oppo广东移动通信有限公司 | Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set |
CN110276075A (en) * | 2019-06-21 | 2019-09-24 | 腾讯科技(深圳)有限公司 | Model training method, name entity recognition method, device, equipment and medium |
CN110610197A (en) * | 2019-08-19 | 2019-12-24 | 北京迈格威科技有限公司 | Method and device for mining difficult sample and training model and electronic equipment |
WO2020019252A1 (en) * | 2018-07-26 | 2020-01-30 | 深圳前海达闼云端智能科技有限公司 | Artificial intelligence model training method and device, storage medium and robot |
CN111046952A (en) * | 2019-12-12 | 2020-04-21 | 深圳市随手金服信息科技有限公司 | Method and device for establishing label mining model, storage medium and terminal |
CN111309912A (en) * | 2020-02-24 | 2020-06-19 | 深圳市华云中盛科技股份有限公司 | Text classification method and device, computer equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10678816B2 (en) * | 2017-08-23 | 2020-06-09 | Rsvp Technologies Inc. | Single-entity-single-relation question answering systems, and methods |
US10902207B2 (en) * | 2018-09-13 | 2021-01-26 | International Business Machines Corporation | Identifying application software performance problems using automated content-based semantic monitoring |
Non-Patent Citations (1)
Title |
---|
Zhang Yihao et al., "A Hybrid Recommendation Method Based on Deep Sentiment Analysis of User Reviews and Multi-View Collaborative Fusion", Chinese Journal of Computers, 2019, Vol. 42, No. 6, pp. 1316-1333. *
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||