CN111859953A - Training data mining method and device, electronic equipment and storage medium - Google Patents

Training data mining method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN111859953A
CN111859953A CN202010576205.XA
Authority
CN
China
Prior art keywords
data
training data
training
data set
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010576205.XA
Other languages
Chinese (zh)
Other versions
CN111859953B (en)
Inventor
王硕寰
庞超
孙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010576205.XA priority Critical patent/CN111859953B/en
Publication of CN111859953A publication Critical patent/CN111859953A/en
Application granted granted Critical
Publication of CN111859953B publication Critical patent/CN111859953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a training data mining method and device, electronic equipment and a storage medium, and relates to the technical field of natural language processing based on artificial intelligence. The specific implementation scheme is as follows: collecting a plurality of unsupervised texts serving as original data to form an original data set; acquiring a pre-configured data screening rule set, wherein the data screening rule set comprises a plurality of pre-configured data screening rules; and mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set. Compared with manual labeling of training data in the prior art, training data can be mined automatically and intelligently without manual annotation, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.

Description

Training data mining method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of natural language processing based on artificial intelligence, and specifically relates to a training data mining method and device, electronic equipment and a storage medium.
Background
In recent years, pre-training models represented by the Bidirectional Encoder Representations from Transformers (BERT) model have been proposed, which adopt a two-stage training paradigm of Pre-training and Fine-tuning and have greatly improved the effect of various Natural Language Processing (NLP) tasks. The BERT model adopts a deep Transformer model structure, learns context-dependent representations from massive unsupervised texts, and solves various NLP tasks such as text matching, text generation, emotion classification, text summarization, question answering and retrieval in a universal, unified manner.
Pre-training refers to constructing self-supervised learning tasks, such as cloze (fill-in-the-blank) prediction and sentence ordering, using massive unlabeled texts as training data. Fine-tuning refers to task adaptation performed with a small amount of manually labeled task texts as training data, so as to obtain a model for a specific natural language processing task.
In the existing Fine-tuning stage, manually labeled training data are used. However, manual labeling is expensive and often requires experienced technical experts, so in the traditional Fine-tuning stage the acquisition cost of training data is high and the acquisition efficiency is very low.
Disclosure of Invention
In order to solve the technical problem, the application provides a training data mining method and device, an electronic device and a storage medium.
According to an aspect of the present application, there is provided a method for mining training data, wherein the method includes:
collecting a plurality of unsupervised texts serving as original data to form an original data set;
acquiring a pre-configured data screening rule set, wherein the data screening rule set comprises a plurality of pre-configured data screening rules;
and mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
According to another aspect of the present application, there is provided a training data mining apparatus, wherein the apparatus includes:
the acquisition module is used for acquiring a plurality of unsupervised texts serving as original data to form an original data set;
the obtaining module is used for obtaining a pre-configured data screening rule set, wherein the data screening rule set comprises a plurality of pre-configured data screening rules;
and the mining module is used for mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
According to still another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to the technology of this application, compared with manual labeling of training data in the prior art, training data can be mined automatically and intelligently without manual annotation, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is an exemplary diagram of the embodiment shown in FIG. 2;
FIG. 4 is a semantic representation schematic of the semantic representation model of the present application;
FIG. 5 is a training schematic of the semantic representation model of the present application;
FIG. 6 is a schematic illustration according to a third embodiment of the present application;
FIG. 7 is a schematic illustration according to a fourth embodiment of the present application;
fig. 8 is a block diagram of an electronic device for implementing a training data mining method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram according to a first embodiment of the present application; as shown in fig. 1, the present application provides a training data mining method, which may specifically include the following steps:
S101, collecting a plurality of unsupervised texts serving as original data to form an original data set;
S102, acquiring a pre-configured data screening rule set, wherein the data screening rule set comprises a plurality of pre-configured data screening rules;
S103, mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
The execution subject of the training data mining method of this embodiment may be a training data mining device, which may be an electronic entity or a software application; when in use, the device runs on a computer device to mine training data.
In this embodiment, a large number of unsupervised texts may be collected from the network, and each unsupervised text corresponds to one piece of original data, so that a plurality of pieces of original data may be obtained and added to one data set to form an original data set. In addition, optionally, a large amount of unlabeled texts in a certain field provided by a model user can be collected as raw data and added into the raw data set. The model user of the present embodiment may be a user of a model to be trained on the mined training data.
In this embodiment, the plurality of preconfigured data filtering rules included in the data filtering rule set may be configured by the model user based on their own experience, summarizing the patterns of the training data to be mined. That is, each preconfigured data filtering rule may also be referred to as an artificial prior rule.
Further, in this embodiment, according to each data filtering rule in the data filtering rule set, several pieces of training data may be mined from the original data set to form a training data set. Each piece of training data mined conforms to a data screening rule. In this way, by adopting each data screening rule in the data screening rule set, a plurality of training data can be mined from the original data set to form a training data set.
In the training data mining method of this embodiment, an original data set is formed by collecting a plurality of unsupervised texts serving as original data; a pre-configured data screening rule set is acquired, wherein the data screening rule set comprises a plurality of pre-configured data screening rules; and a plurality of training data are mined from the original data set according to each data screening rule in the data screening rule set to form a training data set. Compared with manual labeling of training data in the prior art, training data can be mined automatically and intelligently without manual annotation, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.
FIG. 2 is a schematic diagram according to a second embodiment of the present application; as shown in fig. 2, the method for mining training data according to the present embodiment further introduces the technical solution of the present application in more detail on the basis of the technical solution of the embodiment shown in fig. 1. As shown in fig. 2, the training data mining method of this embodiment may specifically include the following steps:
s201, collecting a plurality of unsupervised texts serving as original data to form an original data set U;
that is, in this embodiment, each piece of original data corresponds to one unsupervised text, and both the original data and the training data of this embodiment are text data.
S202, acquiring a pre-configured data screening rule set P, wherein the data screening rule set P comprises a plurality of pre-configured data screening rules;
optionally, in this embodiment, each data filtering rule may be expressed as a regular expression, and/or each data filtering rule carries a corresponding label. The label of this embodiment is used for labeling the screened training data; for example, it can serve as an emotional-tendency label, such as positive or negative evaluation, when performing an emotion analysis task. That is, in this embodiment, each data filtering rule included in the data filtering rule set is used to filter supervised annotation data from the original data set as training data.
For example, in the emotion analysis task, when mining training data of users' evaluations of accommodation, the data screening rules for positive evaluation may be set to include rules such as "very quiet", "quite affordable", "will come again", "the breakfast/lunch/dinner is delicious", or "very close to [location], very convenient". The corresponding data screening rules for negative evaluation may be set to include rules such as "will not book again", "bad lodging", "too loud", and "the price is [number] yuan, too expensive". In practical applications, data screening rules for the training data of various other NLP tasks, such as semantic matching, machine translation and dialogue understanding, can be configured according to similar principles, which is not described in detail here.
S203, mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set P to form a training data set A;
specifically, when mining training data from the original data set with the data screening rules, it can be judged whether a piece of original data hits a data screening rule; if so, the original data can be extracted as training data, and the label of the data screening rule can be used as the label of that training data. For example, the piece of raw data "I like this hotel very much, it is quiet" hits the positive-evaluation data screening rule containing "quiet"; at this point, the raw data together with the corresponding label "positive evaluation" form a piece of training data. In a similar manner, a plurality of training data may be mined from the original data set with each data filtering rule in the data filtering rule set to form the training data set A.
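To make the rule-matching of steps S202-S203 concrete, the following is a minimal Python sketch of how data screening rules carrying labels could be applied to raw texts; the specific rule strings, label names and the mine_training_data helper are illustrative assumptions rather than part of the patent.

import re

# Hypothetical rule set: each data screening rule is a regular expression
# plus the label it assigns to any raw text it hits (steps S202/S203).
SCREENING_RULES = [
    (re.compile(r"very quiet|quite affordable|will come again"), "positive evaluation"),
    (re.compile(r"too loud|will not book again|too expensive"), "negative evaluation"),
]

def mine_training_data(raw_dataset):
    """Scan the raw data set U and keep every text that hits a rule,
    attaching the rule's label to form a (text, label) training pair."""
    training_set = []
    for text in raw_dataset:
        for pattern, label in SCREENING_RULES:
            if pattern.search(text):
                training_set.append((text, label))
                break  # one hit is enough to extract this text
    return training_set

raw_data = [
    "I like this hotel very much, it is very quiet.",
    "The room was too loud, I will not book again.",
    "The lobby is decorated in a modern style.",  # hits no rule, so it is not mined
]
print(mine_training_data(raw_data))

In practice each rule could be a richer regular expression with placeholders such as [location] or [number], as in the examples above.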
Optionally, in order to ensure the accuracy of the mined training data, a manual verification method may be further used herein to check the accuracy of the mined training data, so as to ensure the accuracy of the training data in the training data set a.
S204, acquiring similar data which is most similar to each training data in a plurality of training data from an original data set U by adopting a pre-trained semantic representation model and an Approximate Nearest Neighbor (ANN) retrieval algorithm, and adding the similar data serving as expanded training data into a training data set A;
it should be noted that the Pre-trained semantic representation model adopted in this embodiment may be a semantic representation model obtained through Pre-training, and the semantic representation model may be trained through a large amount of label-free data, so as to accurately perform semantic representation. For example, the semantic representation in the present embodiment may take the form of a vector.
For each piece of training data in the training data set A that has been obtained, the semantic representation model is used to obtain its semantic representation; note that only the data part of the training data, excluding the label, is semantically represented here. The semantic representation of each piece of original data in the original data set U is then obtained with the same semantic representation model. At this point, each piece of training data and each piece of original data in the original data set U can be represented as a vector, and the similar data closest to each piece of training data can be found in the original data set U with an ANN retrieval algorithm based on vector similarity. For each piece of training data, one piece of most similar data can be obtained in this manner, yielding a plurality of pieces of similar data in total. These most similar data are added into the training data set A as expanded training data, thereby expanding and enriching the training data set A.
Further optionally, considering that the most similar data of some training data may not actually be very similar to that training data, adding such data to the training data set A could reduce the accuracy of the added training data. Therefore, before the closest similar data of each training data is added into the training data set A as expanded training data, it can be judged whether the similarity between the closest similar data and the corresponding training data is greater than a preset similarity threshold; if so, the closest similar data is added into the training data set A as expanded training data; otherwise, the most similar data is discarded and not added to the training data set A.
Further optionally, in this embodiment, manual verification may also be adopted to check each piece of expanded training data added to the training data set A, so as to ensure the accuracy of the newly added expanded training data.
For example, FIG. 3 is an exemplary diagram of the embodiment shown in FIG. 2, showing a flow of mining the training data required for an emotion analysis task with the training data mining method of steps S201 to S204 of this embodiment. As shown in fig. 3, a plurality of pieces of raw data are collected to constitute a raw data set U. Each data screening rule in the data screening rule set P is configured by a person skilled in the art based on experience. Then, a plurality of training data are screened from the original data set U with each data screening rule to form a training data set A; the training data set A obtained at this point includes training data for positive evaluation and training data for negative evaluation. Then, a semantic representation model and an ANN retrieval algorithm are adopted to search the original data set U for the closest similar data of each training data in the training data set A, and the closest similar data whose similarity is greater than a preset similarity threshold are added to the training data set A as expanded training data, obtaining the final expanded training data set A.
Step S204 is an expansion method for the training data set A: compared with the training data obtained in step S203, it can add relatively accurate training data into the training data set A, further enriching the amount of training data in the training data set A while effectively ensuring the accuracy of the added training data.
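As an illustration of the retrieval-and-threshold expansion of step S204, the following is a minimal Python sketch. It assumes an encode function standing in for the pre-trained semantic representation model, uses scikit-learn's NearestNeighbors as a stand-in for a production ANN index, and the similarity threshold value is an illustrative assumption.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def expand_with_nearest_neighbors(encode, training_set, raw_dataset, sim_threshold=0.8):
    """For each mined (text, label) pair, retrieve the closest raw text by
    cosine similarity and, if it is similar enough, add it with the same label."""
    raw_vecs = np.stack([encode(t) for t in raw_dataset])
    # Brute-force cosine index; a real system would use an ANN library instead.
    index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(raw_vecs)

    expanded = list(training_set)
    for text, label in training_set:
        dists, idxs = index.kneighbors(encode(text)[None, :], n_neighbors=2)
        for dist, idx in zip(dists[0], idxs[0]):
            candidate = raw_dataset[idx]
            if candidate == text:
                continue                      # skip the query text itself
            similarity = 1.0 - dist           # cosine distance -> cosine similarity
            if similarity > sim_threshold:    # preset similarity threshold of S204
                expanded.append((candidate, label))
            break                             # only consider the closest distinct neighbor
    return expanded

The manual spot-check suggested in the text fits naturally after this function returns, before the expanded set is used for training.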
S205, training a target model M by adopting each training data in the training data set A;
S206, predicting labels and prediction probabilities of all original data in residual data sets except the training data set A in the original data set U by adopting the target model M;
S207, according to the original data in the residual data set, the label of the original data, the corresponding prediction probability and a preset probability threshold, the original data with the prediction probability larger than the preset probability threshold are mined from the residual data set, and the original data and the label of the original data are used as expanded training data and added into the training data set A.
Steps S205-S207 of this embodiment are a way to optionally extend the training data set a. Alternatively, the steps S205-S207 may be performed after the step S204. Instead of the above step S204, steps S205-S207 may be directly combined with the above steps S201-S203 to form an alternative embodiment of the present application.
Since the training data set A obtained in step S204 already includes relatively accurate training data, the training data set A may be used to train a target model M corresponding to a task. The target model M may then be used to predict each original data in the residual data set of the original data set U excluding the training data set A, predicting the label and the prediction probability of each original data, where the prediction probability represents the probability that the original data belongs to the label. For example, in the emotion analysis task described above, the trained target model M can predict whether each original data in the residual data set tends toward positive or negative evaluation, together with the corresponding prediction probability. Then, by further combining a preset probability threshold, the original data whose prediction probability is greater than the preset probability threshold, together with the corresponding label, can be added to the training data set A as a piece of expanded training data. In a similar manner, multiple pieces of expanded training data may be acquired and added to the training data set A to expand it. In this embodiment, original data whose prediction probability is greater than the preset probability threshold is regarded as high-confidence predicted data, and the corresponding original data and label can together serve as a piece of expanded training data. The accuracy of the expanded training data acquired in this way is very high, which effectively ensures the quality of the training data in the training data set A.
Further optionally, in this embodiment, step S207 may be repeated to add new expanded training data and to train the target model M again on the expanded training data set A, until the accuracy of the target model M reaches a preset accuracy threshold. In this way, the target model M is further trained with the richer and more comprehensive expanded training data set A, and its accuracy is further improved.
Optionally, in this embodiment, the accuracy of the expanded training data may also be verified manually, so as to ensure the accuracy of the training data added to the training data set A.
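The following is a minimal Python sketch of one round of the self-training expansion of steps S205-S207, assuming a scikit-learn-style classifier as the target model M and a featurize function that turns a text into a feature vector; the probability threshold is an illustrative assumption.

import numpy as np

def self_training_expand(model, featurize, training_set, remaining_texts, prob_threshold=0.95):
    """One round of steps S205-S207: train M on the training data set A,
    predict labels for the remaining raw data, and add high-confidence
    predictions back into A as expanded training data."""
    texts, labels = zip(*training_set)
    model.fit([featurize(t) for t in texts], list(labels))       # step S205

    expanded, still_remaining = list(training_set), []
    for text in remaining_texts:
        probs = model.predict_proba([featurize(text)])[0]         # step S206
        best = int(np.argmax(probs))
        if probs[best] > prob_threshold:                          # step S207
            expanded.append((text, model.classes_[best]))
        else:
            still_remaining.append(text)
    return expanded, still_remaining

Repeating this round, retraining M on the newly expanded set each time, mirrors the iterative variant described above that stops once the accuracy of the target model M reaches the preset accuracy threshold.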
Fig. 4 is a semantic representation schematic diagram of the semantic representation model according to the present application. As shown in fig. 4, the semantic representation model adopted in step S204 of this embodiment serves as a vector generator for generating the vector representation of a piece of text data. When performing semantic representation, the text in the text data is first segmented to obtain a plurality of segmented words T1, ..., TN. When inputting into the semantic representation model, the start symbol CLS, the segmented words T1, ..., TN, and the separator SEP are input in sequence. The multiple Transformer layers in the semantic representation model then jointly encode all of the input information. When performing semantic representation, an average representation over the top N layers of the semantic representation model may be used as the semantic representation of the corresponding text data, where N can be set to 1, 2, 3 or another positive integer according to actual requirements.
For example, when N is 1, the semantic representation at the CLS position of the top layer, the average and maximum semantic representations over the words T1, ..., TN, and the semantic representation at the SEP position can be weighted and summed (Sum) to obtain the final semantic representation of the text data, i.e., the vector representation of the text data. The weight of each part can be preset according to actual requirements.
For other values of N, for example N equal to 3, the semantic representation of the text data at each of the top N layers may be obtained in the same manner as for N equal to 1, and the semantic representations of the text data over the N layers are then averaged to obtain the final semantic representation.
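A minimal Python (PyTorch) sketch of the pooling just described is given below; it assumes the top-N layer outputs are already available as a tensor of shape (N, sequence length, hidden size) with CLS first and SEP last, and the equal weights are an illustrative assumption.

import torch

def pool_semantic_representation(top_layers, weights=(0.25, 0.25, 0.25, 0.25)):
    """top_layers: tensor of shape (N, seq_len, dim) holding the outputs of the
    top N Transformer layers for CLS, T1 ... Tn, SEP. Per layer: weighted sum of
    the CLS vector, the mean and max over T1 ... Tn, and the SEP vector; the
    final representation is the average of the N per-layer vectors."""
    w_cls, w_mean, w_max, w_sep = weights
    per_layer = []
    for layer in top_layers:                    # layer: (seq_len, dim)
        cls_vec, sep_vec = layer[0], layer[-1]
        tokens = layer[1:-1]                    # the word representations T1 ... Tn
        pooled = (w_cls * cls_vec
                  + w_mean * tokens.mean(dim=0)
                  + w_max * tokens.max(dim=0).values
                  + w_sep * sep_vec)
        per_layer.append(pooled)
    return torch.stack(per_layer).mean(dim=0)   # average over the top N layers

# Example: N = 3 top layers, a 6-position sequence, hidden size 8.
print(pool_semantic_representation(torch.randn(3, 6, 8)).shape)   # torch.Size([8])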
Further, in order to enable the semantic representation model to achieve a better effect in ANN retrieval, the semantic representation model can be trained with the weakly supervised aggregation information of articles. FIG. 5 is a training diagram of the semantic representation model of the present application. As shown in fig. 5, two articles may be selected at training time; assume article A includes paragraphs D1, D2 and D3 and article B includes paragraphs D4, D5, D6 and D7. Specifically, the method in fig. 4 may be adopted to generate the semantic representation, i.e., the semantic vector, of each paragraph. During training, the Cosine similarity of any two paragraphs is calculated; the Cosine similarities of paragraph pairs from the same article are denoted Loss+, and those of paragraph pairs from different articles are denoted Loss-. The optimization goal during training is to make the difference between the average value of Loss+ and the average value of Loss- as large as possible, where the similarity of a paragraph with itself is 1 and is not included in Loss+. That is, the training purpose is to make the paragraphs of the same article as similar as possible and the paragraphs of different articles as dissimilar as possible. In addition, although fig. 5 selects two articles as a group of training samples, in practical applications N articles may also be selected as a group of training samples, where N may be a positive integer greater than 2; the semantic representation model is trained in a similar manner, making the paragraphs of the same article as similar as possible and the paragraphs of different articles as dissimilar as possible. Training the semantic representation model in the above manner enables more accurate semantic representation. Optionally, the above scheme takes paragraphs as the granularity; in practical applications, sentences may also be taken as the granularity. In the same way, the Cosine similarities of sentence pairs from the same article are Loss+ and those of sentence pairs from different articles are Loss-, and the optimization goal during training is likewise to make the difference between the average value of Loss+ and the average value of Loss- as large as possible.
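The weakly supervised objective described above can be sketched in Python as follows; the batch layout and the plain difference-of-means objective follow the description in this section, and any detail not stated there (such as the normalization step) is an assumption.

import torch
import torch.nn.functional as F

def article_contrastive_loss(paragraph_vecs, article_ids):
    """paragraph_vecs: (P, dim) embeddings of paragraphs from several articles;
    article_ids: length-P list mapping each paragraph to its source article.
    Minimizing the returned value maximizes (mean cosine similarity of
    same-article pairs, i.e. Loss+) minus (mean cosine similarity of
    cross-article pairs, i.e. Loss-)."""
    vecs = F.normalize(paragraph_vecs, dim=1)
    sims = vecs @ vecs.T                               # pairwise Cosine similarities
    ids = torch.tensor(article_ids)
    same = ids[:, None] == ids[None, :]
    self_pairs = torch.eye(len(article_ids), dtype=torch.bool)
    loss_pos = sims[same & ~self_pairs]   # same article, self-similarity (= 1) excluded
    loss_neg = sims[~same]                # different articles
    return loss_neg.mean() - loss_pos.mean()

# Example: article A has paragraphs D1-D3, article B has D4-D7.
vecs = torch.randn(7, 16, requires_grad=True)
loss = article_contrastive_loss(vecs, [0, 0, 0, 1, 1, 1, 1])
loss.backward()   # gradients flow back into the paragraph vectors (and, in practice, the encoder)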
Furthermore, by adopting the training data mining method of this embodiment, the corresponding training data of various NLP tasks can be mined, and task processing can then be performed based on the semantic representation model, so that the semantic representation model is converted into task models for emotion analysis, semantic matching, question-answer matching, entity tagging and the like, reducing the cost of using the semantic representation model.
In the above embodiments, the emotion analysis task in the NLP field is taken as an example to describe the technical solution of the present application. In practical application, the technical solution of this embodiment may be applicable to mining training data of tasks such as semantic matching, question-answer matching, and entity tagging in the NLP field, and details of the above embodiment may be referred to, and are not described herein again.
According to the training data mining method of this embodiment, technical experts do not need to manually label the training data: the training data set can be mined with the pre-configured data screening rules, and the two expansion methods described above can further be adopted to enlarge the training data set. In addition, the expanded training data acquired by the method of this embodiment has very high accuracy, and the quality of the training data can be effectively ensured. Therefore, the technical scheme of this embodiment can automatically and intelligently mine training data, effectively save the acquisition cost of training data and improve the acquisition efficiency of training data.
FIG. 6 is a schematic illustration according to a third embodiment of the present application; as shown in fig. 6, the training data mining apparatus 600 according to this embodiment includes:
the acquisition module 601 is used for acquiring a plurality of unsupervised texts serving as original data to form an original data set;
an obtaining module 602, configured to obtain a preconfigured data screening rule set, where the data screening rule set includes a plurality of preconfigured data screening rules;
the mining module 603 is configured to mine a plurality of training data from the original data set according to each data screening rule in the data screening rule set, so as to form a training data set.
The training data mining apparatus 600 of this embodiment uses the above modules to implement training data mining; its implementation principle and technical effect are the same as those of the related method embodiments described above. For details, reference may be made to the description of the related method embodiments, which is not repeated here.
FIG. 7 is a schematic illustration according to a fourth embodiment of the present application; as shown in fig. 7, the training data mining device 600 of the present embodiment further describes the technical solution of the present application in more detail based on the technical solution of the embodiment shown in fig. 6.
As shown in fig. 7, the training data mining apparatus 600 according to the present embodiment further includes:
a retrieval module 604, configured to obtain similar data that is closest to each piece of training data in the plurality of pieces of training data from the original data set by using a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
an expanding module 605, configured to add each closest similar data as expanded training data into the training data set.
Further optionally, the training data mining apparatus 600 of this embodiment further includes:
The determining module 606 is configured to judge and determine that the similarity between each closest similar data and the corresponding training data is greater than a preset similarity threshold.
Further optionally, the training data mining apparatus 600 of this embodiment further includes:
the training module 607 is configured to train the target model by using each training data in the training data set.
Further optionally, the training data mining apparatus 600 of this embodiment further includes a prediction module 608;
a prediction module 608, configured to predict, by using the target model, a label and a prediction probability of each original data in the remaining data sets except the training data set in the original data set;
The mining module 603 is further configured to mine, according to each original data, the label of each original data, the corresponding prediction probability, and a preset probability threshold, the original data with the prediction probability greater than the preset probability threshold from the remaining data set, and add the original data and the label of the original data as extended training data into the training data set.
Further optionally, the training module 607 is further configured to train the target model again by using the extended training data set until the accuracy of the target model reaches the preset accuracy threshold.
The training data mining apparatus 600 of this embodiment uses the above modules to implement training data mining; its implementation principle and technical effect are the same as those of the related method embodiments described above. For details, reference may be made to the description of the related method embodiments, which is not repeated here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 8 is a block diagram of an electronic device implementing a training data mining method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, memory 802, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example of a processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of mining training data provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the mining method of training data provided herein.
The memory 802, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., related modules shown in fig. 6 and 7) corresponding to the training data mining method in the embodiments of the present application. The processor 801 executes various functional applications of the server and data processing, i.e., implements the mining method of training data in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 802.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device implementing a mining method of training data, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected over a network to an electronic device implementing the mining method of training data. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device implementing the mining method of training data may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device implementing the training data mining method, such as a touch screen, keypad, mouse, trackpad, touch pad, pointing stick, one or more mouse buttons, trackball, joystick, or other input devices. The output device 804 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, an original data set is formed by collecting a plurality of unsupervised texts serving as original data; a pre-configured data screening rule set is acquired, wherein the data screening rule set comprises a plurality of pre-configured data screening rules; and a plurality of training data are mined from the original data set according to each data screening rule in the data screening rule set to form a training data set. Compared with manual labeling of training data in the prior art, training data can be mined automatically and intelligently without manual annotation, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.
According to the technical scheme of the embodiment of the application, the training data set can be mined by the preset data screening rules without manually marking the training data by technical experts, and the training data set can be expanded by further adopting the two training data mining methods. In addition, the extended training data acquired by the method of the embodiment has very high accuracy, and the quality of the training data can be effectively ensured. Therefore, the technical scheme of the embodiment can automatically and intelligently mine the training data, effectively save the acquisition cost of the training data and improve the acquisition efficiency of the training data.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method of mining training data, wherein the method comprises:
collecting a plurality of unsupervised texts serving as original data to form an original data set;
acquiring a pre-configured data screening rule set, wherein the data screening rule set comprises a plurality of pre-configured data screening rules;
and mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
2. The method of claim 1, wherein after mining a number of training data from the raw data set according to each of the data filtering rules of the set of data filtering rules, the method further comprises:
acquiring similar data which is most similar to each training data in the plurality of training data from the original data set by adopting a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
and adding each closest similar data serving as expanded training data into the training data set.
3. The method of claim 2, wherein before adding each of the most closely similar data as extended training data into the training data set, the method further comprises:
And judging and determining that the similarity between each closest similar data and the corresponding training data is greater than a preset similarity threshold.
4. The method according to any one of claims 1-3, wherein the method further comprises:
and training a target model by adopting each training data in the training data set.
5. The method of claim 4, wherein after training a target model using each of the training data in the set of training data, the method further comprises:
predicting labels and prediction probabilities of all the original data in residual data sets except the training data set in the original data set by adopting the target model;
and mining the original data with the prediction probability larger than the preset probability threshold from the residual data set according to each original data, the label of each original data, the corresponding prediction probability and a preset probability threshold, taking the original data and the label of the original data as expanded training data, and adding the expanded training data into the training data set.
6. The method of claim 5, wherein, according to each of the original data, the label of each of the original data and the corresponding prediction probability, and a preset probability threshold, the original data with the prediction probability greater than the preset probability threshold are mined from the remaining data set, and are used as extended training data together with the label of the original data, and after being added to the training data set, the method further comprises:
And adopting the expanded training data set to train the target model again until the accuracy of the target model reaches a preset accuracy threshold.
7. An apparatus for mining training data, wherein the apparatus comprises:
the acquisition module is used for acquiring a plurality of unsupervised texts serving as original data to form an original data set;
the obtaining module is used for acquiring a pre-configured data screening rule set, wherein the data screening rule set comprises a plurality of pre-configured data screening rules;
and the mining module is used for mining a plurality of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
8. The apparatus of claim 7, wherein the apparatus further comprises:
the retrieval module is used for acquiring similar data which is closest to each training data in the plurality of pieces of training data from the original data set by adopting a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
and the expansion module is used for adding each closest similar data serving as expanded training data into the training data set.
9. The apparatus of claim 8, wherein the apparatus further comprises:
and the judging module is used for judging and determining that the similarity between each closest similar data and the corresponding training data is greater than a preset similarity threshold.
10. The apparatus of any of claims 7-9, wherein the apparatus further comprises:
and the training module is used for training a target model by adopting each training data in the training data set.
11. The apparatus of claim 10, wherein the apparatus further comprises a prediction module;
the prediction module is used for predicting the label and the prediction probability of each original data in the residual data set except the training data set in the original data set by adopting the target model;
the mining module is further configured to mine the original data of which the prediction probability is greater than the preset probability threshold from the remaining data set according to each original data, the label of each original data, the corresponding prediction probability, and a preset probability threshold, and add the original data and the label of the original data as extended training data into the training data set.
12. The apparatus of claim 11, wherein:
And the training module is further used for adopting the expanded training data set to train the target model again until the accuracy of the target model reaches a preset accuracy threshold.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010576205.XA 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium Active CN111859953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576205.XA CN111859953B (en) 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010576205.XA CN111859953B (en) 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111859953A true CN111859953A (en) 2020-10-30
CN111859953B CN111859953B (en) 2023-08-22

Family

ID=72988022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576205.XA Active CN111859953B (en) 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111859953B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465023A (en) * 2020-11-27 2021-03-09 自然资源部第一海洋研究所 Extension algorithm of training data of geological direction artificial neural network
CN112784050A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Method, device, equipment and medium for generating theme classification data set
CN113656575A (en) * 2021-07-13 2021-11-16 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium
CN115640808A (en) * 2022-12-05 2023-01-24 苏州浪潮智能科技有限公司 Text labeling method and device, electronic equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN110610197A (en) * 2019-08-19 2019-12-24 北京迈格威科技有限公司 Method and device for mining difficult sample and training model and electronic equipment
WO2020019252A1 (en) * 2018-07-26 2020-01-30 深圳前海达闼云端智能科技有限公司 Artificial intelligence model training method and device, storage medium and robot
US20200089761A1 (en) * 2018-09-13 2020-03-19 International Business Machines Corporation Identifying application software performance problems using automated content-based semantic monitoring
CN111046952A (en) * 2019-12-12 2020-04-21 深圳市随手金服信息科技有限公司 Method and device for establishing label mining model, storage medium and terminal
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
WO2020019252A1 (en) * 2018-07-26 2020-01-30 深圳前海达闼云端智能科技有限公司 Artificial intelligence model training method and device, storage medium and robot
US20200089761A1 (en) * 2018-09-13 2020-03-19 International Business Machines Corporation Identifying application software performance problems using automated content-based semantic monitoring
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN110610197A (en) * 2019-08-19 2019-12-24 北京迈格威科技有限公司 Method and device for mining difficult sample and training model and electronic equipment
CN111046952A (en) * 2019-12-12 2020-04-21 深圳市随手金服信息科技有限公司 Method and device for establishing label mining model, storage medium and terminal
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张宜浩 et al.: "A Hybrid Recommendation Method Based on Deep Sentiment Analysis of User Reviews and Multi-View Collaborative Fusion", 《计算机学报》 (Chinese Journal of Computers), vol. 42, no. 6, pages 1316-1333 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465023A (en) * 2020-11-27 2021-03-09 自然资源部第一海洋研究所 Extension algorithm of training data of geological direction artificial neural network
CN112784050A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Method, device, equipment and medium for generating theme classification data set
CN113656575A (en) * 2021-07-13 2021-11-16 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium
CN113656575B (en) * 2021-07-13 2024-02-02 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium
CN115640808A (en) * 2022-12-05 2023-01-24 苏州浪潮智能科技有限公司 Text labeling method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN111859953B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN110717339B (en) Semantic representation model processing method and device, electronic equipment and storage medium
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111428008B (en) Method, apparatus, device and storage medium for training a model
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN112507715B (en) Method, device, equipment and storage medium for determining association relation between entities
CN112270379B (en) Training method of classification model, sample classification method, device and equipment
CN111507104B (en) Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN111783468B (en) Text processing method, device, equipment and medium
CN111832292A (en) Text recognition processing method and device, electronic equipment and storage medium
CN112560479B (en) Abstract extraction model training method, abstract extraction device and electronic equipment
CN111859953B (en) Training data mining method and device, electronic equipment and storage medium
JP7234483B2 (en) Entity linking method, device, electronic device, storage medium and program
CN112148881B (en) Method and device for outputting information
CN110427627A (en) Task processing method and device based on semantic expressiveness model
CN111339759A (en) Method and device for training field element recognition model and electronic equipment
CN111950291A (en) Semantic representation model generation method and device, electronic equipment and storage medium
CN115688920B (en) Knowledge extraction method, training device, training equipment and training medium for model
CN111667056A (en) Method and apparatus for searching model structure
CN111539209A (en) Method and apparatus for entity classification
CN113360001A (en) Input text processing method and device, electronic equipment and storage medium
CN112434536A (en) Document understanding method, apparatus and storage medium
CN111611808A (en) Method and apparatus for generating natural language model
CN112597768B (en) Text auditing method, device, electronic equipment, storage medium and program product
CN112466277A (en) Rhythm model training method and device, electronic equipment and storage medium
CN112015866A (en) Method, device, electronic equipment and storage medium for generating synonymous text

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant