CN111859953B - Training data mining method and device, electronic equipment and storage medium

Training data mining method and device, electronic equipment and storage medium

Info

Publication number
CN111859953B
Authority
CN
China
Prior art keywords
data
training data
training
original
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010576205.XA
Other languages
Chinese (zh)
Other versions
CN111859953A (en)
Inventor
王硕寰
庞超
孙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010576205.XA priority Critical patent/CN111859953B/en
Publication of CN111859953A publication Critical patent/CN111859953A/en
Application granted
Publication of CN111859953B publication Critical patent/CN111859953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a training data mining method and device, an electronic device, and a storage medium, and relates to the field of natural language processing technology based on artificial intelligence. The specific implementation scheme is as follows: collecting a plurality of pieces of unsupervised text as original data to form an original data set; acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules; and mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set. Compared with manually labeled training data in the prior art, the method and the device can mine training data automatically and intelligently without manual labeling, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.

Description

Training data mining method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, in particular to the field of natural language processing technologies based on artificial intelligence, and more particularly to a training data mining method and device, an electronic device, and a storage medium.
Background
In recent years, pre-training models represented by the Bidirectional Encoder Representations from Transformers (BERT) model have provided a two-stage training paradigm of pre-training followed by fine-tuning, which greatly improves the effect of various natural language processing (NLP) tasks. The BERT model adopts a deep Transformer model structure, learns context-dependent representations from massive unsupervised text, and solves various NLP tasks such as text matching, text generation, sentiment classification, text summarization, question answering, and retrieval in a general, unified manner.
Pre-training refers to constructing self-supervised learning tasks, such as cloze (fill-in-the-blank) and sentence ordering, using massive unlabeled text as training data. Fine-tuning refers to performing task adaptation with a small amount of manually labeled task text as training data, so as to obtain a model for a specific natural language processing task.
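By way of illustration only, the following minimal Python sketch shows how a cloze-style self-supervised sample might be built from a piece of unlabeled text; the masking ratio, whitespace tokenization, and function names are assumptions for this sketch rather than part of the description above.

    import random

    MASK_TOKEN = "[MASK]"

    def build_cloze_sample(tokens, mask_ratio=0.15):
        # Randomly mask a fraction of tokens; the originals become prediction targets.
        masked = list(tokens)
        targets = {}
        for i, tok in enumerate(tokens):
            if random.random() < mask_ratio:
                targets[i] = tok          # token the model must recover
                masked[i] = MASK_TOKEN
        return masked, targets

    # An unlabeled sentence becomes (input with masks, positions to predict).
    tokens = "the hotel was quiet and very close to the station".split()
    masked_tokens, targets = build_cloze_sample(tokens)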
In the existing training process of the fine-tuning stage, manually labeled training data are used. However, manually labeled training data are expensive and often require experienced technical experts to label, so the acquisition cost of manually labeled training data in the existing fine-tuning stage is high and the acquisition efficiency is very low.
Disclosure of Invention
In order to solve the technical problems, the application provides a training data mining method, a training data mining device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a training data mining method, wherein the method includes:
collecting a plurality of pieces of unsupervised text serving as original data to form an original data set;
acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
and mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
According to another aspect of the present application, there is provided an apparatus for mining training data, wherein the apparatus includes:
the acquisition module is used for acquiring a plurality of pieces of unsupervised text serving as original data to form an original data set;
the acquisition module is used for acquiring a preset data screening rule set, wherein the data screening rule set comprises a plurality of preset data screening rules;
and the mining module is used for mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
According to still another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to yet another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the technology of the application, compared with manually labeled training data in the prior art, training data can be mined automatically and intelligently without manual labeling, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a schematic diagram of a first embodiment according to the present application;
FIG. 2 is a schematic diagram of a second embodiment according to the present application;
FIG. 3 is an exemplary diagram of the embodiment shown in FIG. 2;
FIG. 4 is a schematic semantic representation of the semantic representation model of the present application;
FIG. 5 is a training schematic of the semantic representation model of the present application;
FIG. 6 is a schematic diagram of a third embodiment according to the present application;
FIG. 7 is a schematic diagram of a fourth embodiment according to the present application;
fig. 8 is a block diagram of an electronic device for implementing a training data mining method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of a first embodiment according to the present application; as shown in fig. 1, the present application provides a training data mining method, which specifically includes the following steps:
s101, collecting a plurality of pieces of unsupervised text serving as original data to form an original data set;
s102, acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
s103, mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set.
The execution subject of the training data mining method of this embodiment may be a training data mining apparatus, which may be an electronic entity or an application integrated in software; in use, the apparatus runs on a computer device to mine training data.
In this embodiment, a large number of unsupervised texts can be collected from the network, with each unsupervised text corresponding to one piece of original data, so that a plurality of pieces of original data can be obtained and added to one data set to form the original data set. Alternatively, a large amount of unlabeled text in a certain field provided by the model user may be collected and added as original data to the original data set. The model user in this embodiment is the user of the model to be trained with the mined training data.
The data screening rules in the data screening rule set of this embodiment may be configured by the model user, who summarizes, based on his or her own experience, the rules that the training data to be mined should satisfy. That is, each preconfigured data screening rule may also be referred to as a manual prior rule.
Further, in this embodiment, a plurality of pieces of training data may be mined from the original data set according to each data screening rule in the data screening rule set to form a training data set, where each piece of mined training data conforms to a data screening rule. In this way, by applying each data screening rule in the data screening rule set, a plurality of pieces of training data can be mined from the original data set to form the training data set.
According to the training data mining method of this embodiment, an original data set is formed by collecting a plurality of pieces of unsupervised text as original data; a preconfigured data screening rule set is acquired, wherein the data screening rule set comprises a plurality of preconfigured data screening rules; and a plurality of pieces of training data are mined from the original data set according to each data screening rule in the data screening rule set to form a training data set. Compared with manually labeled training data in the prior art, training data can be mined automatically and intelligently without manual labeling, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.
FIG. 2 is a schematic diagram of a second embodiment according to the present application. The training data mining method of this embodiment further describes the technical solution of the present application in more detail on the basis of the embodiment shown in fig. 1. As shown in fig. 2, the training data mining method of this embodiment may specifically include the following steps:
s201, collecting a plurality of pieces of unsupervised text serving as original data to form an original data set U;
That is, in this embodiment, each piece of original data corresponds to one piece of unsupervised text, and both the original data and the training data in this embodiment are text data.
S202, acquiring a preconfigured data screening rule set P, wherein the data screening rule set P comprises a plurality of preconfigured data screening rules;
Optionally, in this embodiment, each data screening rule may be expressed as a regular expression, and/or each data screening rule carries a corresponding label. The label in this embodiment is used to label the screened training data; for example, in a sentiment analysis task, it may serve as a label of sentiment tendency, such as positive evaluation or negative evaluation. That is, in this embodiment, each data screening rule in the data screening rule set is used to screen supervised labeled data from the original data set as training data.
For example, in a sentiment analysis task, when mining training data about users' evaluations of hotel stays, the data screening rules for positive evaluation may be set to rules including "quite quiet", "quite affordable", "will come again", "breakfast/lunch/dinner was very good", or "very close to [place], very convenient", and so on, while the data screening rules for the corresponding negative evaluation may be set to rules including "will not book again", "bad accommodation", "too noisy", "price [number] yuan, too expensive", and so on. In practical applications, data screening rules for the corresponding training data of various NLP tasks such as semantic matching, machine translation and dialogue understanding can be configured according to similar rules, and they are not described in detail here.
S203, mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set P to form a training data set A;
Specifically, when training data are mined from the original data set using the data screening rules, it can be judged whether a piece of original data hits a data screening rule; if it hits, the original data can be extracted as training data, and the label of the data screening rule can be used as the label of the training data. For example, the piece of original data "I liked the hotel very much, quite quiet" hits the positive-evaluation data screening rule containing "quite quiet"; in this case, the original data and the corresponding label "positive evaluation" together form a piece of training data. In a similar manner, each data screening rule in the data screening rule set can be used to mine a plurality of pieces of training data from the original data set to form the training data set A.
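A minimal sketch of this rule-based mining step in Python, assuming regular-expression rules that each carry a label; the rule texts, function names, and example reviews below are illustrative assumptions rather than the patent's configured rules.

    import re

    # Illustrative rule set: (regex pattern, label) pairs configured from prior experience.
    data_screening_rules = [
        (re.compile(r"quite quiet|quite affordable|will come again"), "positive evaluation"),
        (re.compile(r"too noisy|bad accommodation|will not book again"), "negative evaluation"),
    ]

    def mine_training_data(raw_dataset, rules):
        # Return (text, label) pairs for every raw text that hits a rule.
        training_set = []
        for text in raw_dataset:
            for pattern, label in rules:
                if pattern.search(text):
                    training_set.append((text, label))
                    break  # one hit is enough to extract this text as training data
        return training_set

    raw_dataset = ["I liked the hotel very much, quite quiet",
                   "The room was too noisy at night"]
    training_set_a = mine_training_data(raw_dataset, data_screening_rules)
    # [('I liked the hotel very much, quite quiet', 'positive evaluation'),
    #  ('The room was too noisy at night', 'negative evaluation')]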
Optionally, in order to ensure the accuracy of the mined training data, manual verification may be used here to check the mined training data, so as to ensure the accuracy of the training data in the training data set A.
S204, acquiring, from the original data set U, the similar data most similar to each piece of training data among the plurality of pieces of training data by using a pre-trained semantic representation model and an approximate nearest neighbor (ANN) retrieval algorithm, and adding the similar data to the training data set A as extended training data;
The pre-trained semantic representation model adopted in this embodiment may be a semantic representation model obtained through pre-training; it may be trained on a large amount of unlabeled data so that it can perform semantic representation accurately. For example, the semantic representation in this embodiment may take the form of a vector.
For each piece of training data in the obtained training data set A, the semantic representation model is used to obtain its semantic representation; it should be noted that when obtaining the semantic representation here, only the data portion of the training data other than the label is represented semantically. Then, the semantic representation of each piece of original data in the original data set U is obtained with the semantic representation model. At this point, each piece of training data and each piece of original data in the original data set U can be represented as a vector, and then, based on vector similarity, an ANN retrieval algorithm can be used to find the similar data in the original data set U most similar to each piece of training data. For each piece of training data, one piece of most similar data can be obtained in this way, yielding a plurality of pieces of similar data in total. These most similar data are added to the training data set A as extended training data, thereby extending and enriching the training data set A.
Further optionally, considering that the most similar data of some training data may still not be very similar to that training data, adding such data to the training data set A could reduce the accuracy of the added training data. Therefore, before the most similar data of each piece of training data is added to the training data set A as extended training data, it can first be judged whether the similarity between the most similar data and the corresponding training data is greater than a preset similarity threshold; if so, the most similar data is added to the training data set A as extended training data; otherwise, the most similar data is discarded and not added to the training data set A.
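A minimal sketch of step S204 together with the similarity-threshold check, assuming an embed function that maps a list of texts to vectors with the semantic representation model; brute-force cosine search stands in for the ANN index used in practice, and the threshold value is an assumption.

    import numpy as np

    def expand_with_nearest_neighbors(train_texts, raw_texts, embed, sim_threshold=0.8):
        # embed(texts) is assumed to return an (n, d) matrix of semantic vectors.
        train_vecs = embed(train_texts)            # (n_train, d)
        raw_vecs = embed(raw_texts)                # (n_raw, d)
        train_vecs = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
        raw_vecs = raw_vecs / np.linalg.norm(raw_vecs, axis=1, keepdims=True)
        sims = train_vecs @ raw_vecs.T             # pairwise cosine similarities
        expanded = []
        for i in range(len(train_texts)):
            j = int(np.argmax(sims[i]))            # most similar piece of original data
            if sims[i, j] > sim_threshold:         # discard weak matches
                expanded.append(raw_texts[j])
        return expanded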
Further optionally, in this embodiment, manual verification may also be adopted to check each piece of extended training data added to the training data set A, so as to ensure the accuracy of the newly added extended training data.
For example, FIG. 3 is an exemplary diagram of the embodiment shown in FIG. 2. Taking the mining of training data required for a sentiment analysis task as an example, fig. 3 shows the flow of mining training data with the method of steps S201 to S204 of this embodiment. As shown in fig. 3, a plurality of pieces of original data are collected to form the original data set U. Each data screening rule in the data screening rule set P is configured by a person skilled in the art based on experience. Then, each data screening rule is used to screen a plurality of pieces of training data from the original data set U to form the training data set A. As shown in fig. 3, the training data set A obtained at this point includes training data for positive evaluation and training data for negative evaluation. Then, the semantic representation model and the ANN retrieval algorithm are used to retrieve, from the original data set U, the data most similar to each piece of training data in the training data set A, and the most similar data whose similarity is greater than the preset similarity threshold are added to the training data set A as extended training data, obtaining the final extended training data set A.
Step S204 is an extension method of the training data set A that can add more accurate training data to the training data set A. Compared with the training data obtained in step S203 alone, it can further enrich the training data in the training data set A while effectively ensuring the accuracy of the added training data.
S205, training a target model M by adopting each training data in the training data set A;
S206, predicting, by using the target model M, the label and the prediction probability of each piece of original data in the remaining data set, namely the original data set U excluding the training data set A;
S207, according to each piece of original data in the remaining data set, its label and corresponding prediction probability, and a preset probability threshold, mining from the remaining data set the original data whose prediction probability is greater than the preset probability threshold, and adding the original data together with its label to the training data set A as extended training data.
Steps S205-S207 of this embodiment are one optional way of extending the training data set A. Optionally, steps S205-S207 may be performed after step S204. Steps S205-S207 may also be combined directly with steps S201-S203, without step S204, to constitute an alternative embodiment of the present application.
Since the training data set a obtained in the step S204 already includes more accurate training data, the training data set a may be used to train a target model M of a corresponding task. At this time, the target model M may be used to predict each original data in the remaining data sets except the training data set a in the original data set U, and predict the label and the prediction probability of each original data, where the prediction probability indicates the probability that the original data belongs to the label. For example, in the emotion analysis task described above, the trained target model M can predict whether each original data in the remaining data set tends to be positively evaluated or negatively evaluated, and what the corresponding prediction probability is. And then, the original data with the predicted probability larger than the preset probability threshold and the corresponding labels can be further combined with the preset probability threshold to serve as extended training data, and the extended training data are added into the training data set A. In a similar manner, a plurality of pieces of extended training data can be obtained and added into the training data set A together, so that the extension of the training data set A is realized. In this embodiment, the original data with the prediction probability greater than the preset probability threshold is considered as the predicted high confidence data, and the corresponding original data and the corresponding tag can be used together as one piece of extended training data. The accuracy of the extended training data acquired by the method is very high, and the quality of the training data in the training data set A can be effectively ensured.
Further optionally, in this embodiment, new extended training data may continue to be added using step S207, and the extended training data set A may be used to train the target model M again, until the accuracy of the target model M reaches a preset accuracy threshold. In this way, the target model M is further trained with the richer and more comprehensive extended training data set A, and its accuracy can be further improved.
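A minimal sketch of the expansion in steps S205-S207, assuming a target model whose predict method returns a label and its probability for one text; this interface, the threshold value, and the function name are assumptions for illustration only.

    def expand_by_self_training(model, remaining_texts, prob_threshold=0.95):
        # Pseudo-label the remaining raw data with the trained target model M and
        # keep only the high-confidence predictions as extended training data.
        expanded = []
        for text in remaining_texts:
            label, prob = model.predict(text)   # assumed to return (label, probability)
            if prob > prob_threshold:           # high-confidence data only
                expanded.append((text, label))
        return expanded

The returned (text, label) pairs are appended to the training data set A and the target model M is retrained, repeating until its accuracy reaches the preset accuracy threshold.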
For example, in this embodiment, optionally, the accuracy of the extended training data may also be checked manually, so as to ensure the accuracy of the training data added to the training data set a.
Fig. 4 is a schematic semantic representation of the semantic representation model of the present application. As shown in fig. 4, the semantic representation model employed in step S204 of this embodiment serves as a vector generator for generating a vector representation of a piece of text data. When performing semantic representation, the text in the text data needs to be segmented into a plurality of tokens, such as T1, ..., TN in fig. 4. When inputting them into the semantic representation model, the start token CLS, the tokens T1, ..., TN, and the separator SEP need to be input in sequence. The multiple Transformer layers in the semantic representation model then jointly encode all of the input information. When performing semantic representation, the average representation of the top N layers of the semantic representation model may be used as the semantic representation of the corresponding text data, where N may be set to 1, 2, 3, or another positive integer according to actual requirements.
For example, when N is 1, the semantic representation of the CLS position in the topmost layer, the maximum (max-pooled) semantic representation over T1, ..., TN, and the semantic representation of the SEP position may be combined as the final semantic representation of the text data, i.e. the vector representation of the text data. The weight of each part can be preset according to actual requirements.
When N takes another value, such as N=3, the semantic representation of the text data in each of the top N layers may be obtained in the same manner as for N=1, and then the semantic representations of the text data in these N layers are averaged to obtain the final semantic representation.
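A minimal sketch of this pooling, assuming hidden_states is a list of per-layer (sequence_length, dim) tensors for the input [CLS] T1 ... TN [SEP]; the use of PyTorch, the equal weights, and the function name are assumptions for this sketch.

    import torch

    def sentence_vector(hidden_states, top_n=3, w_cls=1.0, w_max=1.0, w_sep=1.0):
        layer_vecs = []
        for layer in hidden_states[-top_n:]:         # top-N Transformer layers
            cls_vec = layer[0]                       # CLS position
            sep_vec = layer[-1]                      # SEP position
            max_vec = layer[1:-1].max(dim=0).values  # max over the token positions T1..TN
            layer_vecs.append(w_cls * cls_vec + w_max * max_vec + w_sep * sep_vec)
        return torch.stack(layer_vecs).mean(dim=0)   # average over the top N layers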
Furthermore, in order to make the semantic representation model perform better in ANN retrieval, the present application may also train the semantic representation model with weakly supervised aggregation information from articles. FIG. 5 is a training schematic of the semantic representation model of the present application. As shown in FIG. 5, two articles may be selected during training; suppose article A includes paragraphs D1, D2, D3 and article B includes paragraphs D4, D5, D6, D7. Specifically, the semantic representation of each paragraph, namely a semantic vector, can be generated by the method in fig. 4. During training, the cosine similarity of any two paragraphs is calculated; let the cosine similarity of paragraphs from the same article be Loss+ and the cosine similarity of paragraphs from different articles be Loss-. The training objective is to make the difference between the average of Loss+ and the average of Loss- as large as possible, where the similarity of a paragraph with itself is 1 and is not counted in Loss+. That is, training aims to make paragraphs in the same article as similar as possible and paragraphs in different articles as dissimilar as possible. Although fig. 5 takes two articles as a group of training samples, in practical application N articles may be selected as a group of training samples, where N may be a positive integer greater than 2; in a similar manner, the semantic representation model is trained so that paragraphs in the same article are as similar as possible and paragraphs in different articles are as dissimilar as possible. Training the semantic representation model in this way enables more accurate semantic representation. Optionally, the above scheme takes paragraphs as the granularity; in practical application, sentences may also be used as the granularity, with the cosine similarity of sentences in the same article as Loss+ and that of sentences in different articles as Loss-. The optimization goal during training is still to make the difference between the average of Loss+ and the average of Loss- as large as possible, and the training principle is the same.
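A minimal sketch of this weakly supervised objective, assuming each paragraph has already been encoded into a vector by the model of fig. 4; the exact loss form below is an assumption that follows the Loss+/Loss- description above.

    import torch
    import torch.nn.functional as F

    def article_contrastive_loss(paragraph_vecs, article_ids):
        # paragraph_vecs: (n, d) tensor; article_ids: source article of each paragraph.
        vecs = F.normalize(paragraph_vecs, dim=1)
        sims = vecs @ vecs.t()                       # pairwise cosine similarities
        pos, neg = [], []
        n = len(article_ids)
        for i in range(n):
            for j in range(i + 1, n):                # skip i == j (self-similarity of 1)
                if article_ids[i] == article_ids[j]:
                    pos.append(sims[i, j])           # same article: contributes to Loss+
                else:
                    neg.append(sims[i, j])           # different articles: contributes to Loss-
        loss_pos = torch.stack(pos).mean()
        loss_neg = torch.stack(neg).mean()
        return loss_neg - loss_pos                   # minimizing this maximizes the gap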
Furthermore, by using the training data mining method described above, training data for various corresponding NLP tasks can be mined, and further task processing can be performed based on the semantic representation model, converting the semantic representation model into task models for sentiment analysis, semantic matching, question-answer matching, entity labeling and the like, thereby reducing the cost of using the semantic representation model.
The above embodiments mostly take the sentiment analysis task in the NLP field as an example to describe the technical solution of the present application. In practical applications, the technical solution of this embodiment is also suitable for mining training data for tasks such as semantic matching, question-answer matching and entity labeling in the NLP field; for details, reference may be made to the descriptions of the above embodiments, which are not repeated here.
According to the training data mining method of this embodiment, no technical expert is required to manually label training data: a training data set can be mined through the preconfigured data screening rules, and the training data set can be extended by the two extension methods described above. In addition, the accuracy of the extended training data obtained in the manner of this embodiment is very high, which effectively guarantees the quality of the training data. Therefore, the technical solution of this embodiment can mine training data automatically and intelligently, effectively reduce the acquisition cost of training data, and improve the acquisition efficiency of training data.
FIG. 6 is a schematic diagram of a third embodiment according to the present application; as shown in fig. 6, the training data mining apparatus 600 provided in this embodiment includes:
the acquisition module 601 is configured to acquire a plurality of pieces of unsupervised text serving as original data, to form an original data set;
an obtaining module 602, configured to obtain a preconfigured data filtering rule set, where the data filtering rule set includes a plurality of preconfigured data filtering rules;
the mining module 603 is configured to mine a plurality of training data from the original data set according to each data filtering rule in the data filtering rule set, so as to form a training data set.
The training data mining apparatus 600 of this embodiment uses the above modules to implement the mining of training data. Its implementation principle and technical effects are the same as those of the related method embodiments above; for details, reference may be made to the descriptions of the related method embodiments, which are not repeated here.
FIG. 7 is a schematic diagram of a fourth embodiment according to the application; as shown in fig. 7, the training data mining apparatus 600 according to the present embodiment further describes the technical scheme of the present application in more detail on the basis of the technical scheme of the embodiment shown in fig. 6.
As shown in fig. 7, the training data mining apparatus 600 of the present embodiment further includes:
the retrieval module 604 is configured to acquire, from the original data set, similar data closest to each training data in the plurality of training data by using a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
the expansion module 605 is configured to add each closest similar data as expanded training data into the training data set.
Further alternatively, the training data mining apparatus 600 of the present embodiment further includes:
the determining module 606 is configured to determine and determine that the similarity between each of the most similar data and the corresponding training data is greater than a preset similarity threshold.
Further alternatively, the training data mining apparatus 600 of the present embodiment further includes:
a training module 607 for training the target model using each training data in the training data set.
Further optionally, the training data mining apparatus 600 of this embodiment further includes a prediction module 608;
a prediction module 608, configured to predict, by using the target model, labels and prediction probabilities of the original data in the remaining data set of the original data set other than the training data set;
the mining module 603 is further configured to mine, from the remaining data set according to each piece of original data, its label and corresponding prediction probability, and a preset probability threshold, the original data whose prediction probability is greater than the preset probability threshold, and to add the original data together with its label to the training data set as extended training data.
Further optionally, the training module 607 is further configured to train the target model again using the extended training data set until the accuracy of the target model reaches a preset accuracy threshold.
The training data mining apparatus 600 of this embodiment uses the above modules to implement the mining of training data. Its implementation principle and technical effects are the same as those of the related method embodiments above; for details, reference may be made to the descriptions of the related method embodiments, which are not repeated here.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
Fig. 8 is a block diagram of an electronic device implementing a training data mining method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 8, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple electronic devices may be connected, with each device providing some of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 801 is taken as an example in fig. 8.
Memory 802 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the training data mining method provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to execute the mining method of training data provided by the present application.
The memory 802, as a non-transitory computer readable storage medium, can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the training data mining method in the embodiments of the present application (e.g., the related modules shown in fig. 6 and fig. 7). The processor 801 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the training data mining method in the above method embodiments.
The memory 802 may include a program storage area and a data storage area; the program storage area may store an operating system and application programs required by at least one function, and the data storage area may store data created according to the use of the electronic device implementing the training data mining method, and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 may optionally include memory remotely located with respect to the processor 801, and such remote memory may be connected via a network to the electronic device implementing the training data mining method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for implementing the training data mining method may further include: an input device 803 and an output device 804. The processor 801, memory 802, input devices 803, and output devices 804 may be connected by a bus or other means, for example in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device implementing the training data mining method, and may be an input device such as a touch screen, keypad, mouse, trackpad, touch pad, pointing stick, one or more mouse buttons, trackball, or joystick. The output device 804 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, an original data set is formed by collecting a plurality of pieces of unsupervised text as original data; a preconfigured data screening rule set is acquired, wherein the data screening rule set comprises a plurality of preconfigured data screening rules; and a plurality of pieces of training data are mined from the original data set according to each data screening rule in the data screening rule set to form a training data set. Compared with manually labeled training data in the prior art, training data can be mined automatically and intelligently without manual labeling, which effectively reduces the acquisition cost of training data and improves the acquisition efficiency of training data.
According to the technical solution provided by the embodiments of the present application, no technical expert is required to manually label training data: a training data set can be mined through the preconfigured data screening rules, and the training data set can be extended by the two extension methods described above. In addition, the accuracy of the extended training data obtained in the manner of these embodiments is very high, which effectively guarantees the quality of the training data. Therefore, the technical solution of these embodiments can mine training data automatically and intelligently, effectively reduce the acquisition cost of training data, and improve the acquisition efficiency of training data.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (8)

1. A method of mining training data, wherein the method comprises:
collecting a plurality of pieces of unsupervised text serving as original data to form an original data set;
acquiring a preconfigured data screening rule set, wherein the data screening rule set comprises a plurality of preconfigured data screening rules;
according to each data screening rule in the data screening rule set, mining a plurality of pieces of training data from the original data set to form a training data set;
the method further comprises the steps of:
training a target model by adopting each training data in the training data set;
predicting, by using the target model, labels and prediction probabilities of the original data in the remaining data set of the original data set other than the training data set;
according to each piece of original data in the remaining data set, its label and corresponding prediction probability, and a preset probability threshold, mining from the remaining data set the original data whose prediction probability is greater than the preset probability threshold, and adding the original data together with its label to the training data set as extended training data;
and training the target model again by adopting the expanded training data set until the accuracy of the target model reaches a preset accuracy threshold.
2. The method according to claim 1, wherein, after mining the plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set, the method further comprises:
acquiring similar data closest to each training data in the plurality of training data from the original data set by adopting a pre-trained semantic representation model and an approximate nearest neighbor search algorithm;
and adding each nearest similar data as extended training data into the training data set.
3. The method according to claim 2, wherein, before adding each piece of most similar data as extended training data into the training data set, the method further comprises:
and judging and determining that the similarity between each nearest similar data and the corresponding training data is larger than a preset similarity threshold value.
4. A training data mining apparatus, wherein the apparatus comprises:
the acquisition module is used for acquiring a plurality of pieces of unsupervised text serving as original data to form an original data set;
the acquisition module is used for acquiring a preset data screening rule set, wherein the data screening rule set comprises a plurality of preset data screening rules;
the mining module is used for mining a plurality of pieces of training data from the original data set according to each data screening rule in the data screening rule set to form a training data set;
the apparatus further comprises:
the training module is used for training a target model by adopting each training data in the training data set;
the prediction module is used for predicting, by using the target model, labels and prediction probabilities of the original data in the remaining data set of the original data set other than the training data set;
the mining module is further configured to mine, from the remaining data set, the original data with the prediction probability greater than the preset probability threshold according to each original data, the label of each original data, the corresponding prediction probability, and the preset probability threshold, and use the original data and the label of the original data together as extended training data, and add the extended training data into the training data set;
the training module is further configured to retrain the target model using the extended training data set until the accuracy of the target model reaches a preset accuracy threshold.
5. The apparatus of claim 4, wherein the apparatus further comprises:
the retrieval module is used for acquiring similar data closest to each training data in the plurality of pieces of training data from the original data set by adopting a pre-trained semantic representation model and an approximate nearest neighbor retrieval algorithm;
and the expansion module is used for taking each nearest similar data as expanded training data and adding the expanded training data into the training data set.
6. The apparatus of claim 5, wherein the apparatus further comprises:
the judging module is used for judging and determining that the similarity between each nearest similar data and the corresponding training data is larger than a preset similarity threshold value.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3.
8. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-3.
CN202010576205.XA 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium Active CN111859953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576205.XA CN111859953B (en) 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010576205.XA CN111859953B (en) 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111859953A CN111859953A (en) 2020-10-30
CN111859953B true CN111859953B (en) 2023-08-22

Family

ID=72988022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576205.XA Active CN111859953B (en) 2020-06-22 2020-06-22 Training data mining method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111859953B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465023B (en) * 2020-11-27 2021-06-18 自然资源部第一海洋研究所 Method for expanding training data of geological direction artificial neural network
CN112685536B (en) * 2020-12-25 2024-06-07 中国平安人寿保险股份有限公司 Text dialogue method, text dialogue device, electronic equipment and storage medium
CN112784050A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Method, device, equipment and medium for generating theme classification data set
CN113656575B (en) * 2021-07-13 2024-02-02 北京搜狗科技发展有限公司 Training data generation method and device, electronic equipment and readable medium
CN115640808B (en) * 2022-12-05 2023-03-21 苏州浪潮智能科技有限公司 Text labeling method and device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN110610197A (en) * 2019-08-19 2019-12-24 北京迈格威科技有限公司 Method and device for mining difficult sample and training model and electronic equipment
WO2020019252A1 (en) * 2018-07-26 2020-01-30 深圳前海达闼云端智能科技有限公司 Artificial intelligence model training method and device, storage medium and robot
CN111046952A (en) * 2019-12-12 2020-04-21 深圳市随手金服信息科技有限公司 Method and device for establishing label mining model, storage medium and terminal
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678816B2 (en) * 2017-08-23 2020-06-09 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
US10902207B2 (en) * 2018-09-13 2021-01-26 International Business Machines Corporation Identifying application software performance problems using automated content-based semantic monitoring

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
WO2020019252A1 (en) * 2018-07-26 2020-01-30 深圳前海达闼云端智能科技有限公司 Artificial intelligence model training method and device, storage medium and robot
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN110610197A (en) * 2019-08-19 2019-12-24 北京迈格威科技有限公司 Method and device for mining difficult sample and training model and electronic equipment
CN111046952A (en) * 2019-12-12 2020-04-21 深圳市随手金服信息科技有限公司 Method and device for establishing label mining model, storage medium and terminal
CN111309912A (en) * 2020-02-24 2020-06-19 深圳市华云中盛科技股份有限公司 Text classification method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yihao et al., "Hybrid Recommendation Method Based on Deep Sentiment Analysis of User Reviews and Multi-View Collaborative Fusion", Chinese Journal of Computers, 2019, Vol. 42, No. 6, pp. 1316-1333. *

Also Published As

Publication number Publication date
CN111859953A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111859953B (en) Training data mining method and device, electronic equipment and storage medium
CN110717339B (en) Semantic representation model processing method and device, electronic equipment and storage medium
CN112507715B (en) Method, device, equipment and storage medium for determining association relation between entities
US20220383190A1 (en) Method of training classification model, method of classifying sample, and device
CN112560479B (en) Abstract extraction model training method, abstract extraction device and electronic equipment
CN111783468B (en) Text processing method, device, equipment and medium
US20220067439A1 (en) Entity linking method, electronic device and storage medium
CN111667056B (en) Method and apparatus for searching model structures
KR102565673B1 (en) Method and apparatus for generating semantic representation model,and storage medium
CN111859982B (en) Language model training method and device, electronic equipment and readable storage medium
CN111737994A (en) Method, device and equipment for obtaining word vector based on language model and storage medium
CN111079442A (en) Vectorization representation method and device of document and computer equipment
CN112148881B (en) Method and device for outputting information
CN110717340B (en) Recommendation method, recommendation device, electronic equipment and storage medium
CN111539209B (en) Method and apparatus for entity classification
CN111797216B (en) Search term rewriting method, apparatus, device and storage medium
CN112528001B (en) Information query method and device and electronic equipment
CN112507702B (en) Text information extraction method and device, electronic equipment and storage medium
CN112329453B (en) Method, device, equipment and storage medium for generating sample chapter
US20230094730A1 (en) Model training method and method for human-machine interaction
CN112232089B (en) Pre-training method, device and storage medium of semantic representation model
CN113312451B (en) Text label determining method and device
CN111930916B (en) Dialog generation method and device, electronic equipment and storage medium
CN111597224B (en) Method and device for generating structured information, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant