CN118035751A - Data construction method and device for large language model fine tuning training - Google Patents

Data construction method and device for large language model fine tuning training

Info

Publication number
CN118035751A
CN118035751A (application CN202410439446.8A)
Authority
CN
China
Prior art keywords
self
generated
instructions
instruction
language model
Prior art date
Legal status
Pending
Application number
CN202410439446.8A
Other languages
Chinese (zh)
Inventor
代季峰 (Jifeng Dai)
宁雪妃 (Xuefei Ning)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202410439446.8A priority Critical patent/CN118035751A/en
Publication of CN118035751A publication Critical patent/CN118035751A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a data construction method and device for fine-tuning training of a large language model. The method comprises the following steps: obtaining a self-generated instruction set by using a first large language model based on a seed instruction set; determining the actual category of each self-generated instruction; clustering, by actual category, the subset of self-generated instructions that meets a preset requirement to obtain a plurality of clusters of self-generated instructions; screening the self-generated instructions that meet a predetermined criterion in each of the plurality of clusters to obtain a screening instruction set; and constructing a fine-tuning training dataset for the pre-trained large language model based on the labeled screening instruction set. According to embodiments of the invention, instruction data can be generated automatically by a large language model, and a high-quality training dataset is obtained by screening and deduplicating the self-generated instruction data for use in fine-tuning training of the large language model, which reduces the time and labor cost of constructing training data and makes the constructed data more practical and diverse.

Description

Data construction method and device for large language model fine tuning training
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data construction method and device for fine tuning training of a large language model.
Background
With the rapid development of deep-learning technology, the requirements placed on large language models have grown: models are no longer designed for a single domain but are expected to solve general tasks.
In the instruction fine-tuning stage of training a large language model, a large amount (for example, hundreds of thousands) of high-quality multi-domain dialogue data is needed to fine-tune the model, so quality control of the dialogue data is particularly important.
However, in the related art, manually written instruction fine-tuning data are limited in quantity, diversity, and creativity. The resulting dialogue data are therefore highly repetitive and contain much low-quality data and useless information, so the quality of the large language model's training data cannot be guaranteed, which affects the safety and applicability of the trained model; moreover, the labor cost is high. This problem needs to be solved.
Disclosure of Invention
The invention provides a data construction method and device for fine-tuning training of a pre-trained large language model. It addresses the problems in the related art that manually written instruction fine-tuning data are limited in quantity, diversity, and creativity, so that the resulting dialogue data are highly repetitive and contain much low-quality data and useless information; that the quality of the training data therefore cannot be guaranteed, which affects the safety and applicability of the trained large language model; and that the labor cost is high.
An embodiment of the first aspect of the present invention provides a data construction method for fine-tuning training of a pre-trained large language model, including the steps of: inputting a plurality of seed instructions in a seed instruction set into a first large language model to obtain a self-generated instruction set; classifying each self-generated instruction in the self-generated instruction set by using a preset neural network to determine the actual category of each self-generated instruction, and determining a self-generated instruction subset of which the actual category meets the preset requirement; clustering the self-generated instruction subsets meeting the preset requirements to obtain a plurality of clusters of the self-generated instructions; screening self-generated instructions meeting a predetermined criterion in each of the plurality of clusters to obtain a screening instruction set; and obtaining a screening instruction set after manual labeling, and constructing a fine tuning training data set of the pre-trained large language model based on the screening instruction set after labeling.
Optionally, in one embodiment of the present invention, the inputting the plurality of seed instructions in the seed instruction set into the first large language model to obtain the self-generated instruction set includes: randomly extracting a first number of seed instructions from the seed instruction set to generate a first sample set; inputting the first sample set into the first large language model, and updating the self-generated instruction set with the obtained instructions; randomly extracting a first number of seed instructions from the seed instruction set and randomly extracting a second number of instructions from the self-generated instruction set to generate a second sample set; and inputting the second sample set into the first large language model and updating the self-generated instruction set with the obtained instructions.
Optionally, in one embodiment of the present invention, the inputting the plurality of seed instructions in the seed instruction set into the first large language model to obtain the self-generated instruction set further includes: calculating the similarity between each self-generated instruction and the other self-generated instructions in the self-generated instruction set; and deleting, from the self-generated instruction set, any initial self-generated instruction whose similarity measure with respect to any one of the other self-generated instructions is less than or equal to a preset similarity threshold, so as to update the self-generated instruction set.
Optionally, in an embodiment of the present invention, the clustering the subset of self-generated instructions meeting the preset requirement to obtain a plurality of clusters of self-generated instructions includes: inputting the self-generated instruction subset meeting the preset requirement into a second large language model for vectorization processing to obtain a feature vector set corresponding to the self-generated instruction subset; and clustering the feature vector set by adopting a clustering algorithm to obtain a plurality of clusters of self-generated instructions.
Optionally, in one embodiment of the present invention, the filtering the self-generated instruction in each cluster of the plurality of clusters to obtain a set of filtering instructions includes: scoring the self-generated instructions corresponding to each of the plurality of clusters using a reward model; the set of screening instructions is generated using self-generated instructions that score above a threshold.
Optionally, in one embodiment of the present invention, the calculating the similarity between each self-generated instruction in the set of self-generated instructions and the other self-generated instructions includes: mapping each self-generated instruction into a plurality of hash values using a plurality of hash functions; and calculating the Jaccard distance between the hash values of the self-generated instructions as the similarity measure.
An embodiment of the second aspect of the present invention provides a data construction apparatus for fine-tuning training of a pre-trained large language model, including: an acquisition module for inputting a plurality of seed instructions in a seed instruction set into a first large language model to acquire a self-generated instruction set; the classification module is used for classifying each self-generated instruction in the self-generated instruction set by using a preset neural network so as to determine the actual category of each self-generated instruction and determine a self-generated instruction subset of which the actual category meets the preset requirement; the clustering module is used for clustering the self-generated instruction subsets meeting the preset requirements to obtain a plurality of clusters of the self-generated instructions; a screening module, configured to screen self-generated instructions meeting a predetermined criterion in each cluster of the plurality of clusters to obtain a screening instruction set; the construction module is used for obtaining the screening instruction set after manual labeling and constructing a fine tuning training data set of the pre-trained large language model based on the screening instruction set after labeling.
Optionally, in one embodiment of the present invention, the acquiring module includes: a first extraction unit, configured to randomly extract a first number of seed instructions from the seed instruction set to generate a first sample set; a first updating unit, configured to input the first sample set into the first large language model, and update the self-generated instruction set with the obtained instructions; a second extraction unit for randomly extracting a first number of seed instructions from the seed instruction set and randomly extracting a second number of instructions from the self-generated instruction set to generate a second sample set; and a second updating unit for inputting the second sample set into the first large language model and updating the self-generated instruction set with the obtained instructions.
Optionally, in one embodiment of the present invention, the acquiring module further includes: a calculation unit for calculating the similarity between each self-generated instruction and the other self-generated instructions in the self-generated instruction set; and a third updating unit, configured to delete from the self-generated instruction set any initial self-generated instruction whose similarity measure with respect to any one of the other self-generated instructions is less than or equal to a preset similarity threshold, so as to update the self-generated instruction set.
Optionally, in one embodiment of the present invention, the clustering module includes: the vectorization unit is used for inputting the self-generated instruction subset meeting the preset requirement into a second large language model for vectorization processing so as to obtain a feature vector set corresponding to the self-generated instruction subset; and the clustering unit is used for clustering the feature vector set by adopting a clustering algorithm to obtain a plurality of clusters of the self-generated instruction.
Optionally, in one embodiment of the present invention, the screening module includes: a scoring unit configured to score a self-generated instruction corresponding to each of the plurality of clusters using a reward model; and the generation unit is used for generating the screening instruction set by utilizing the self-generated instructions with scores exceeding a threshold value.
Optionally, in one embodiment of the present invention, the computing unit is specifically configured to: map each self-generated instruction into a plurality of hash values using a plurality of hash functions; and calculate the Jaccard distance between the hash values of the self-generated instructions as the similarity measure.
An embodiment of a third aspect of the present invention provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the data construction method for fine-tuning training of a pre-trained large language model as described in the above embodiments.
A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements a data construction method for fine-tuning training of a pre-trained large language model as above.
An embodiment of a fifth aspect of the present invention provides a computer program product comprising a computer program which, when executed by a processor, implements the data construction method for fine-tuning training of a pre-trained large language model as above.
According to embodiments of the invention, instruction data can be generated automatically by a large language model, and a high-quality training dataset is obtained by screening and deduplicating the self-generated instruction data for use in fine-tuning training, which improves the application performance of the large language model, reduces the time and labor cost of constructing training data, and makes the constructed data more practical and diverse. This solves the problems in the related art that manually written instruction fine-tuning data are limited in quantity, diversity, and creativity, so that the resulting dialogue data are highly repetitive and contain much low-quality data and useless information, the quality of the training data cannot be guaranteed, the safety and applicability of the trained large language model are affected, and the labor cost is high.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a data construction method for fine-tuning training of a pre-trained large language model, provided in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a first portion of an architecture of a categorized neural network according to one embodiment of the invention;
FIG. 3 is a schematic diagram of a second portion of the architecture of a categorized neural network according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a data construction flow of a large language model fine tuning training according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data construction apparatus for fine-tuning training of a pre-trained large language model according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The data construction method and apparatus for fine-tuning training of a pre-trained large language model according to embodiments of the present invention are described below with reference to the accompanying drawings. As noted in the background, manually written instruction fine-tuning data in the related art are limited in quantity, diversity, and creativity: the resulting dialogue data are highly repetitive and contain much low-quality data and useless information, the quality of the training data cannot be guaranteed, the safety and applicability of the trained large language model are affected, and the labor cost is high. To address these problems, the invention provides a data construction method for fine-tuning training of a pre-trained large language model.
Generally, the training process of a large language model can be divided into two stages: a pre-training stage, which trains on massive text (e.g., on the order of 10T tokens), and an instruction fine-tuning stage, which fine-tunes the pre-trained large language model on a large amount (e.g., hundreds of thousands) of high-quality multi-domain dialogue data.
Specifically, fig. 1 is a schematic flow chart of a data construction method for fine-tuning training of a pre-trained large language model according to an embodiment of the present invention.
As shown in fig. 1, the data construction method for fine-tuning training of a pre-trained large language model includes the steps of:
In step S101, a plurality of seed instructions in a seed instruction set are input into a first large language model to obtain a self-generated instruction set.
It will be appreciated that in embodiments of the present invention, the seed instructions (samples) may be a number (e.g., 100 or 180) of carefully selected, manually written instructions, and new instructions may be generated iteratively by a pre-built first large language model using its context-awareness capability, resulting in a self-generated instruction set.
Optionally, in one embodiment of the present invention, inputting a plurality of seed instructions in a seed instruction set into the first large language model to obtain the self-generated instruction set includes: randomly extracting a first number of seed instructions from the seed instruction set to generate a first sample set; inputting the first sample set into the first large language model, and updating the self-generated instruction set with the obtained instructions; randomly extracting a first number of seed instructions from the seed instruction set and randomly extracting a second number of instructions from the self-generated instruction set to generate a second sample set; and inputting the second sample set into the first large language model and updating the self-generated instruction set with the obtained instructions.
It should be appreciated that the initial self-generated instruction set is empty, so at first only seed instructions from the seed instruction set can be extracted as samples. As self-generated instructions accumulate, a certain number (e.g., 5) of seed instructions and a certain number (e.g., 3) of self-generated instructions can be extracted as samples, so that the generated instructions both follow the pattern of the seed instructions and remain diverse. The first large language model is a pre-built large language model that can generate similar instructions based on the instructions in the sample set.
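The sampling loop described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: `generate_fn` is a hypothetical stand-in for a call to the first large language model, and the counts 5 and 3 follow the examples in the text.

```python
import random

def build_sample_set(seed_instructions, self_generated, n_seed=5, n_self=3):
    """Draw a prompt sample: n_seed seed instructions plus up to n_self
    previously self-generated instructions (fewer while the pool is small)."""
    sample = random.sample(seed_instructions, min(n_seed, len(seed_instructions)))
    if self_generated:
        sample += random.sample(self_generated, min(n_self, len(self_generated)))
    return sample

def iterate_generation(seed_instructions, generate_fn, rounds=10):
    """Iteratively grow the self-generated instruction set: each round samples
    a prompt set and appends whatever the model returns."""
    self_generated = []  # initially empty; only seeds are sampled at first
    for _ in range(rounds):
        sample = build_sample_set(seed_instructions, self_generated)
        self_generated.extend(generate_fn(sample))
    return self_generated
```

In the first round the sample contains only seed instructions; as the pool grows, self-generated instructions join the prompt, which is what gives the generated data diversity beyond the seeds.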
Optionally, in some embodiments of the present invention, inputting a plurality of seed instructions in a seed instruction set into the first large language model to obtain the self-generated instruction set further comprises: calculating the similarity between each self-generated instruction and other self-generated instructions in the self-generated instruction set; deleting an initial self-generated instruction in the self-generated instruction set, wherein the similarity between any one of the other self-generated instructions is smaller than or equal to a preset similarity threshold value, so as to update the self-generated instruction set.
Optionally, in some embodiments of the present invention, calculating a similarity between each self-generated instruction in the set of self-generated instructions and other self-generated instructions includes: mapping each self-generated instruction into a plurality of hash values using a plurality of hash functions; the Jaccard distance between the hash values of each self-generated instruction in the set of self-generated instructions is calculated as a similarity.
For example, in actual execution, starting from a seed instruction set of carefully selected, manually written instructions and the self-generated instruction set accumulated so far, 5 seed instructions and 3 self-generated instructions are randomly extracted in each generation iteration as the sample set; the sample set is input into the pre-built first large language model to generate new self-generated instructions, which are used to update the self-generated instruction set. New self-generated instructions may be added directly to the set. In some embodiments, instructions with high similarity to existing ones are also deleted from the set. To avoid excessive repetition, k hash functions (k a positive integer) may be used to map the text of each self-generated instruction to k hash values, and the Jaccard distance between hash values is calculated as the similarity measure (a smaller distance indicates higher similarity). A self-generated instruction is retained if its distance to every other self-generated instruction is greater than a preset threshold (e.g., 0.8 or 0.6); equivalently, a self-generated instruction whose distance to any other is less than or equal to the threshold is deleted.
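A minimal sketch of the hashing-based deduplication described above, under the assumption that the k hash functions are MinHash-style hashes over word tokens. The function names are illustrative, and seeded MD5 stands in for the k hash functions; the patent does not prescribe a specific hash family.

```python
import hashlib

def minhash_signature(text, k=16):
    """Map an instruction to k hash values: for each of k seeded hash
    functions, keep the minimum hash over the instruction's word tokens."""
    tokens = set(text.split())
    return [
        min(int(hashlib.md5(f"{seed}:{tok}".encode()).hexdigest(), 16)
            for tok in tokens)
        for seed in range(k)
    ]

def jaccard_distance(sig_a, sig_b):
    """Estimated Jaccard distance: fraction of signature positions that differ."""
    same = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - same / len(sig_a)

def deduplicate(instructions, threshold=0.6, k=16):
    """Keep an instruction only if its distance to every kept instruction
    exceeds the threshold, i.e. it is not too similar to anything kept."""
    kept, kept_sigs = [], []
    for inst in instructions:
        sig = minhash_signature(inst, k)
        if all(jaccard_distance(sig, s) > threshold for s in kept_sigs):
            kept.append(inst)
            kept_sigs.append(sig)
    return kept
```

An exact duplicate has an identical signature (distance 0) and is dropped; unrelated instructions share few minimum hashes, so their distance approaches 1 and both survive.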
In step S102, classifying each self-generated instruction in the set of self-generated instructions by using a preset neural network to determine an actual category of each self-generated instruction, and determining a subset of self-generated instructions for which the actual category meets a preset requirement.
It should be noted that the preset neural network and the preset requirements may be set by those skilled in the art according to the actual situation, and are not specifically limited herein.
It will be appreciated that in embodiments of the present invention, the actual categories of self-generated instructions may include correct questions, low-quality questions, questions requiring an agent to be invoked (e.g., time questions), identity confirmation questions, political questions, control questions, bias questions, etc. These categories may be defined as needed, and the invention is not limited in this respect. In some embodiments, only instructions of one or more categories are passed to the next stage; that is, the preset requirement may be that the category of a self-generated instruction belongs to those one or more categories. For example, because incorrect questions require a series of post-processing steps (a question requiring an agent should follow a state-machine labeling procedure, an identity confirmation question for a large model requires manual injection of identity information, etc.), only correct questions are retained for the next stage; that is, the subset of self-generated instructions meeting the preset requirement consists of the self-generated instructions belonging to the correct-question category.
In particular, the preset neural network may be a pre-built classification neural network (e.g., a large language model) that determines the actual category of each self-generated instruction in the set. Fig. 2 is a schematic diagram of the first part of the architecture of the classification neural network according to an embodiment of the present invention. The network is trained using a BERT-style architecture; its input is an instruction and its output is the classification value predicted by the network. The backbone consists of L stacked encoder layers (L x Layer). Each layer takes N vectors (N x Vector) as input and, passing sequentially through Multi-Head Attention, a residual connection (Add & Norm), a Feed-Forward network, and another residual connection (Add & Norm), outputs N vectors of the same dimension, where L and N are positive integers.
As further shown in fig. 3, a schematic diagram of the second part of the architecture of the classification neural network according to an embodiment of the present invention, after the L encoder layers, the first of the N vectors is passed through a linear layer (Linear Layer) for classification, and finally the classification value (classification label) predicted by the classification neural network is output.
Further, in the training process of the classification neural network, the cross-entropy loss for multi-class problems can be used as the loss function, with the expression:

Loss = -\sum_{i=1}^{C} y_i \log(p_i)

where p_i is the predicted probability that the instruction belongs to the i-th class label, y_i is the one-hot representation of the i-th class label, and C is the total number of labels; y_i = 1 when the instruction belongs to the i-th class, and y_i = 0 otherwise.
It should be understood that other neural network architectures or loss functions may also be employed, and the invention is not limited.
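The classification head and loss described above can be illustrated numerically. This is a minimal sketch, assuming the linear layer's logits are already available; the function names are illustrative and no deep-learning framework is implied.

```python
import math

def softmax(logits):
    """Convert the linear layer's logits into class probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(probs, label, num_classes):
    """Multi-class cross-entropy as in the text: Loss = -sum_i y_i * log(p_i).
    Since y is one-hot, only the true class's log-probability contributes."""
    assert len(probs) == num_classes and abs(sum(probs) - 1.0) < 1e-6
    return -math.log(probs[label])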
In step S103, the subset of self-generated instructions satisfying the preset requirement is clustered to obtain a plurality of clusters of self-generated instructions.
It can be understood that in the embodiment of the present invention, the subset of self-generated instructions determined in the above steps to meet the preset requirement may be clustered using the text features of the instructions.
Optionally, in one embodiment of the present invention, clustering the subset of self-generated instructions meeting the preset requirement to obtain a plurality of clusters of self-generated instructions includes: inputting the self-generated instruction subset meeting the preset requirement into a second large language model for vectorization processing to obtain a feature vector set corresponding to the self-generated instruction subset; and clustering the feature vector set by adopting a clustering algorithm to obtain a plurality of clusters of self-generated instructions.
The second largest language model is a pre-built model, the input of which is text, and the output of which is a feature vector, wherein the feature vector can characterize the features of the text. It should be noted that the clustering algorithm may be set by those skilled in the art according to the actual situation, and is not specifically limited herein.
In some embodiments, the k-means algorithm may be employed to cluster the subset of self-generated instructions that meets the preset requirement. For example, in actual execution, clustering may include the following steps: 1. randomly select k feature vectors as the initial cluster centers; 2. assign each feature vector to the cluster of its nearest center; 3. calculate the mean of the feature vectors in each cluster and update the cluster center with this mean; 4. iteratively execute steps 2 and 3 until the distance moved by the updated cluster centers is smaller than a preset distance threshold. At this point the cluster centers are confirmed, yielding a plurality of clusters of self-generated instructions.
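The four k-means steps above can be sketched in plain Python. This is an illustrative toy implementation operating on small tuples, not the production feature vectors produced by the second large language model.

```python
import random

def kmeans(vectors, k, tol=1e-4, max_iter=100):
    """Plain k-means following the four steps in the text:
    1) pick k initial centers, 2) assign each vector to its nearest center,
    3) recompute each center as the mean of its cluster,
    4) repeat 2-3 until centers move less than tol."""
    centers = random.sample(vectors, k)  # step 1
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:  # step 2: nearest center by squared distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            clusters[i].append(v)
        new_centers = []
        for i, cl in enumerate(clusters):  # step 3: update centers
            if cl:
                dim = len(cl[0])
                new_centers.append(tuple(sum(v[d] for v in cl) / len(cl)
                                         for d in range(dim)))
            else:
                new_centers.append(centers[i])  # keep empty cluster's center
        shift = max(sum((a - b) ** 2 for a, b in zip(c, n)) ** 0.5
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # step 4: stop when centers barely move
            break
    return centers, clusters
```

On two well-separated point groups, the loop converges in a few iterations regardless of which points seed the initial centers.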
In step S104, the self-generated instructions in each of the plurality of clusters that meet the predetermined criteria are filtered to obtain a filtered instruction set, thereby implementing deduplication of the subset of self-generated instructions.
Optionally, in one embodiment of the present invention, filtering the self-generated instructions in each cluster of the plurality of clusters that meet a predetermined criterion to obtain a filtered instruction set includes: scoring the self-generated instructions corresponding to each of the plurality of clusters using a reward model; the set of screening instructions is generated using self-generated instructions that score above a threshold.
It should be noted that the predetermined standard may be set by those skilled in the art according to the actual situation, and is not specifically limited herein. The reward model is a large language model which is built in advance, the input of the reward model is text, and the output of the reward model is the score of a text instruction.
In some embodiments, the score of each self-generated instruction in each cluster may be obtained from the reward model, the scores ranked, and the self-generated instructions whose rank in each cluster meets the predetermined criterion (e.g., top 10%, top 20%, etc.) screened out to obtain the screening instruction set. The predetermined criterion may also be a score threshold, in which case the self-generated instructions in each cluster whose score exceeds the threshold are screened out.
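The per-cluster screening can be sketched as follows. `score_fn` is a hypothetical stand-in for the reward model, and the top-fraction value follows the examples in the text; the function name is illustrative.

```python
def filter_by_reward(clusters, score_fn, top_fraction=0.1):
    """Within each cluster, score every instruction with the reward model
    (score_fn stands in for it) and keep the top fraction by score."""
    selected = []
    for cluster in clusters:
        ranked = sorted(cluster, key=score_fn, reverse=True)
        n_keep = max(1, int(len(ranked) * top_fraction))  # keep at least one
        selected.extend(ranked[:n_keep])
    return selected
```

Screening per cluster rather than globally preserves coverage: every cluster (topic) contributes its best instructions, so the screening set stays diverse while deduplicating near-identical items within each cluster.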
In step S105, a manually labeled screening instruction set is obtained, and a fine-tuning training dataset of the pre-trained large language model is constructed based on the labeled screening instruction set.
It can be understood that in the embodiment of the present invention, the screening instruction set obtained in the step S104 may be manually marked, and the manually marked screening instruction set is used to construct the fine tuning training data set of the pre-trained large language model, so as to be applied to the fine tuning training process of the large language model, so that the trained large language model has stronger accuracy and practicability in practical application.
Annotating the screening instruction set may comprise: rewriting the instructions of the screening instruction set and generating answers; evaluating the rewritten screening instructions and the corresponding answers to obtain an evaluation result; and confirming at least one annotation label for each instruction in the rewritten screening instruction set based on the evaluation result, the annotated screening instruction set being composed of the screening instructions carrying at least one annotation label.
In actual execution, the instructions of the screening instruction set may be manually rewritten and provided with answers, and the quality of the answer corresponding to each rewritten instruction may be judged manually. The quality assessment covers multiple dimensions, such as instruction compliance, including the degree to which the answer matches the question and whether it follows the format required by the instruction; answer correctness, including whether objective facts are respected; correctness of content layout; content richness; and so on. Finally, unqualified instructions and answers are rejected according to the annotation labels. An instruction dataset for fine-tuning training of the pre-trained large language model can then be constructed based on the annotated screening instruction set, for example by selecting and ordering it.
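The final rejection of unqualified instruction/answer pairs by annotation label might be sketched as below; the dimension names and the dictionary schema are illustrative assumptions, not the patent's data format:

```python
# Hypothetical hard-requirement dimensions; content richness is treated
# here as a soft signal rather than a pass/fail label.
REQUIRED_DIMENSIONS = ("instruction_compliance", "answer_correctness",
                       "layout_correctness")

def keep_annotated(sample):
    """A sample is a dict with 'instruction', 'answer', and per-dimension
    boolean labels produced by the human annotators (assumed schema)."""
    return all(sample.get(dim, False) for dim in REQUIRED_DIMENSIONS)

def build_finetune_dataset(annotated_samples):
    """Reject unqualified instruction/answer pairs and keep the rest."""
    return [{"instruction": s["instruction"], "answer": s["answer"]}
            for s in annotated_samples if keep_annotated(s)]
```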
As shown in fig. 4, the operation of the embodiment of the present invention is described in detail below with reference to a specific example.
Step S401: the dialog instructions are self-generated.
Using the in-context learning capability of the language model, eight example instructions are provided, after which the large language model continuously and iteratively generates new instructions.
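The iterative generation loop of step S401 might be sketched as below; `generate_fn` is a hypothetical stand-in for the first large language model, and the eight-example prompt size follows the description above:

```python
import random

def self_generate(seed_instructions, generate_fn, num_examples=8, rounds=3):
    """Iteratively grow a self-generated instruction set.

    `generate_fn` takes a list of in-context example instructions and
    returns newly generated instructions (an assumed interface)."""
    pool = []
    for _ in range(rounds):
        # Examples are drawn from the seeds and, once available, from the
        # instructions generated in earlier rounds as well.
        candidates = seed_instructions + pool
        examples = random.sample(candidates, min(num_examples, len(candidates)))
        for new in generate_fn(examples):
            if new not in candidates:  # basic exact-match de-duplication
                pool.append(new)
    return pool
```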
Step S402: instruction classification screening.
The instructions are classified by a pre-trained classification neural network into correct questions, low-quality questions, instructions that require calling an agent (such as time-related questions), identity-confirmation questions, political questions, controlled questions, biased questions, and the like. Incorrect questions undergo a series of post-processing steps, and only correct questions proceed to the next stage.
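The classification screening of step S402 might look like the following sketch; the category identifiers and the `classify` callable are illustrative stand-ins for the pre-trained classification network:

```python
# Hypothetical category labels mirroring the description above.
KEEP = {"correct"}
DROP = {"low_quality", "needs_agent", "identity", "political",
        "controlled", "biased"}

def filter_by_category(instructions, classify):
    """Keep only instructions whose predicted category is 'correct';
    `classify` maps an instruction string to a category label."""
    return [ins for ins in instructions if classify(ins) in KEEP]
```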
Step S403: instruction deduplication.
A large language model is used to vectorize the instructions into feature vectors, and a clustering algorithm is then run. After clustering is complete, the instructions in each cluster with the higher scores output by the reward model are selected to enter the annotation stage of the next step.
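A toy sketch of the step S403 pipeline, with a deterministic minimal k-means standing in for the production clustering algorithm and `reward_score` as a hypothetical stand-in for the reward model:

```python
def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=10):
    """Minimal, deterministic k-means over lists of floats; a placeholder
    for the clustering algorithm run on the LLM feature vectors."""
    centers = [list(p) for p in points[:k]]  # naive init for illustration
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign

def pick_best_per_cluster(instructions, vectors, k, reward_score):
    """Cluster instructions by feature vector, then keep the highest-scoring
    instruction of each cluster according to `reward_score`."""
    assign = kmeans(vectors, k)
    best = {}
    for ins, cluster in zip(instructions, assign):
        if cluster not in best or reward_score(ins) > reward_score(best[cluster]):
            best[cluster] = ins
    return sorted(best.values())
```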
Step S404: and delivering the label.
Annotation is divided into two stages: in the first stage, the provided instructions are manually rewritten and answered; in the second stage, the quality of the answers is judged manually.
According to the data construction method for fine-tuning training of a pre-trained large language model proposed in the embodiment of the present invention, instruction data can be generated by a large language model, and a high-quality training dataset is obtained by screening and de-duplicating the self-generated instruction data for use in fine-tuning training of the large language model. This improves the application performance of the large language model, reduces the time and labor cost of constructing training data, and makes the data construction results more practical and diverse. It thereby addresses the problems in the related art that manually written instruction fine-tuning data are limited in quantity, diversity, and creativity; that the resulting dialogue data are highly repetitive and contain much low-quality data and useless information, so the quality of the training data cannot be ensured, affecting the safety and applicability of the trained large language model; and that labor costs are high.
Next, a data construction apparatus for fine-tuning training of a pre-trained large language model according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 5 is a schematic diagram of the structure of a data construction apparatus for fine-tuning training of a pre-trained large language model according to an embodiment of the present invention.
As shown in fig. 5, the data construction apparatus 10 for fine-tuning training of a pre-trained large language model includes: the system comprises an acquisition module 100, a classification module 200, a clustering module 300, a screening module 400 and a construction module 500.
The obtaining module 100 is configured to input a plurality of seed instructions in a seed instruction set into the first large language model to obtain a self-generated instruction set.
The classification module 200 is configured to classify each self-generated instruction in the set of self-generated instructions by using a preset neural network, so as to determine an actual class of each self-generated instruction, and determine a subset of self-generated instructions that the actual class meets a preset requirement.
The clustering module 300 is configured to cluster the subset of the self-generated instructions that meets the preset requirement, so as to obtain a plurality of clusters of the self-generated instructions.
The screening module 400 is configured to screen the self-generated instructions meeting the predetermined criterion in each cluster of the plurality of clusters to obtain a screening instruction set.
The construction module 500 is configured to obtain a manually labeled screening instruction set, and construct a fine tuning training data set of the pre-trained large language model based on the labeled screening instruction set.
Optionally, in one embodiment of the present invention, the acquiring module 100 includes: a first extraction unit, a first updating unit, a second extraction unit, and a second updating unit.
The first extraction unit is used for randomly extracting a first number of seed instructions from the seed instruction set to generate a first sample set.
The first updating unit is used for inputting the first sample set into the first large language model and updating the self-generated instruction set with the obtained instructions.
The second extraction unit is used for randomly extracting a first number of seed instructions from the seed instruction set and randomly extracting a second number of instructions from the self-generated instruction set to generate a second sample set.
The second updating unit is used for inputting the second sample set into the first large language model and updating the self-generated instruction set with the obtained instructions.
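The two-stage sampling performed by the extraction units might be sketched as follows; this is a minimal illustration under stated assumptions (the function names and the use of `random.sample` are not the patent's implementation):

```python
import random

def first_sample_set(seed_set, first_num):
    """Bootstrap round: all in-context examples come from the seeds."""
    return random.sample(seed_set, min(first_num, len(seed_set)))

def second_sample_set(seed_set, generated_set, first_num, second_num):
    """Later rounds: mix seed instructions with already self-generated ones
    so new generations drift away from the seed distribution."""
    sample = random.sample(seed_set, min(first_num, len(seed_set)))
    sample += random.sample(generated_set, min(second_num, len(generated_set)))
    return sample
```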
Optionally, in one embodiment of the present invention, the obtaining module 100 further includes: a computing unit and a third updating unit.
The computing unit is used for computing the similarity between each self-generated instruction and the other self-generated instructions in the self-generated instruction set.
The third updating unit is configured to delete, from the self-generated instruction set, an initial self-generated instruction whose similarity to any one of the other self-generated instructions is less than or equal to a preset similarity threshold, so as to update the self-generated instruction set.
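Because the similarity here is measured as a Jaccard distance (see the computing unit below), a small value means near-duplication, so instructions at or below the threshold are dropped. A greedy sketch of the deletion rule, with `distance` as a hypothetical callable:

```python
def deduplicate(instructions, distance, threshold):
    """Greedy near-duplicate removal: an instruction is dropped when its
    distance to any already-kept instruction is at or below the threshold."""
    kept = []
    for ins in instructions:
        if all(distance(ins, other) > threshold for other in kept):
            kept.append(ins)
    return kept
```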
Optionally, in one embodiment of the present invention, the clustering module 300 includes: a vectorization unit and a clustering unit.
The vectorization unit is used for inputting the self-generated instruction subset meeting the preset requirement into the second large language model for vectorization processing so as to obtain a feature vector set corresponding to the self-generated instruction subset.
The clustering unit is used for clustering the feature vector set by adopting a clustering algorithm to obtain the plurality of clusters of self-generated instructions.
Optionally, in one embodiment of the present invention, the screening module 400 includes: a scoring unit and a generating unit.
The scoring unit is used for scoring the self-generated instructions corresponding to each cluster in the plurality of clusters by using the reward model.
The generating unit is used for generating the screening instruction set from the self-generated instructions whose scores exceed a threshold.
Optionally, in one embodiment of the present invention, the computing unit is specifically configured to: map each self-generated instruction into a plurality of hash values using a plurality of hash functions; and calculate the Jaccard distance between the hash values of the self-generated instructions in the self-generated instruction set as the similarity.
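A minimal sketch of this MinHash-style construction, assuming word-level shingles and `hashlib.md5` with per-seed prefixes as the family of hash functions (both are illustrative assumptions):

```python
import hashlib

def minhash_signature(text, num_hashes=32):
    """Map an instruction to `num_hashes` hash values: for each seeded hash
    function, take the minimum hash over the instruction's words."""
    words = text.split()
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(("%d:%s" % (seed, w)).encode()).hexdigest(), 16)
            for w in words))
    return sig

def jaccard_distance(sig_a, sig_b):
    """Estimate the Jaccard distance from two MinHash signatures: the
    fraction of positions where the signatures disagree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1 - matches / len(sig_a)
```

The fraction of matching signature positions estimates the Jaccard similarity of the underlying word sets, so one minus that fraction serves as the distance used for de-duplication.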
It should be noted that the explanation of the foregoing embodiment of the data construction method for fine-tuning training of a pre-trained large language model is also applicable to the data construction device for fine-tuning training of a pre-trained large language model of this embodiment, and will not be repeated here.
According to the data construction device for fine-tuning training of a pre-trained large language model proposed in the embodiment of the present invention, instruction data can be generated by a large language model, and a high-quality training dataset is obtained by screening and de-duplicating the self-generated instruction data for use in fine-tuning training of the large language model. This improves the application performance of the large language model, reduces the time and labor cost of constructing training data, and makes the data construction results more practical and diverse. It thereby addresses the problems in the related art that manually written instruction fine-tuning data are limited in quantity, diversity, and creativity; that the resulting dialogue data are highly repetitive and contain much low-quality data and useless information, so the quality of the training data cannot be ensured, affecting the safety and applicability of the trained large language model; and that labor costs are high.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device may include:
a memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602.
The processor 602, when executing the program, implements the data construction method for fine-tuning training of a pre-trained large language model provided in the above embodiments.
Further, the electronic device further includes:
A communication interface 603 for communication between the memory 601 and the processor 602.
A memory 601 for storing a computer program executable on the processor 602.
The memory 601 may comprise a high-speed RAM, and may further comprise a non-volatile memory, such as at least one disk memory.
If the memory 601, the processor 602, and the communication interface 603 are implemented independently, they may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 601, the processor 602, and the communication interface 603 are integrated on a chip, the memory 601, the processor 602, and the communication interface 603 may perform communication with each other through internal interfaces.
The processor 602 may be a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the invention.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data construction method for fine-tuning training of a pre-trained large language model as above.
The present embodiment also provides a computer program product comprising a computer program which, when executed by a processor, implements a data construction method for fine-tuning training of a pre-trained large language model as described above.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification, and the features thereof, may be combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, for example via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field-Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. A data construction method for fine-tuning training of a pre-trained large language model, comprising the steps of:
inputting a plurality of seed instructions in a seed instruction set into a first large language model to obtain a self-generated instruction set;
classifying each self-generated instruction in the self-generated instruction set by using a preset neural network to determine the actual category of each self-generated instruction, and determining a self-generated instruction subset of which the actual category meets the preset requirement;
clustering the self-generated instruction subsets meeting the preset requirements to obtain a plurality of clusters of the self-generated instructions;
screening self-generated instructions meeting a predetermined criterion in each of the plurality of clusters to obtain a screening instruction set;
and obtaining a manually annotated screening instruction set, and constructing a fine-tuning training dataset of the pre-trained large language model based on the annotated screening instruction set.
2. The method of claim 1, wherein inputting the plurality of seed instructions in the seed instruction set into the first large language model to obtain the self-generated instruction set comprises:
randomly extracting a first number of seed instructions from the seed instruction set to generate a first sample set;
inputting the first sample set into the first large language model, and updating a self-generated instruction set by using the obtained instructions;
randomly extracting a first number of seed instructions from the seed instruction set and randomly extracting a second number of instructions from the self-generated instruction set to generate a second sample set;
inputting the second sample set into the first large language model and updating the self-generated instruction set with the obtained instructions.
3. The method of claim 1 or 2, wherein inputting a plurality of seed instructions in a seed instruction set into a first large language model to obtain a self-generated instruction set, further comprising:
calculating the similarity between each self-generated instruction and other self-generated instructions in the self-generated instruction set;
deleting an initial self-generated instruction in the self-generated instruction set whose similarity to any one of the other self-generated instructions is less than or equal to a preset similarity threshold, so as to update the self-generated instruction set.
4. The method according to claim 1 or 2, wherein clustering the subset of self-generated instructions meeting the preset requirement to obtain a plurality of clusters of self-generated instructions comprises:
Inputting the self-generated instruction subset meeting the preset requirement into a second large language model for vectorization processing to obtain a feature vector set corresponding to the self-generated instruction subset; and
clustering the feature vector set by adopting a clustering algorithm to obtain the plurality of clusters of self-generated instructions.
5. The method according to claim 1 or 2, wherein said screening of the self-generated instructions in each of the plurality of clusters meeting a predetermined criterion to obtain a set of screening instructions comprises:
scoring the self-generated instructions corresponding to each of the plurality of clusters using a reward model;
generating the set of screening instructions using the self-generated instructions that score above a threshold.
6. A method according to claim 3, wherein said calculating the similarity between each self-generated instruction in said set of self-generated instructions and other self-generated instructions comprises:
mapping each self-generated instruction into a plurality of hash values using a plurality of hash functions;
calculating a Jaccard distance between the plurality of hash values of each self-generated instruction in the set of self-generated instructions as the similarity.
7. A data construction apparatus for fine-tuning training of a pre-trained large language model, comprising:
An acquisition module for inputting a plurality of seed instructions in a seed instruction set into a first large language model to acquire a self-generated instruction set;
the classification module is used for classifying each self-generated instruction in the self-generated instruction set by using a preset neural network so as to determine the actual category of each self-generated instruction and determine a self-generated instruction subset of which the actual category meets the preset requirement;
The clustering module is used for clustering the self-generated instruction subsets meeting the preset requirements to obtain a plurality of clusters of the self-generated instructions;
a screening module, configured to screen self-generated instructions meeting a predetermined criterion in each cluster of the plurality of clusters to obtain a screening instruction set;
The construction module is used for obtaining the screening instruction set after manual labeling and constructing a fine tuning training data set of the pre-trained large language model based on the screening instruction set after labeling.
8. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the data construction method for fine-tuning training of a pre-trained large language model as claimed in any one of claims 1-6.
9. A computer-readable storage medium having stored thereon a computer program, wherein the program is executed by a processor for implementing a data construction method for fine-tuning training of a pre-trained large language model according to any of claims 1-6.
10. A computer program product comprising a computer program for implementing a data construction method for fine-tuning training of a pre-trained large language model according to any of claims 1-6 when executed by a processor.
CN202410439446.8A 2024-04-12 2024-04-12 Data construction method and device for large language model fine tuning training Pending CN118035751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410439446.8A CN118035751A (en) 2024-04-12 2024-04-12 Data construction method and device for large language model fine tuning training


Publications (1)

Publication Number Publication Date
CN118035751A true CN118035751A (en) 2024-05-14

Family

ID=90997227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410439446.8A Pending CN118035751A (en) 2024-04-12 2024-04-12 Data construction method and device for large language model fine tuning training

Country Status (1)

Country Link
CN (1) CN118035751A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113190646A (en) * 2020-01-14 2021-07-30 北京达佳互联信息技术有限公司 User name sample labeling method and device, electronic equipment and storage medium
CN117093684A (en) * 2023-07-26 2023-11-21 企知道科技有限公司 Method and system for constructing pretrained conversational large language model in enterprise service field
WO2023235346A1 (en) * 2022-06-03 2023-12-07 Google Llc Prompting machine-learned models using chains of thought
CN117407589A (en) * 2023-10-30 2024-01-16 复旦大学 Model generation of anti-theory points, training and reasoning method of model and evaluation standard based on large model
CN117436505A (en) * 2023-10-31 2024-01-23 北京百度网讯科技有限公司 Training data processing method, training device, training equipment and training medium
CN117453885A (en) * 2023-11-07 2024-01-26 腾讯科技(深圳)有限公司 Question information processing method, device, equipment, storage medium and product
KR102637029B1 (en) * 2023-10-11 2024-02-15 주식회사 마인즈앤컴퍼니 Device for Generating Multi-turn Chat Bot Data Using LLM and Driving Method Thereof
CN117573835A (en) * 2023-12-04 2024-02-20 华侨大学 Automatic generation method, device, equipment and medium for large model fine tuning instruction
CN117634468A (en) * 2023-11-30 2024-03-01 北京智谱华章科技有限公司 Universal text quality evaluation method based on large language model
CN117667202A (en) * 2023-12-05 2024-03-08 新大陆数字技术股份有限公司 LLM-based method for automatically generating instruction data set
CN117709435A (en) * 2024-02-05 2024-03-15 粤港澳大湾区数字经济研究院(福田) Training method of large language model, code generation method, device and storage medium
CN117852616A (en) * 2024-02-29 2024-04-09 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Big language model alignment fine tuning method and system based on enhanced reject sampling training



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination