CN118114049A - Training data set screening method and device for large language model, electronic equipment and storage medium - Google Patents


Info

Publication number
CN118114049A
Authority
CN
China
Prior art keywords
data
screened
data set
training
model
Prior art date
Legal status
Pending
Application number
CN202410232115.7A
Other languages
Chinese (zh)
Inventor
周玮康
邓佳佶
于飞
Current Assignee
Ant Fortune Shanghai Financial Information Service Co ltd
Original Assignee
Ant Fortune Shanghai Financial Information Service Co ltd
Priority date
Filing date
Publication date
Application filed by Ant Fortune Shanghai Financial Information Service Co ltd
Priority to CN202410232115.7A
Publication of CN118114049A

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of this specification disclose a training data set screening method and apparatus for a large language model, an electronic device, and a storage medium. The method includes: determining a target application scenario of a model to be trained, and acquiring a data set to be screened and sample data corresponding to the target application scenario; inputting the sample data and the data set to be screened into the model to be trained, and calculating, for each piece of data to be screened in the data set to be screened, its degree of influence on the model to be trained in the target application scenario, where the degree of influence characterizes how similar that piece of data to be screened is to the sample data; and screening a target data set out of the data set to be screened according to the degree of influence of each piece of data to be screened.

Description

Training data set screening method and device for large language model, electronic equipment and storage medium
Technical Field
One or more embodiments of this specification relate to the field of language model technologies, and in particular to a training data set screening method and apparatus for a large language model, an electronic device, and a storage medium.
Background
A large language model (Large Language Model, LLM) is a deep learning model that is trained on a large amount of text data as its training data set and learns how language is used. A large language model can generate natural language text or understand the meaning of language text, and can therefore handle various natural language tasks such as text classification, question answering, and dialogue. Large language models trained on different types of text data perform differently in different scenarios. For example, a large language model trained on financial task data has a strong ability to understand and generate text of financial scenarios but a weak ability for text of sports scenarios. Therefore, how the training data set of a large language model is screened is an important factor affecting how well the large language model performs.
In the related art, the training data set is screened by searching for keywords corresponding to each scenario. For example, in a financial scenario, data containing keywords such as "finance", "economy" and "stock" is used as training data. However, in this approach the keywords are determined manually, which is inefficient and cannot guarantee accuracy; in addition, a large amount of financial-scenario data that does not contain the keywords is easily missed.
Disclosure of Invention
The embodiments of this specification provide a training data set screening method and apparatus for a large language model, an electronic device, and a storage medium. The technical solutions are as follows:
In a first aspect, embodiments of the present disclosure provide a training data set screening method for a large language model, including:
determining a target application scene of a model to be trained, and acquiring a data set to be screened and sample data corresponding to the target application scene;
Inputting the sample data and the data set to be screened into the model to be trained, and respectively calculating the influence degree of each piece of data to be screened in the data set to be screened on the target application scene of the model to be trained; the influence degree is used for representing the similarity degree of each piece of data to be screened and the sample data;
and screening a target data set from the data set to be screened according to the influence degree corresponding to each piece of data to be screened.
In a second aspect, embodiments of the present disclosure provide a training data set screening apparatus for a large language model, including:
the acquisition unit is used for determining a target application scene of the model to be trained and acquiring a data set to be screened and sample data corresponding to the target application scene;
The computing unit is used for inputting the sample data and the data set to be screened into the model to be trained and respectively computing the influence degree of each piece of data to be screened in the data set to be screened on the target application scene of the model to be trained; the influence degree is used for representing the similarity degree of each piece of data to be screened and the sample data;
and the screening unit is used for screening out a target data set from the data set to be screened according to the influence degree corresponding to each piece of data to be screened.
In a third aspect, embodiments of the present disclosure provide an electronic device including a processor and a memory; the processor is connected with the memory; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, for performing the steps of the training data set screening method of a large language model according to the first aspect of the above embodiment.
In a fourth aspect, embodiments of the present disclosure provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the training dataset screening method of a large language model according to the first aspect of the embodiments described above.
The technical scheme provided by some embodiments of the present specification has the following beneficial effects:
For a model to be trained that is applied to a target application scenario, the training data set is screened as follows: the data set to be screened and sample data of the target application scenario are input into the model to be trained, and the degree of influence of each piece of data to be screened on the model to be trained in the target application scenario is calculated. Because the degree of influence characterizes how similar each piece of data to be screened is to the sample data, the target data set screened out of the data set to be screened on the basis of the degree of influence is more similar to the sample data. Compared with training on a general data set, training the model to be trained on the target data set is more efficient, and the trained model processes data of the target application scenario more accurately.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present description, the drawings that are required in the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present description, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a data processing system of a large language model provided in an embodiment of the present disclosure.
Fig. 2 is a flowchart of a training data set screening method for a large language model according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a gradient in a feature encoder provided in an embodiment of the present disclosure.
Fig. 4 is a block diagram of a training data set screening apparatus for a large language model according to an embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the term "include" and any variations thereof is intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The present specification, prior to detailing a training dataset screening method for a large language model in connection with one or more embodiments, introduces a framework for processing data for a large language model.
FIG. 1 is a block diagram of a data processing system for a large language model provided in an exemplary embodiment. As shown in fig. 1, a handset 11 and a server 12 may be included.
The mobile phone 11 is one type of electronic device that a user may use, but the electronic device used by the user is not limited thereto and may include, for example, tablet devices, notebook computers, personal computers, personal digital assistants (PDAs), wearable devices (e.g., smart glasses, smart watches), and the like, which is not limited in this specification. In operation, the mobile phone 11 runs a client program of the screening system, so that the mobile phone 11 is configured as a client of the screening system. The client program may receive the data to be processed input by the user and forward the data to be processed to the server 12, so that the server 12 processes the data to be processed.
The server 12 may be a physical server comprising a separate host, or the server 12 may be a virtual server carried by a host cluster. A model 121 to be trained and a large language model 122 are deployed on the server 12. The model 121 to be trained is a language model that has not yet been trained and cannot yet be used to process data; the large language model 122 is a trained language model that can be used to process data. Multiple models may be deployed on the same server (such as the server 12), so that during model training the acquired training data can be used to train the multiple models, which reduces the amount of data transmitted and saves transmission cost; alternatively, only one model may be deployed on a server, which avoids multiple models competing for the computing resources of the same server.
The server 12 has a server program of the screening system running thereon such that the server 12 is configured as a server of the screening system. The server program may be matched with the above-mentioned client, for example, may receive the data to be processed sent by the client, and input the received data to be processed into the large language model 122 deployed by the server. The interaction between the mobile phone 11 and the server 12 may include various types of wired or wireless interactions, which are not limited in this specification.
Large language models trained on different types of text data perform differently in different scenarios. For example, a large language model trained on financial task data has a strong ability to understand and generate text of financial scenarios but a weak ability for text of sports scenarios. Therefore, how the training data set of a large language model is screened is an important factor affecting how well the large language model performs.
In the related art, the training data set is screened by searching for keywords corresponding to each scenario. For example, in a financial scenario, data containing keywords such as "finance", "economy" and "stock" is used as training data. However, in this approach the keywords are determined manually, which is inefficient and cannot guarantee accuracy; in addition, a large amount of financial-scenario data that does not contain the keywords is easily missed.
In order to solve the problems in the related art, the present specification proposes a training dataset screening method for a large-scale language model.
Referring to fig. 2, fig. 2 is a flow chart illustrating a training data set screening method of a large language model according to an embodiment of the present disclosure, and as shown in fig. 2, the training data set screening method of the language model may at least include the following steps:
Step 202, determining a target application scene of a model to be trained, and acquiring a data set to be screened and sample data corresponding to the target application scene.
As previously mentioned, the large language model may be applied to different application scenarios, such as: financial scenes, sports scenes, medical scenes, and so forth. The target application scenario may be an application scenario corresponding to a training direction of the model to be trained specified by a user (the user mentioned in the present specification may be a software developer of the model to be trained, or other personnel having management rights of the model to be trained).
Sample data corresponding to the target application scenario refers to text data specific to the target application scenario. For example, in the case that the target application scenario is a financial scenario, the sample data may be financial data (e.g., "the financing scale of a certain enterprise has further expanded", "a certain enterprise is on the verge of bankruptcy"); in the case that the target application scenario is a sports scenario, the sample data may be sports data (e.g., "a player wins the gold medal in a diving event", "a player breaks a world record"). The sample data may be determined manually in advance as corresponding to the target application scenario. The sample data may be a single piece of text data or multiple pieces of text data; this specification does not limit the amount of sample data.
The data set to be screened contains a plurality of pieces of data to be screened, and the data to be screened is usually text data, for example: a word, a piece of text, or an article. Of course, the data to be screened may be other forms of data such as voice data and image data, in which case the large language model may extract feature information in the voice data or the image data and convert the voice data or the image data into text data according to the extracted feature information.
There are many ways to obtain the data set to be screened and the sample data, for example: may be obtained from a database that stores training data for language models specifically, or may be manually entered by a user. The present description is not limited thereto.
Step 204, inputting the sample data and the data set to be screened into the model to be trained, and respectively calculating the influence degree of each piece of data to be screened in the data set to be screened on the target application scene of the model to be trained; the influence degree is used for representing the similarity degree of each piece of data to be screened and the sample data.
The degree of similarity between the data to be screened and the sample data is represented by the degree of influence of the data to be screened on the model to be trained in the target application scene. In this way, data that is more similar to the sample data can be screened out later in the screening process; and because the sample data corresponds to the target application scene, the screened-out data is highly likely to correspond to the target application scene as well.
The specific calculation process will be described in detail later, and will not be described here again.
And 206, screening out a target data set from the data set to be screened according to the influence degree corresponding to each piece of data to be screened.
Data to be screened with a large degree of influence can be added to the target data set. Specifically, the top fifty pieces of data to be screened in the influence ranking (the specific number can be preset by the user) may be added to the target data set; or the top ten percent of the data to be screened by influence ranking (the specific percentage can be preset by the user) may be added to the target data set; or a numeric value may be used to characterize the degree of influence, and the data to be screened whose influence exceeds a certain threshold (which can be preset by the user) is added to the target data set. This specification is not limited in this regard.
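As an illustration only, the following is a minimal Python sketch of the three screening strategies just described (top-N by rank, top percentage, and threshold); the function name, parameters, and defaults are hypothetical and are not prescribed by the embodiments.

```python
from typing import List, Optional, Tuple

def select_target_dataset(scored: List[Tuple[str, float]],
                          strategy: str = "top_n",
                          top_n: int = 50,
                          top_ratio: float = 0.10,
                          threshold: Optional[float] = None) -> List[str]:
    """Screen a target data set out of (data_to_be_screened, influence) pairs."""
    # Rank the candidates by degree of influence, largest first.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

    if strategy == "top_n":
        # Keep the top-N candidates (N can be preset by the user).
        kept = ranked[:top_n]
    elif strategy == "top_ratio":
        # Keep the top percentage of candidates (percentage preset by the user).
        kept = ranked[:max(1, int(len(ranked) * top_ratio))]
    elif strategy == "threshold":
        # Keep candidates whose influence exceeds a preset threshold.
        kept = [pair for pair in ranked if threshold is not None and pair[1] > threshold]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return [data for data, _ in kept]
```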
In this embodiment, for the model to be trained that is applied to the target application scenario, the training data set is screened: the data set to be screened and the sample data of the target application scenario are input into the model to be trained, and the degree of influence of each piece of data to be screened on the model to be trained in the target application scenario is calculated. Because the degree of influence characterizes how similar each piece of data to be screened is to the sample data, the target data set screened out of the data set to be screened on the basis of the degree of influence is more similar to the sample data. Compared with training on a general data set, training the model to be trained on the target data set is more efficient, and the trained model processes data more accurately.
In an embodiment, the calculating the influence degree of each piece of to-be-screened data in the to-be-screened data set on the to-be-trained model on the target application scene includes: respectively calculating a first gradient of the sample data in a feature encoder of the model to be trained and a second gradient of each piece of data to be screened in the data set to be screened in the feature encoder; and calculating cosine similarity of the first gradient and the second gradient, wherein the cosine similarity is used for representing the influence degree of corresponding data to be screened on the target application scene of the model to be trained.
Further, the degree of influence is the degree of influence of the corresponding data to be screened on the loss of the sample data in the iterative process. Calculating the degree of influence of each piece of data to be screened in the data set to be screened on the model to be trained in the target application scene then includes: expanding the loss value of the sample data in the iterative process with a first-order Taylor formula to obtain a loss formula of the sample data in the iterative process; and optimizing the loss formula according to stochastic gradient descent, and determining from the optimization result that, when the learning rate in the iterative process is sufficiently small, the loss value approximates the cosine similarity.
Since the sample data corresponds to the target application scene, the influence of a piece of data to be screened on the model to be trained in the target application scene can be regarded as the influence of that piece of data on the sample data during an iteration; and this influence on the sample data during an iteration can be measured as the degree of influence on the loss of the sample data in the iterative process. Therefore, the degree of influence of the data to be screened on the model to be trained in the target application scene is the degree of influence of the data to be screened on the loss of the sample data in the iterative process.
The following describes a specific calculation procedure of the influence degree in combination with a calculation formula:
I(z_p, z_t) = l(z_t; θ_t) - l(z_t; θ_{t+1})

where z_p is a piece of data to be screened, z_t is the sample data, l(z_t; θ) is the loss of z_t under the parameters θ of the feature encoder in the model to be trained (data input to the model for an iteration affects these parameters), θ_t and θ_{t+1} are the parameters of the feature encoder before and after the t-th iteration, in which the model is updated on z_p, and I(z_p, z_t) is the resulting change in the loss of z_t, i.e., the degree of influence of z_p.

Expanding l(z_t; θ_{t+1}) with a first-order Taylor formula gives:

l(z_t; θ_{t+1}) ≈ l(z_t; θ_t) + ∇_θ l(z_t; θ_t) · (θ_{t+1} - θ_t) + O(‖θ_{t+1} - θ_t‖²)

The model in this specification uses stochastic gradient descent (SGD) as the optimizer. In this case, the parameter update of the feature encoder follows:

θ_{t+1} = θ_t - η_t ∇_θ l(z_p; θ_t)

where η_t is the learning rate of the t-th iteration.

Combining the first-order Taylor expansion with the parameter update formula, the degree of influence can be simplified to:

I(z_p, z_t) ≈ η_t ∇_θ l(z_t; θ_t) · ∇_θ l(z_p; θ_t) + O(‖θ_{t+1} - θ_t‖²)

Applying the Taylor formula requires that the update step of θ is small enough, i.e., that the learning rate η_t of the t-th iteration is small enough. When η_t is sufficiently small, the term O(‖θ_{t+1} - θ_t‖²) can be ignored, and it follows that:

I(z_p, z_t) ≈ η_t ∇_θ l(z_t; θ_t) · ∇_θ l(z_p; θ_t)

Here ∇_θ l(z_t; θ_t) is the first gradient, i.e., the gradient of the sample data in the feature encoder, and ∇_θ l(z_p; θ_t) is the second gradient, i.e., the gradient of the data to be screened in the feature encoder. Up to the positive factor η_t, the degree of influence is the inner product of the two gradients, and after normalization this inner product is their cosine similarity. Thus, the cosine similarity can be used to characterize the change in the loss of the sample data when the learning rate in the iterative process is sufficiently small.
The cosine similarity is further described with reference to Fig. 3: g_1 and g_2 are the gradients of two different pieces of data to be screened in the feature encoder, and g' is the gradient of the sample data in the feature encoder. Clearly g_2 · g' is greater than g_1 · g', that is, the data to be screened corresponding to g_2 is more similar to the sample data.
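As a toy numeric illustration of the point made by Fig. 3 (the gradient vectors below are made up for illustration and do not come from any real model):

```python
import numpy as np

# Hypothetical 2-D gradients, as in Fig. 3: g1 and g2 come from two pieces of
# data to be screened, g_prime from the sample data.
g1 = np.array([1.0, -0.5])
g2 = np.array([0.8, 0.9])
g_prime = np.array([1.0, 1.0])

def cos(a, b):
    # Cosine similarity: normalized inner product of the two gradients.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(g1, g_prime))  # ~0.32: weakly aligned with the sample-data gradient
print(cos(g2, g_prime))  # ~0.99: strongly aligned, so this candidate is more similar
```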
In this embodiment, the degree of influence of the corresponding data to be screened on the model to be trained in the target application scene is represented by the cosine similarity between the first gradient of the sample data in the feature encoder and the second gradient of the data to be screened in the feature encoder. This quantifies the degree of influence and simplifies the calculation flow: the change in the loss of the sample data does not need to be computed explicitly, because it is represented directly by the cosine similarity between the gradients.
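A minimal PyTorch-style sketch of this gradient-based influence calculation follows. It assumes the model to be trained exposes its feature encoder as `model.feature_encoder` and that a helper `loss_fn(model, data)` returns the loss of a single piece of data; both names are assumptions for illustration, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def feature_encoder_grad(model, loss_fn, data):
    """Gradient of the loss for `data` w.r.t. the feature-encoder parameters,
    flattened into one vector (assumes `model.feature_encoder` exists)."""
    params = [p for p in model.feature_encoder.parameters() if p.requires_grad]
    loss = loss_fn(model, data)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, sample_data, candidates):
    """Cosine similarity between the sample-data gradient (first gradient)
    and each candidate's gradient (second gradient), used here as the
    degree of influence of each piece of data to be screened."""
    g_sample = feature_encoder_grad(model, loss_fn, sample_data)   # first gradient
    scores = []
    for z_p in candidates:
        g_cand = feature_encoder_grad(model, loss_fn, z_p)         # second gradient
        scores.append(F.cosine_similarity(g_sample, g_cand, dim=0).item())
    return scores
```

In practice the candidate gradients could be batched or cached, since recomputing one backward pass per candidate is the dominant cost of this screening step.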
In an embodiment, the method further includes the following step: taking the target data set as a training data set, and training the model to be trained according to the training data set.
After the target data set is screened out, it can be used as the training data of the model to be trained. Because the data in the screened target data set all have high similarity to the sample data, training the model to be trained with the target data set is more efficient (the degree of influence reflects this: the higher the influence, the faster training is completed). Moreover, since text data of other application scenes is no longer used as training data, the trained model processes data of the target application scene more accurately.
Further, the sample data is added to the training dataset.
Because the sample data also corresponds to the target application scene, the sample data can be added into the training data set, and the training data set is expanded, so that the training effect is improved.
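A minimal sketch of this training step follows, assuming the same hypothetical `loss_fn(model, item)` helper as above and using SGD as in the derivation; the hyperparameters and per-item update loop are illustrative assumptions rather than the patented implementation.

```python
import torch

def train_on_target_dataset(model, loss_fn, target_dataset, sample_data=None,
                            epochs: int = 1, lr: float = 1e-5):
    """Fine-tune the model to be trained on the screened target data set.

    Per the embodiment, the sample data may also be added to the training
    data set, and SGD is used as the optimizer, matching the derivation above.
    """
    train_set = list(target_dataset)
    if sample_data is not None:
        train_set += list(sample_data)      # expand the training data set

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for item in train_set:
            optimizer.zero_grad()
            loss = loss_fn(model, item)
            loss.backward()
            optimizer.step()
    return model
```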
In an embodiment, the method further includes the following step: in response to acquiring data to be processed in the target application scene, inputting the data to be processed into the large language model obtained through training, so that the large language model processes the data to be processed.
The large language model obtained by training on the target data set can be used to process the data to be processed in the target application scenario. In this embodiment, for the model to be trained that is applied to the target application scenario, the training data set is screened: the data set to be screened and the sample data of the target application scenario are input into the model to be trained, and the degree of influence of each piece of data to be screened on the model to be trained in the target application scenario is calculated. Because the degree of influence characterizes how similar each piece of data to be screened is to the sample data, the target data set screened out of the data set to be screened on the basis of the degree of influence is more similar to the sample data. Compared with training on a general data set, training the model to be trained on the target data set is more efficient, and the trained model processes data of the target application scenario more accurately.
Referring to fig. 4, fig. 4 is a block diagram of a training data set screening apparatus for a large language model according to an embodiment of the present disclosure. The device comprises:
an obtaining unit 402, configured to determine a target application scenario of a model to be trained, and obtain a data set to be screened and sample data corresponding to the target application scenario;
The computing unit 404 is configured to input the sample data and the data set to be screened into the model to be trained, and respectively calculate the influence degree of each piece of data to be screened in the data set to be screened on the target application scene of the model to be trained; the influence degree is used for representing the similarity degree of each piece of data to be screened and the sample data;
And a screening unit 406, configured to screen a target data set from the data set to be screened according to the influence degrees corresponding to the data sets to be screened.
Optionally, the computing unit 404 is specifically configured to:
respectively calculating a first gradient of the sample data in a feature encoder of the model to be trained and a second gradient of each piece of data to be screened in the data set to be screened in the feature encoder;
and calculating cosine similarity of the first gradient and the second gradient, wherein the cosine similarity is used for representing the influence degree of corresponding data to be screened on the target application scene of the model to be trained.
Optionally, the influence degree is the influence degree of the corresponding data to be screened on the loss of the sample data in the iterative process; the computing unit 404 is specifically configured to:
using a first-order Taylor formula to develop a loss value of the sample data in an iterative process so as to obtain a loss formula of the sample data in the iterative process;
And optimizing the loss formula according to random gradient descent, and determining that the loss value approximates to the cosine similarity under the condition that the learning rate in the iterative process is sufficiently small according to the optimization result.
Optionally, the target application scenario is a financial scenario, and the sample data is financial task data.
Optionally, the apparatus further comprises:
and the training unit 408 is configured to take the target data set as a training data set, and train the model to be trained according to the training data set.
Optionally, the apparatus further comprises:
an adding unit 410 is configured to add the sample data to the training data set.
Optionally, the apparatus further comprises:
And the input unit 412 is configured to input the data to be processed into the large language model obtained by training in response to the obtained data to be processed in the target application scenario, so that the large language model processes the data to be processed.
It can be seen from the training data set screening apparatus for a large language model in the embodiments of this specification that, for a model to be trained that is applied to a target application scenario, the training data set is screened: the data set to be screened and sample data of the target application scenario are input into the model to be trained, and the degree of influence of each piece of data to be screened on the model to be trained in the target application scenario is calculated. Because the degree of influence characterizes how similar each piece of data to be screened is to the sample data, the target data set screened out of the data set to be screened on the basis of the degree of influence is more similar to the sample data. Compared with training on a general data set, training the model to be trained on the target data set is more efficient, and the trained model processes data of the target application scenario more accurately.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are mutually referred to, and each embodiment mainly describes differences from other embodiments. In particular, for the training data set screening apparatus embodiment of the large language model, since it is substantially similar to the training data set screening method embodiment of the large language model, the description is relatively simple, and the relevant points are referred to in the description of the method embodiment.
Please refer to fig. 5, which illustrates a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
As shown in fig. 5, the electronic device 500 may include: at least one processor 501, at least one network interface 504, a user interface 503, a memory 505, and at least one communication bus 502.
Wherein the communication bus 502 may be used to enable connectivity communication of the various components described above.
The user interface 503 may include keys, and the optional user interface may also include a standard wired interface, a wireless interface, among others.
The network interface 504 may include, but is not limited to, a bluetooth module, an NFC module, a Wi-Fi module, and the like.
Wherein the processor 501 may include one or more processing cores. The processor 501 utilizes various interfaces and lines to connect various portions of the overall electronic device 500, perform various functions of the electronic device 500, and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 505, and invoking data stored in the memory 505. Alternatively, the processor 501 may be implemented in at least one hardware form of DSP, FPGA, PLA. The processor 501 may integrate one or a combination of several of a CPU, GPU, modem, etc. The CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing the content required to be displayed by the display screen; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 501 and may be implemented by a single chip.
The memory 505 may include RAM or ROM. Optionally, the memory 505 comprises a non-transitory computer readable medium. The memory 505 may be used to store instructions, programs, code sets, or instruction sets. The memory 505 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, etc.; the data storage area may store the data and the like referred to in the respective method embodiments above. The memory 505 may also optionally be at least one storage device located remotely from the processor 501. As a computer storage medium, the memory 505 may include an operating system, a network communication module, a user interface module, and a training data set screening application for a large language model. The processor 501 may be configured to invoke the training data set screening application for a large language model stored in the memory 505 and perform the steps of the training data set screening method for a large language model mentioned in the foregoing embodiments.
Embodiments of the present disclosure also provide a computer-readable storage medium having instructions stored therein, which when executed on a computer or processor, cause the computer or processor to perform the steps of one or more of the embodiments shown in fig. 2-4 described above. The above-described constituent modules of the electronic apparatus may be stored in the computer-readable storage medium if implemented in the form of software functional units and sold or used as independent products.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of this specification are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those skilled in the art will appreciate that implementing all or part of the above-described embodiment methods may be accomplished by way of a computer program, which may be stored in a computer-readable storage medium, instructing relevant hardware, and which, when executed, may comprise the embodiment methods as described above. And the aforementioned storage medium includes: various media capable of storing program code, such as ROM, RAM, magnetic or optical disks. The technical features in the present examples and embodiments may be arbitrarily combined without conflict.
The above-described embodiments are merely preferred embodiments of the present disclosure, and do not limit the scope of the disclosure, and various modifications and improvements made by those skilled in the art to the technical solutions of the disclosure should fall within the protection scope defined by the claims of the disclosure without departing from the design spirit of the disclosure.

Claims (16)

1. A training dataset screening method for a large language model, comprising:
determining a target application scene of a model to be trained, and acquiring a data set to be screened and sample data corresponding to the target application scene;
Inputting the sample data and the data set to be screened into the model to be trained, and respectively calculating the influence degree of each piece of data to be screened in the data set to be screened on the target application scene of the model to be trained; the influence degree is used for representing the similarity degree of each piece of data to be screened and the sample data;
and screening a target data set from the data set to be screened according to the influence degree corresponding to each piece of data to be screened.
2. The method for screening training data sets of a large language model according to claim 1, wherein the calculating the influence degree of each piece of to-be-screened data in the to-be-screened data sets on the to-be-trained model on the target application scene includes:
respectively calculating a first gradient of the sample data in a feature encoder of the model to be trained and a second gradient of each piece of data to be screened in the data set to be screened in the feature encoder;
and calculating cosine similarity of the first gradient and the second gradient, wherein the cosine similarity is used for representing the influence degree of corresponding data to be screened on the target application scene of the model to be trained.
3. The training data set screening method of a large language model according to claim 2, wherein the influence degree is the influence degree of corresponding data to be screened on the loss of the sample data in the iterative process; the calculating the influence degree of each piece of to-be-screened data in the to-be-screened data set on the to-be-trained model on the target application scene includes:
using a first-order Taylor formula to develop a loss value of the sample data in an iterative process so as to obtain a loss formula of the sample data in the iterative process;
And optimizing the loss formula according to random gradient descent, and determining the cosine similarity according to an optimization result to be used for representing the loss value of the sample data under the condition that the learning rate in the iterative process is small enough.
4. The training data set screening method of a large language model according to claim 1, wherein the target application scene is a financial scene, and the sample data is financial data.
5. The training data set screening method of a large language model according to claim 1, further comprising the steps of:
And taking the target data set as a training data set, and training the model to be trained according to the training data set.
6. The training data set screening method of a large language model of claim 5, further comprising the steps of:
The sample data is added to the training dataset.
7. The training data set screening method of a large language model of claim 5, further comprising the steps of:
And responding to the acquired data to be processed in the target application scene, and inputting the data to be processed into a large-scale language model obtained through training so that the large-scale language model processes the data to be processed.
8. A training dataset screening apparatus for a large language model, comprising:
the acquisition unit is used for determining a target application scene of the model to be trained and acquiring a data set to be screened and sample data corresponding to the target application scene;
The computing unit is used for inputting the sample data and the data set to be screened into the model to be trained and respectively computing the influence degree of each piece of data to be screened in the data set to be screened on the target application scene of the model to be trained; the influence degree is used for representing the similarity degree of each piece of data to be screened and the sample data;
and the screening unit is used for screening out a target data set from the data set to be screened according to the influence degree corresponding to each piece of data to be screened.
9. The training data set screening apparatus of a large language model according to claim 8, wherein the computing unit is specifically configured to:
respectively calculating a first gradient of the sample data in a feature encoder of the model to be trained and a second gradient of each piece of data to be screened in the data set to be screened in the feature encoder;
and calculating cosine similarity of the first gradient and the second gradient, wherein the cosine similarity is used for representing the influence degree of corresponding data to be screened on the target application scene of the model to be trained.
10. The training data set screening device of a large language model according to claim 9, wherein the influence degree is the influence degree of corresponding data to be screened on the loss of the sample data in the iterative process; the calculation unit includes:
using a first-order Taylor formula to develop a loss value of the sample data in an iterative process so as to obtain a loss formula of the sample data in the iterative process;
And optimizing the loss formula according to random gradient descent, and determining that the loss value approximates to the cosine similarity under the condition that the learning rate in the iterative process is sufficiently small according to the optimization result.
11. The training data set screening apparatus of a large language model according to claim 8, wherein the target application scenario is a financial scenario, and the sample data is financial task data.
12. The training data set screening apparatus of a large language model of claim 8, further comprising:
And the training unit is used for taking the target data set as a training data set and training the model to be trained according to the training data set.
13. The training data set screening apparatus of a large language model of claim 12, further comprising:
An adding unit for adding the sample data to the training data set.
14. The training data set screening apparatus of a large language model of claim 12, further comprising:
the input unit is used for responding to the acquired data to be processed in the target application scene, inputting the data to be processed into a large-scale language model obtained through training, and enabling the large-scale language model to process the data to be processed.
15. An electronic device includes a processor and a memory;
The processor is connected with the memory;
The memory is used for storing executable program codes;
The processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory for performing the method according to any one of claims 1 to 7.
16. A computer readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the method according to any one of claims 1-7.
CN202410232115.7A 2024-02-29 2024-02-29 Training data set screening method and device for large language model, electronic equipment and storage medium Pending CN118114049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410232115.7A CN118114049A (en) 2024-02-29 2024-02-29 Training data set screening method and device for large language model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410232115.7A CN118114049A (en) 2024-02-29 2024-02-29 Training data set screening method and device for large language model, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118114049A true CN118114049A (en) 2024-05-31

Family

ID=91214952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410232115.7A Pending CN118114049A (en) 2024-02-29 2024-02-29 Training data set screening method and device for large language model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118114049A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination