CN116955590A - Training data screening method, model training method and text generation method - Google Patents

Training data screening method, model training method and text generation method

Info

Publication number
CN116955590A
CN116955590A (application number CN202311213090.8A)
Authority
CN
China
Prior art keywords
trained
data
training
language model
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311213090.8A
Other languages
Chinese (zh)
Other versions
CN116955590B (en)
Inventor
龚昊然
肖雪松
陈昶宇
韩威俊
严帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Minto Technology Co ltd
Original Assignee
Chengdu Minto Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Minto Technology Co ltd filed Critical Chengdu Minto Technology Co ltd
Priority to CN202311213090.8A priority Critical patent/CN116955590B/en
Publication of CN116955590A publication Critical patent/CN116955590A/en
Application granted granted Critical
Publication of CN116955590B publication Critical patent/CN116955590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training data screening method, a model training method and a text generation method, and relates to the technical field of computers. The training data screening method comprises the following steps: obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template; inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence; and screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all of the target to-be-trained data. Because the prompt sentence is obtained based on the user portrait, the comparison text output by the large language model for the prompt sentence matches the characteristics of the user portrait, and the target to-be-trained data subsequently screened with the comparison text also matches those characteristics. Each piece of target to-be-trained data in the resulting training data set therefore conforms to the user portrait, so that the data distribution is not destroyed.

Description

Training data screening method, model training method and text generation method
Technical Field
The application relates to the technical field of computers, in particular to a training data screening method, a model training method and a text generation method.
Background
Artificial intelligence generation models built on general-purpose large language models have made great progress in text writing. However, precisely because such models are general-purpose, it is difficult for them to generate different articles tailored to the characteristics of users in a specific scene.
At present, a general large language model is usually trained with a large amount of data. Training with a large, unscreened corpus, however, may destroy the distribution of the data, so that the text generated by the trained large language model does not conform to the characteristics of the user.
Disclosure of Invention
The application provides a training data screening method, a model training method and a text generation method, which address the problem in the prior art that training a general large language model with a large amount of data may destroy the data distribution, so that the text generated by the trained large language model does not conform to the characteristics of the user.
In a first aspect, the present application provides a training data screening method, including: obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template; inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence; and screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all of the target to-be-trained data.
In the embodiment of the application, the prompt sentence is obtained based on the user portrait, so the comparison text output by the large language model for the prompt sentence matches the characteristics of the user portrait, and the at least one piece of target to-be-trained data subsequently screened from the to-be-trained data set based on the comparison text also matches those characteristics. Each piece of target to-be-trained data in the resulting training data set therefore conforms to the user portrait, and the data distribution is not destroyed.
Based on the technical solution provided in the first aspect, in some possible implementation manners, screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text includes: calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set; and obtaining the target to-be-trained data based on the data to be trained that satisfies a preset similarity condition.
In the embodiment of the application, because the comparison text matches the characteristics of the user portrait, data to be trained that is highly similar to the comparison text also matches those characteristics. By comparing the similarity between the comparison text and each piece of data to be trained, the selected target to-be-trained data fits the user portrait more closely, which reduces the probability that the data distribution is damaged.
Based on the technical solution provided in the first aspect, in some possible implementation manners, the obtaining the target data to be trained based on the data to be trained that satisfies a preset similarity condition includes: each piece of data to be trained meeting the preset similarity condition is numbered respectively, wherein each number uniquely corresponds to one piece of data to be trained; carrying out collaborative filtering calculation on the data to be trained meeting the preset similarity condition, the corresponding number and the user portrait by using a preset collaborative filtering algorithm to obtain an output result; wherein the output result includes at least one number; and determining the data to be trained corresponding to each number included in the output result as the target data to be trained.
In the embodiment of the application, the data to be trained meeting the preset similarity condition is further screened through the collaborative filtering algorithm, so that the finally obtained target data to be trained can better accord with the characteristics of the user portrait.
Based on the technical solution provided in the first aspect, in some possible implementation manners, before calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set, the method further includes: encoding each piece of data to be trained in the to-be-trained data set based on a general large language model to obtain a training vector corresponding to each piece of data to be trained; correspondingly, calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set includes: encoding the comparison text based on the general large language model to obtain a comparison vector; and calculating the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set.
In the embodiment of the application, each piece of data to be trained in the to-be-trained data set is encoded with the general large language model, so the obtained training vectors accurately reflect the characteristics of the data to be trained, which in turn improves the accuracy of the subsequently calculated similarity.
Based on the technical solution provided in the first aspect, in some possible implementation manners, obtaining the prompt sentence based on the pre-acquired user portrait of the target user and the preset prompt sentence template includes: filling the user characteristic data in the user portrait into the corresponding positions in the prompt sentence template to obtain the prompt sentence.
In the embodiment of the application, the user characteristic data is filled into the prompt sentence template, so the obtained prompt sentence fully embodies the characteristics of the user portrait, which ensures that the comparison text output by the large language model also conforms to those characteristics.
Based on the technical solution provided in the first aspect, in some possible implementation manners, the user characteristic data includes at least one of a user industry, a job position, a responsibility and a work record.
In the embodiment of the application, the user industry, job position, responsibilities and work records all reflect the characteristics of the user portrait. Selecting at least one of these data items allows the subsequently obtained prompt sentence to reflect those characteristics, so that the finally obtained target to-be-trained data conforms to the user portrait more closely.
Based on the technical solution provided in the first aspect, in some possible implementation manners, after screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, the method further includes: training the large language model to be trained based on the training data set to obtain a trained large language model.
In the embodiment of the application, because each piece of target to-be-trained data in the obtained training data set conforms to the user portrait, training the large language model on this training data set does not destroy the data distribution. In addition, the prior art trains the large language model with all of the acquired data to be trained, without screening, so some data that does not conform to the user portrait is included. Here the acquired data to be trained is screened and the non-conforming data is filtered out, so the amount of data finally used to train the large language model is comparatively small, which reduces the data volume required for training and thereby reduces the training cost.
In a second aspect, the present application provides a model training method, including: obtaining a training dataset, wherein the training dataset is obtained based on the method as described in the first aspect and/or in combination with any possible implementation of the first aspect; and training the large language model to be trained based on the training data set to obtain a trained large language model.
In a third aspect, the present application provides a text generation method, including: acquiring a prompt sentence; inputting the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is trained based on the method in the second aspect.
In a fourth aspect, the present application provides a training data screening apparatus, including: a first processing module, configured to obtain a prompt sentence based on a user portrait of a target user and a preset prompt sentence template; a second processing module, configured to input the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence; and a third processing module, configured to screen at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all of the screened target to-be-trained data.
In a fifth aspect, the present application provides a model training apparatus comprising: the first acquisition module is used for acquiring a training data set, wherein the training data set is obtained based on the method according to the first aspect and/or any possible implementation manner of the first aspect; the fourth processing module is used for training the large language model to be trained based on the training data set to obtain a trained large language model.
In a sixth aspect, the present application provides a text generating apparatus, comprising: the second acquisition module is used for acquiring prompt sentences; and the fifth processing module is used for inputting the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is obtained by training based on the method in the second aspect.
In a seventh aspect, the present application provides an electronic device, comprising: the device comprises a memory and a processor, wherein the memory is connected with the processor; the memory is used for storing programs; the processor is configured to invoke a program stored in the memory, to perform the method of the first aspect and/or any possible implementation manner of the first aspect, and/or to perform the method of the second aspect, and/or to perform the method of the third aspect.
In an eighth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a computer, performs the method of the first aspect and/or any possible implementation manner of the first aspect, and/or performs the method of the second aspect, and/or performs the method of the third aspect.
The beneficial effects of the application are as follows: a prompt sentence is obtained from the user portrait, so the comparison text output by the large language model for that prompt sentence matches the characteristics of the user portrait, and the at least one piece of target to-be-trained data screened with the comparison text likewise matches those characteristics. Each piece of target to-be-trained data in the resulting training data set therefore conforms to the user portrait, and training a large language model with this training data set does not destroy the data distribution.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a training data screening method according to an embodiment of the present application;
FIG. 2 is a block diagram of a training data screening apparatus according to an embodiment of the present application;
FIG. 3 is a block diagram of a model training apparatus according to an embodiment of the present application;
fig. 4 is a block diagram showing a structure of a text generating apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely to distinguish one entity or action from another entity or action in the description of the application without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical scheme of the present application will be described in detail with reference to the accompanying drawings.
Because the prior art trains a general large language model with a large amount of data, the distribution of the data may be destroyed. For example, the probability of "basketball" appearing in sports news is certainly higher than the probability of it appearing in financial news. So if the model is meant to output sports text, financial data should not appear in the training set, otherwise the distribution of the word "basketball" in the training set is distorted.
The application provides a training data screening method, which aims to solve the problem that the distribution of data can be damaged when a large amount of data is used for training a general large language model.
Referring to fig. 1, fig. 1 is a flowchart of a training data screening method according to an embodiment of the present application, and the steps included in the training data screening method will be described with reference to fig. 1.
S100: and obtaining the prompt sentence based on the pre-obtained user portrait of the target user and a preset prompt sentence template.
The user portrait can be obtained in advance and stored in a storage medium, and can be directly called when it needs to be used. Alternatively, the user portrait may be created in real time when needed.
The specific manner and principles of creating the user's portrait are well known to those skilled in the art, and are not described herein for brevity.
In one embodiment, based on the pre-acquired user portrait of the target user and the preset prompt sentence template, the specific process of obtaining the prompt sentence may be: and filling the user characteristic data in the user portrait into the corresponding position in the prompt statement template to obtain the prompt statement.
The user characteristic data is data that characterizes the user portrait, and optionally may include at least one of the user industry, job position, responsibilities and work records.
The prompt sentence template can be obtained in advance and stored in a storage medium, and can be directly called when it needs to be used. Alternatively, the prompt sentence template may be obtained from other devices in real time when needed.
To facilitate understanding of the prompt sentence template, take as an example a template in which the user industry, job position, responsibilities, work records and other portrait features need to be filled in. One implementation of such a template may be: "Write as a {job position} in the {user industry}, required to perform {responsibilities}, while {other portrait features}; a possible output is {work record}." Each "{ }" marks user characteristic data to be filled in; for example, "{user industry}" marks the place where the data representing the user industry is filled in. Filling the user characteristic data into the corresponding "{ }" slots yields the prompt sentence.
This example is for ease of understanding only; the prompt sentence template may be set according to actual needs, and specific implementations of the prompt sentence template include, but are not limited to, the one illustrated here.
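By way of illustration only, the template-filling step described above can be sketched in a few lines of Python. The field names, the template wording and the example portrait below are assumptions made for this sketch, not the exact template of the application.

```python
# Minimal sketch of filling user characteristic data into a prompt sentence template.
# The template text and field names are illustrative assumptions only.
PROMPT_TEMPLATE = (
    "Write as a {job_position} in the {user_industry} industry, "
    "required to perform {responsibilities}; a possible output is {work_record}."
)

def build_prompt(user_portrait: dict) -> str:
    """Fill the user characteristic data from the user portrait into the template."""
    return PROMPT_TEMPLATE.format(
        user_industry=user_portrait["user_industry"],
        job_position=user_portrait["job_position"],
        responsibilities=user_portrait["responsibilities"],
        work_record=user_portrait["work_record"],
    )

if __name__ == "__main__":
    portrait = {
        "user_industry": "finance",
        "job_position": "market analyst",
        "responsibilities": "writing weekly market commentary",
        "work_record": "a 500-word commentary on bond yields",
    }
    print(build_prompt(portrait))  # the resulting string is the prompt sentence
```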
S200: and inputting the prompt sentences into the large language model to be trained, and obtaining the comparison text corresponding to the prompt sentences.
The specific structure and operation of a large language model are well known to those skilled in the art, and for brevity, the specific structure and operation of a large language model will not be described in detail herein.
S300: and screening at least one target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set.
Wherein the training data set comprises all target data to be trained.
After the comparison text is obtained, the specific manner of screening at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text may be any of the following embodiments.
In the first embodiment, comparison keywords that embody the characteristics of the comparison text can be extracted from the comparison text. The data to be trained whose number of contained comparison keywords is larger than a preset number threshold is then screened from the to-be-trained data set as the target to-be-trained data.
The preset number threshold can be set according to actual requirements, and the preset number threshold is not limited here.
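As a rough, non-authoritative illustration of this first embodiment, the sketch below stands in the keyword extraction with a simple most-frequent-words heuristic; the application itself does not prescribe a particular extractor, so the heuristic and the threshold value are assumptions.

```python
# Sketch of keyword-count screening (first embodiment).
# Keyword extraction is simplified to "most frequent words"; this heuristic and the
# threshold are assumptions, not the method fixed by the application.
import re
from collections import Counter

def extract_keywords(comparison_text: str, top_k: int = 10) -> set:
    words = re.findall(r"\w+", comparison_text.lower())
    return {w for w, _ in Counter(words).most_common(top_k)}

def screen_by_keywords(comparison_text: str, dataset: list, count_threshold: int = 3) -> list:
    """Keep candidate samples that contain more than `count_threshold` comparison keywords."""
    keywords = extract_keywords(comparison_text)
    selected = []
    for sample in dataset:
        hits = sum(1 for w in re.findall(r"\w+", sample.lower()) if w in keywords)
        if hits > count_threshold:
            selected.append(sample)
    return selected
```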
In a second embodiment, a specific implementation of screening at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text may be: calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set, and then obtaining the target to-be-trained data based on the data to be trained that satisfies a preset similarity condition.
The similarity between the comparison text and each piece of data to be trained can be calculated with any existing text-similarity method, for example the cosine similarity or the Euclidean distance of the two texts.
In order to facilitate calculation of the similarity between the comparison text and each piece of data to be trained in the data set to be trained, optionally, before calculation of the similarity between the comparison text and each piece of data to be trained in the data set to be trained, each piece of data to be trained in the data set to be trained may be encoded based on the general large language model to obtain a training vector corresponding to each piece of data to be trained.
Accordingly, the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set may then be calculated as follows: encode the comparison text with the general large language model to obtain a comparison vector, and then calculate the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained.
The similarity can be calculated conveniently by converting the comparison text and the data to be trained into the form of vectors. And each data to be trained in the data set to be trained is encoded based on the universal large language model, so that the obtained training vector can accurately reflect the characteristics of the data to be trained. And further improving the accuracy of the similarity obtained by subsequent calculation.
The general large language model may be any existing large language model, for example a model such as LLaMA; this example is only for ease of understanding and should not be taken as a limitation of the present application.
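A minimal sketch of this second embodiment is given below. A sentence-embedding model from the sentence-transformers library stands in for the general large language model encoder, and cosine similarity with a fixed threshold stands in for the preset similarity condition; both substitutions are assumptions made for illustration.

```python
# Sketch of similarity-based screening (second embodiment).
# The encoder model and the 0.7 threshold are illustrative assumptions; the application
# refers to encoding with a general large language model rather than this specific model.
import numpy as np
from sentence_transformers import SentenceTransformer

def screen_by_similarity(comparison_text: str, dataset: list,
                         similarity_threshold: float = 0.7) -> list:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in encoder
    comparison_vec = encoder.encode([comparison_text])[0]
    training_vecs = encoder.encode(dataset)              # one training vector per sample
    selected = []
    for sample, vec in zip(dataset, training_vecs):
        cosine = float(np.dot(comparison_vec, vec) /
                       (np.linalg.norm(comparison_vec) * np.linalg.norm(vec)))
        if cosine >= similarity_threshold:               # assumed form of the condition
            selected.append(sample)
    return selected
```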
In a third embodiment, a specific way of screening the at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text may be: first, number each piece of data to be trained in the to-be-trained data set, where each number uniquely corresponds to one piece of data to be trained; then perform a collaborative filtering calculation on each piece of data to be trained, its corresponding number and the user portrait using a preset collaborative filtering algorithm to obtain an output result, where the output result includes at least one number; and finally, determine the data to be trained corresponding to each number included in the output result as the target to-be-trained data.
The specific implementation and principles of collaborative filtering algorithms are well known to those skilled in the art, and are not described herein for brevity.
For ease of understanding, suppose the to-be-trained data set includes three pieces of data to be trained: 1, 2 and 3. First, data 1 is numbered 001, data 2 is numbered 002 and data 3 is numbered 003. The numbered data 1, 2 and 3 and the user portrait are then taken as input and processed with the collaborative filtering algorithm to obtain an output result, which may include at least one of the numbers 001, 002 and 003. If the output result includes 001, the data 1 corresponding to 001 is determined to be target to-be-trained data. This example is for ease of understanding only and should not be construed as limiting the application.
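The application does not fix a particular collaborative filtering algorithm, so the sketch below only illustrates the numbering step and the shape of the inputs and outputs; the scoring function is a hypothetical placeholder that a real system would replace with an actual collaborative filtering implementation.

```python
# Sketch of the numbering and selection flow of the third embodiment.
# `collaborative_filter` is a placeholder, not a real collaborative filtering algorithm.
def number_samples(dataset: list) -> dict:
    """Assign a unique number ('001', '002', ...) to each candidate sample."""
    return {f"{i + 1:03d}": sample for i, sample in enumerate(dataset)}

def collaborative_filter(numbered: dict, user_portrait: dict, top_k: int = 2) -> list:
    """Placeholder: rank samples by word overlap with the user portrait text."""
    portrait_words = set(" ".join(str(v) for v in user_portrait.values()).lower().split())
    scored = [(number, len(portrait_words & set(text.lower().split())))
              for number, text in numbered.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [number for number, _ in scored[:top_k]]       # output result: at least one number

def select_targets(dataset: list, user_portrait: dict) -> list:
    numbered = number_samples(dataset)
    selected_numbers = collaborative_filter(numbered, user_portrait)
    return [numbered[number] for number in selected_numbers]  # map numbers back to samples
```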
The above three embodiments of screening at least one target to-be-trained data from the pre-acquired to-be-trained data set based on the control text may be implemented separately, or two or three of them may be combined to jointly screen the target to-be-trained data.
Alternatively, the data to be trained that contains more than the preset number threshold of comparison keywords may first be screened from the to-be-trained data set as first intermediate data. The similarity between each piece of first intermediate data and the comparison text can then be calculated, and the target to-be-trained data obtained from the first intermediate data that satisfies the preset similarity condition.
Alternatively, the data to be trained that contains more than the preset number threshold of comparison keywords may first be screened from the to-be-trained data set as first intermediate data. Each piece of first intermediate data is then numbered, and a collaborative filtering calculation is performed on each piece of first intermediate data, its corresponding number and the user portrait using the preset collaborative filtering algorithm to obtain an output result. Finally, the first intermediate data corresponding to each number included in the output result is determined to be target to-be-trained data.
Alternatively, the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set may be calculated first. Each piece of data to be trained that satisfies the preset similarity condition is then numbered, a collaborative filtering calculation is performed on that data, the corresponding numbers and the user portrait using the preset collaborative filtering algorithm to obtain an output result, and finally the data to be trained corresponding to each number included in the output result is determined to be target to-be-trained data.
Alternatively, the data to be trained that contains more than the preset number threshold of comparison keywords may first be screened from the to-be-trained data set as first intermediate data. The similarity between each piece of first intermediate data and the comparison text is then calculated, each piece of first intermediate data that satisfies the preset similarity condition is numbered, and a collaborative filtering calculation is performed on that first intermediate data, the corresponding numbers and the user portrait using the preset collaborative filtering algorithm to obtain an output result. Finally, the first intermediate data corresponding to each number included in the output result is determined to be target to-be-trained data.
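Chaining the helper functions from the sketches above gives one possible form of this last combined variant (keyword screening, then similarity screening, then collaborative filtering); the thresholds are illustrative and the functions are assumed to be in scope.

```python
# One possible chaining of the three screening steps sketched earlier.
# Assumes screen_by_keywords, screen_by_similarity and select_targets are in scope.
def build_training_dataset(comparison_text: str, dataset: list, user_portrait: dict) -> list:
    first_intermediate = screen_by_keywords(comparison_text, dataset, count_threshold=3)
    satisfying = screen_by_similarity(comparison_text, first_intermediate,
                                      similarity_threshold=0.7)
    return select_targets(satisfying, user_portrait)      # the target to-be-trained data
```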
In one embodiment, after at least one target to-be-trained data is screened from a pre-acquired to-be-trained data set based on a comparison text to obtain a training data set, a large language model to be trained can be trained based on the training data set to obtain a trained large language model.
In practical application, the training data set may be obtained from a third party, and at this time, the specific step of the model training method may be to obtain the training data set first, and then train the large language model to be trained based on the training data set, so as to obtain a trained large language model. The obtained training data set is obtained based on the training data screening method.
Because each piece of target to-be-trained data in the obtained training data set conforms to the user portrait, training the large language model on this training data set does not destroy the data distribution. In addition, because the training data set has been screened, compared with the prior art that trains the large language model with all of the acquired data to be trained, the application reduces the amount of data required for training and thereby reduces the training cost.
The specific manner and principles of training a large language model are well known to those skilled in the art, and are not described herein for brevity.
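For completeness, a hedged sketch of fine-tuning a causal language model on the screened training data set is given below, using the Hugging Face transformers and datasets libraries. The base model name, hyperparameters and plain-text data format are assumptions for illustration, not the training setup of the application.

```python
# Hedged sketch: fine-tune a causal language model on the screened training data set.
# Base model, hyperparameters and data format are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def finetune(training_samples: list, base_model: str = "gpt2",
             output_dir: str = "./screened-llm") -> None:
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(base_model)

    dataset = Dataset.from_dict({"text": training_samples})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
```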
After the trained large language model is obtained, the trained large language model can be applied to generating text data. Specifically, the text generation method may include the following steps: firstly, acquiring a prompt statement; and then inputting the prompt sentence into the pre-trained large language model to obtain text data output by the large language model.
The prompt sentence used here may or may not take the same concrete form as the prompt sentence generated from the prompt sentence template. It may be input directly by the user, or generated in the manner described above.
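A corresponding sketch of the text generation step follows; the generation parameters and the model directory are illustrative assumptions.

```python
# Sketch: generate text from a prompt sentence with the trained model.
# Generation parameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_text(prompt_sentence: str, model_dir: str = "./screened-llm") -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt_sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```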
Based on the same technical concept, the application also provides a training data screening apparatus, as shown in fig. 2. The training data screening apparatus 100 includes a first processing module 110, a second processing module 120, and a third processing module 130.
The first processing module 110 is configured to obtain a prompt sentence based on a pre-acquired user portrait of the target user and a preset prompt sentence template.
And the second processing module 120 is configured to input the prompt sentence into a large language model to be trained, and obtain a comparison text corresponding to the prompt sentence.
And the third processing module 130 is configured to screen at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, where the training data set includes all of the screened target to-be-trained data.
The third processing module 130 is specifically configured to calculate the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set, and to obtain the target to-be-trained data based on the data to be trained that satisfies the preset similarity condition.
The third processing module 130 is specifically configured to number each data to be trained that meets a preset similarity condition, where each number uniquely corresponds to one data to be trained; carrying out collaborative filtering calculation on the data to be trained meeting the preset similarity condition, the corresponding number and the user portrait by using a preset collaborative filtering algorithm to obtain an output result; wherein the output result includes at least one number; and determining the data to be trained corresponding to each number included in the output result as the target data to be trained.
The third processing module 130 is further configured to encode each piece of data to be trained in the to-be-trained data set based on a general large language model to obtain a training vector corresponding to each piece of data to be trained; to encode the comparison text based on the general large language model to obtain a comparison vector; and to calculate the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set.
The first processing module 110 is specifically configured to fill the user characteristic data in the user portrait into the corresponding positions in the prompt sentence template to obtain the prompt sentence.
In one embodiment, the user characteristic data includes at least one of user industry, job position, responsibility, and job record.
The training data screening apparatus 100 further includes a training module, configured to, after at least one piece of target to-be-trained data is screened from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, train the large language model to be trained based on that training data set to obtain a trained large language model.
The training data screening apparatus 100 according to the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing training data screening method embodiment, and for brevity, reference may be made to corresponding contents in the foregoing training data screening method embodiment where the apparatus embodiment portion is not mentioned.
Based on the same technical concept, the present application further provides a model training apparatus, as shown in fig. 3, where the model training apparatus 200 includes a first obtaining module 210 and a fourth processing module 220.
The first obtaining module 210 is configured to obtain a training data set, where the training data set is obtained based on the training data screening method described above.
And a fourth processing module 220, configured to train the large language model to be trained based on the training data set, so as to obtain a trained large language model.
The model training device 200 according to the embodiment of the present application has the same implementation principle and the same technical effects as those of the foregoing model training method embodiment, and for brevity, reference may be made to corresponding contents in the foregoing model training method embodiment where the device embodiment is not mentioned.
Based on the same technical concept, the present application also provides a text generating apparatus, and as shown in fig. 4, the text generating apparatus 300 includes a second obtaining module 310 and a fifth processing module 320.
A second obtaining module 310, configured to obtain the prompt sentence.
And a fifth processing module 320, configured to input the prompt sentence into a pre-trained large language model, and obtain text data output by the large language model, where the large language model is obtained by training based on the foregoing model training method.
The text generating device 300 according to the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing text generating method embodiment, and for brevity, reference may be made to corresponding contents in the foregoing text generating method embodiment for a part of the description of the device embodiment that is not mentioned.
Please refer to fig. 5, which illustrates an electronic device 400 according to an embodiment of the present application. The electronic device 400 includes: processor 410, memory 420.
The memory 420 and the processor 410 are electrically connected directly or indirectly to each other to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 420 is used for storing computer programs, such as the software functional modules shown in fig. 2, 3 and 4, that is, the training data screening apparatus 100, the model training apparatus 200 and the text generating apparatus 300. Each apparatus comprises at least one software functional module that may be stored in the memory 420 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 400.
The processor 410 is configured to execute executable modules stored in the memory 420, such as software functional modules or computer programs included in the training data screening apparatus 100. At this time, the processor 410 is configured to obtain a prompt sentence based on a user portrait of the target user and a preset prompt sentence template; inputting the prompt sentences into a large language model to be trained, and obtaining comparison texts corresponding to the prompt sentences; and screening at least one target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all the target to-be-trained data.
The processor 410 is configured to execute executable modules stored in the memory 420, such as software functional modules or computer programs included in the model training apparatus 200. At this time, the processor 410 is configured to obtain a training data set, where the training data set is obtained based on the foregoing training data screening method; and training the large language model to be trained based on the training data set to obtain a trained large language model.
The processor 410 is configured to execute executable modules stored in the memory 420, such as software functional modules or computer programs included in the text generating apparatus 300. At this time, the processor 410 is configured to obtain a prompt sentence, and to input the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is obtained by training based on the model training method.
The memory 420 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The processor 410 may be an integrated circuit chip having signal processing capabilities. It may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor 410 may be any conventional processor or the like.
The electronic device 400 includes, but is not limited to, a personal computer, a server, and the like.
The embodiment of the present application further provides a computer readable storage medium (hereinafter referred to as a storage medium) storing a computer program, where when the computer program is executed by a computer such as the electronic device 400, at least one of the training data screening method, the model training method, and the text generating method described above is executed. The computer-readable storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A training data screening method, comprising:
obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template;
inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence;
and screening at least one target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all the target to-be-trained data.
2. The method of claim 1, wherein screening at least one target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text comprises:
calculating the similarity between the comparison text and each piece of data to be trained in the data set to be trained;
and obtaining the target data to be trained based on the data to be trained meeting the preset similarity condition.
3. The method according to claim 2, wherein obtaining the target data to be trained based on the data to be trained that satisfies a preset similarity condition comprises:
each piece of data to be trained meeting the preset similarity condition is numbered respectively, wherein each number uniquely corresponds to one piece of data to be trained;
carrying out collaborative filtering calculation on the data to be trained meeting the preset similarity condition, the corresponding number and the user portrait by using a preset collaborative filtering algorithm to obtain an output result; wherein the output result includes at least one number;
and determining the data to be trained corresponding to each number included in the output result as the target data to be trained.
4. The method of claim 2, wherein prior to calculating the similarity between the comparison text and each piece of data to be trained in the data set to be trained, the method further comprises:
coding each piece of data to be trained in the data set to be trained based on a general large language model to obtain a training vector corresponding to each piece of data to be trained;
correspondingly, calculating the similarity between the comparison text and each piece of data to be trained in the data set to be trained comprises the following steps:
coding the comparison text based on a general large language model to obtain a comparison vector;
and calculating the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained in the data set to be trained.
5. The method of claim 1, wherein obtaining the prompt sentence based on the pre-acquired user portrait of the target user and the preset prompt sentence template comprises:
and filling the user characteristic data in the user portrait into the corresponding position in the prompt sentence template to obtain the prompt sentence.
6. The method of claim 5, wherein the user characteristic data comprises at least one of user industry, job title, responsibility, job record.
7. The method according to any one of claims 1-6, wherein after screening at least one target data to be trained from a pre-acquired set of data to be trained based on the comparison text, the method further comprises:
and training the large language model to be trained based on the training data set to obtain a trained large language model.
8. A method of model training, comprising:
obtaining a training dataset, wherein the training dataset is derived based on the method of any of claims 1-7;
and training the large language model to be trained based on the training data set to obtain a trained large language model.
9. A text generation method, comprising:
acquiring a prompt sentence;
inputting the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is trained based on the method as claimed in claim 8.
10. A training data screening apparatus comprising:
the first processing module is used for obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template;
the second processing module is used for inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence;
and the third processing module is used for screening at least one target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all screened target to-be-trained data.
11. A model training device, comprising:
a first acquisition module for acquiring a training data set, wherein the training data set is obtained based on the method of any one of claims 1-7;
and the fourth processing module is used for training the large language model to be trained based on the training data set to obtain a trained large language model.
12. A text generating apparatus, comprising:
the second acquisition module is used for acquiring the prompt sentence;
and a fifth processing module, configured to input the prompt sentence into a pre-trained large language model, to obtain text data output by the large language model, where the large language model is obtained by training based on the method according to claim 8.
13. An electronic device, comprising: the device comprises a memory and a processor, wherein the memory is connected with the processor;
the memory is used for storing programs;
the processor being adapted to invoke a program stored in the memory for performing the method according to any of claims 1-9.
14. A computer-readable storage medium, on which a computer program is stored, which, when being run by a computer, performs the method according to any one of claims 1-9.
CN202311213090.8A 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method Active CN116955590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311213090.8A CN116955590B (en) 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311213090.8A CN116955590B (en) 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method

Publications (2)

Publication Number Publication Date
CN116955590A true CN116955590A (en) 2023-10-27
CN116955590B CN116955590B (en) 2023-12-08

Family

ID=88451472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311213090.8A Active CN116955590B (en) 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method

Country Status (1)

Country Link
CN (1) CN116955590B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919661A (en) * 2017-02-13 2017-07-04 腾讯科技(深圳)有限公司 A kind of affective style recognition methods and relevant apparatus
CN110610705A (en) * 2019-09-20 2019-12-24 上海数鸣人工智能科技有限公司 Voice interaction prompter based on artificial intelligence
CN112818082A (en) * 2019-11-15 2021-05-18 北京沃东天骏信息技术有限公司 Evaluation text pushing method and device
CN112560996A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 User portrait recognition model training method, device, readable storage medium and product
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model
CN115081501A (en) * 2021-03-15 2022-09-20 中国电信股份有限公司 User classification method and device, cascaded user classification model and equipment
CN113378970A (en) * 2021-06-28 2021-09-10 平安普惠企业管理有限公司 Sentence similarity detection method and device, electronic equipment and storage medium
US20230042683A1 (en) * 2021-08-04 2023-02-09 International Business Machines Corporation Identifying and transforming text difficult to understand by user
CN113807809A (en) * 2021-08-24 2021-12-17 姚玲 Method for constructing audit user portrait based on machine learning technology
KR102442435B1 (en) * 2021-11-11 2022-09-14 주식회사 파블로아트컴퍼니 An artificial intelligence system capable of collecting interactive picture data for building artificial intelligence data, evaluating art competency, and predicting academic achievement
CN114757176A (en) * 2022-05-24 2022-07-15 上海弘玑信息技术有限公司 Method for obtaining target intention recognition model and intention recognition method
CN116127049A (en) * 2023-04-17 2023-05-16 平安银行股份有限公司 Model training method, text generation method, terminal device and computer medium
CN116187324A (en) * 2023-04-28 2023-05-30 西湖大学 Method, system and medium for generating cross-language abstract for long text of source language
CN116821287A (en) * 2023-08-28 2023-09-29 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method
CN116823410A (en) * 2023-08-29 2023-09-29 阿里巴巴(成都)软件技术有限公司 Data processing method, object processing method, recommending method and computing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Guoqiao et al., "基于TF-IDF的社交电商文本信息分类研究" [Research on Social E-commerce Text Information Classification Based on TF-IDF], 《网络空间安全》 [Cyberspace Security], pages 32-38 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725414A (en) * 2023-12-13 2024-03-19 北京海泰方圆科技股份有限公司 Training content generation model method, device and equipment for determining output content

Also Published As

Publication number Publication date
CN116955590B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
US10311099B2 (en) Method and system for 3D model database retrieval
CN113051371B (en) Chinese machine reading understanding method and device, electronic equipment and storage medium
CN116955590B (en) Training data screening method, model training method and text generation method
CN108897852B (en) Method, device and equipment for judging continuity of conversation content
CN110866391A (en) Title generation method, title generation device, computer readable storage medium and computer equipment
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN115114919A (en) Method and device for presenting prompt information and storage medium
CN111178064A (en) Information pushing method and device based on field word segmentation processing and computer equipment
CN117197271A (en) Image generation method, device, electronic equipment and storage medium
CN114330251A (en) Text generation method, model training method, device and storage medium
CN111523301B (en) Contract document compliance checking method and device
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN117787226A (en) Label generation model training method and device, electronic equipment and storage medium
CN112527967A (en) Text matching method, device, terminal and storage medium
CN112668325A (en) Machine translation enhancing method, system, terminal and storage medium
CN117173269A (en) Face image generation method and device, electronic equipment and storage medium
CN116127049A (en) Model training method, text generation method, terminal device and computer medium
CN113704452B (en) Data recommendation method, device, equipment and medium based on Bert model
CN112528646B (en) Word vector generation method, terminal device and computer-readable storage medium
CN113204629A (en) Text matching method and device, computer equipment and readable storage medium
CN111767395A (en) Abstract generation method and system based on picture
CN114091662B (en) Text image generation method and device and electronic equipment
CN113268997B (en) Text translation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Training data filtering methods, model training methods, and text generation methods

Granted publication date: 20231208

Pledgee: Shanghai Pudong Development Bank Co.,Ltd. Chengdu Branch

Pledgor: CHENGDU MINTO TECHNOLOGY CO.,LTD.

Registration number: Y2024980021746