CN116955590A - Training data screening method, model training method and text generation method - Google Patents

Training data screening method, model training method and text generation method

Info

Publication number
CN116955590A
CN116955590A (application number CN202311213090.8A)
Authority
CN
China
Prior art keywords
trained
data
training
language model
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311213090.8A
Other languages
Chinese (zh)
Other versions
CN116955590B (en)
Inventor
龚昊然
肖雪松
陈昶宇
韩威俊
严帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Minto Technology Co ltd
Original Assignee
Chengdu Minto Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Minto Technology Co ltd filed Critical Chengdu Minto Technology Co ltd
Priority to CN202311213090.8A priority Critical patent/CN116955590B/en
Publication of CN116955590A publication Critical patent/CN116955590A/en
Application granted granted Critical
Publication of CN116955590B publication Critical patent/CN116955590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a training data screening method, a model training method and a text generation method, and relates to the technical field of computers. The training data screening method comprises the following steps: obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template; inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence; and screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all of the target to-be-trained data. Because the prompt sentence is obtained based on the user portrait, the comparison text output by the large language model for the prompt sentence matches the characteristics of the user portrait, and the target to-be-trained data subsequently screened with the comparison text also matches those characteristics. Each piece of target to-be-trained data in the resulting training data set therefore conforms to the user portrait, so that the data distribution is not destroyed.

Description

Training data screening method, model training method and text generation method
Technical Field
The application relates to the technical field of computers, in particular to a training data screening method, a model training method and a text generation method.
Background
Artificial intelligence generation models built on general-purpose large language models have made great progress in text writing. However, precisely because such models are general-purpose, it is difficult for them to generate different articles tailored to the characteristics of users in a specific scene.
At present, a general large language model is usually trained with a large amount of data. Training with a large, unscreened corpus, however, may destroy the distribution of the data, so that the text generated by the trained large language model does not conform to the characteristics of the user.
Disclosure of Invention
The application provides a training data screening method, a model training method and a text generation method, which address the problem in the prior art that training a general large language model with a large amount of data may destroy the data distribution, so that the text generated by the trained large language model does not conform to the characteristics of the user.
In a first aspect, the present application provides a training data screening method, including: obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template; inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence; and screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all of the target to-be-trained data.
In the embodiment of the application, the prompt sentence is obtained based on the user portrait, so the comparison text output by the large language model for the prompt sentence matches the characteristics of the user portrait, and the at least one piece of target to-be-trained data subsequently screened from the to-be-trained data set based on the comparison text also matches those characteristics. Each piece of target to-be-trained data in the resulting training data set therefore conforms to the user portrait, and the data distribution is not destroyed.
Based on the technical solution provided in the first aspect, in some possible implementation manners, screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text includes: calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set; and obtaining the target to-be-trained data based on the data to be trained that satisfies a preset similarity condition.
In the embodiment of the application, because the comparison text matches the characteristics of the user portrait, data to be trained that is highly similar to the comparison text also matches those characteristics. By comparing the similarity between the comparison text and each piece of data to be trained, the selected target to-be-trained data fits the user portrait more closely, which reduces the probability that the data distribution is damaged.
Based on the technical solution provided in the first aspect, in some possible implementation manners, the obtaining the target data to be trained based on the data to be trained that satisfies a preset similarity condition includes: each piece of data to be trained meeting the preset similarity condition is numbered respectively, wherein each number uniquely corresponds to one piece of data to be trained; carrying out collaborative filtering calculation on the data to be trained meeting the preset similarity condition, the corresponding number and the user portrait by using a preset collaborative filtering algorithm to obtain an output result; wherein the output result includes at least one number; and determining the data to be trained corresponding to each number included in the output result as the target data to be trained.
In the embodiment of the application, the data to be trained meeting the preset similarity condition is further screened through the collaborative filtering algorithm, so that the finally obtained target data to be trained can better accord with the characteristics of the user portrait.
Based on the technical solution provided in the first aspect, in some possible implementation manners, before calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set, the method further includes: encoding each piece of data to be trained in the to-be-trained data set based on a general large language model to obtain a training vector corresponding to each piece of data to be trained; correspondingly, calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set includes: encoding the comparison text based on the general large language model to obtain a comparison vector; and calculating the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set.
In the embodiment of the application, each piece of data to be trained in the to-be-trained data set is encoded with the general large language model, so the obtained training vectors accurately reflect the characteristics of the data to be trained, which in turn improves the accuracy of the subsequently calculated similarity.
Based on the technical solution provided in the first aspect, in some possible implementation manners, obtaining the prompt sentence based on the pre-acquired user portrait of the target user and the preset prompt sentence template includes: filling the user characteristic data in the user portrait into the corresponding positions in the prompt sentence template to obtain the prompt sentence.
In the embodiment of the application, the user characteristic data is filled into the prompt sentence template, so the obtained prompt sentence fully embodies the characteristics of the user portrait, which ensures that the comparison text output by the large language model also conforms to those characteristics.
Based on the technical solution provided in the first aspect, in some possible implementation manners, the user characteristic data includes at least one of a user industry, a job position, a responsibility and a work record.
In the embodiment of the application, the user industry, job position, responsibilities and work records all reflect the characteristics of the user portrait. Selecting at least one of these data items allows the subsequently obtained prompt sentence to reflect those characteristics, so that the finally obtained target to-be-trained data conforms to the user portrait more closely.
Based on the technical solution provided in the first aspect, in some possible implementation manners, after screening at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, the method further includes: training the large language model to be trained based on the training data set to obtain a trained large language model.
In the embodiment of the application, because each piece of target to-be-trained data in the obtained training data set conforms to the user portrait, training the large language model on this training data set does not destroy the data distribution. In addition, the prior art trains the large language model with all of the acquired data to be trained, without screening, so some data that does not conform to the user portrait is included. Here the acquired data to be trained is screened and the non-conforming data is filtered out, so the amount of data finally used to train the large language model is comparatively small, which reduces the data volume required for training and thereby reduces the training cost.
In a second aspect, the present application provides a model training method, including: obtaining a training dataset, wherein the training dataset is obtained based on the method as described in the first aspect and/or in combination with any possible implementation of the first aspect; and training the large language model to be trained based on the training data set to obtain a trained large language model.
In a third aspect, the present application provides a text generation method, including: acquiring a prompt sentence; inputting the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is trained based on the method in the second aspect.
In a fourth aspect, the present application provides a training data screening apparatus, including: a first processing module, configured to obtain a prompt sentence based on a user portrait of a target user and a preset prompt sentence template; a second processing module, configured to input the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence; and a third processing module, configured to screen at least one piece of target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all of the screened target to-be-trained data.
In a fifth aspect, the present application provides a model training apparatus comprising: the first acquisition module is used for acquiring a training data set, wherein the training data set is obtained based on the method according to the first aspect and/or any possible implementation manner of the first aspect; the fourth processing module is used for training the large language model to be trained based on the training data set to obtain a trained large language model.
In a sixth aspect, the present application provides a text generating apparatus, comprising: the second acquisition module is used for acquiring prompt sentences; and the fifth processing module is used for inputting the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is obtained by training based on the method in the second aspect.
In a seventh aspect, the present application provides an electronic device, comprising: the device comprises a memory and a processor, wherein the memory is connected with the processor; the memory is used for storing programs; the processor is configured to invoke a program stored in the memory, to perform the method of the first aspect and/or any possible implementation manner of the first aspect, and/or to perform the method of the second aspect, and/or to perform the method of the third aspect.
In an eighth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a computer, performs the method of the first aspect and/or any possible implementation manner of the first aspect, and/or performs the method of the second aspect, and/or performs the method of the third aspect.
The beneficial effects of the application are as follows: a prompt sentence is obtained from the user portrait, so the comparison text output by the large language model for that prompt sentence matches the characteristics of the user portrait, and the at least one piece of target to-be-trained data screened with the comparison text likewise matches those characteristics. Each piece of target to-be-trained data in the resulting training data set therefore conforms to the user portrait, and training a large language model with this training data set does not destroy the data distribution.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a training data screening method according to an embodiment of the present application;
FIG. 2 is a block diagram of a training data screening apparatus according to an embodiment of the present application;
FIG. 3 is a block diagram of a model training apparatus according to an embodiment of the present application;
fig. 4 is a block diagram showing a structure of a text generating apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely to distinguish one entity or action from another entity or action in the description of the application without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical scheme of the present application will be described in detail with reference to the accompanying drawings.
Because the prior art trains a general large language model with a large amount of data, the distribution of the data may be destroyed. For example, the probability of "basketball" appearing in sports news is certainly higher than the probability of it appearing in financial news. So if the model is meant to output sports text, financial data should not appear in the training set, otherwise the distribution of the word "basketball" in the training set is distorted.
The application provides a training data screening method, which aims to solve the problem that the distribution of data can be damaged when a large amount of data is used for training a general large language model.
Referring to fig. 1, fig. 1 is a flowchart of a training data screening method according to an embodiment of the present application, and the steps included in the training data screening method will be described with reference to fig. 1.
S100: and obtaining the prompt sentence based on the pre-obtained user portrait of the target user and a preset prompt sentence template.
The user portrait can be obtained in advance and stored in a storage medium, and can be directly called when it needs to be used. Alternatively, the user portrait may be created in real time when needed.
The specific manner and principles of creating the user's portrait are well known to those skilled in the art, and are not described herein for brevity.
In one embodiment, based on the pre-acquired user portrait of the target user and the preset prompt sentence template, the specific process of obtaining the prompt sentence may be: and filling the user characteristic data in the user portrait into the corresponding position in the prompt statement template to obtain the prompt statement.
The user characteristic data is data that characterizes the user portrait, and optionally may include at least one of the user industry, job position, responsibilities and work records.
The prompt sentence template can be obtained in advance and stored in a storage medium, and can be directly called when it needs to be used. Alternatively, the prompt sentence template may be obtained from other devices in real time when needed.
To facilitate understanding of the prompt sentence template, take as an example a template in which the user industry, job position, responsibilities, work records and other portrait features need to be filled in. One implementation of such a template may be: "Write as a {job position} in the {user industry}, required to perform {responsibilities}, while {other portrait features}; a possible output is {work record}." Each "{ }" marks user characteristic data to be filled in; for example, "{user industry}" marks the place where the data representing the user industry is filled in. Filling the user characteristic data into the corresponding "{ }" slots yields the prompt sentence.
This example is for ease of understanding only; the prompt sentence template may be set according to actual needs, and specific implementations of the prompt sentence template include, but are not limited to, the one illustrated here.
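By way of illustration only, the template-filling step described above can be sketched in a few lines of Python. The field names, the template wording and the example portrait below are assumptions made for this sketch, not the exact template of the application.

```python
# Minimal sketch of filling user characteristic data into a prompt sentence template.
# The template text and field names are illustrative assumptions only.
PROMPT_TEMPLATE = (
    "Write as a {job_position} in the {user_industry} industry, "
    "required to perform {responsibilities}; a possible output is {work_record}."
)

def build_prompt(user_portrait: dict) -> str:
    """Fill the user characteristic data from the user portrait into the template."""
    return PROMPT_TEMPLATE.format(
        user_industry=user_portrait["user_industry"],
        job_position=user_portrait["job_position"],
        responsibilities=user_portrait["responsibilities"],
        work_record=user_portrait["work_record"],
    )

if __name__ == "__main__":
    portrait = {
        "user_industry": "finance",
        "job_position": "market analyst",
        "responsibilities": "writing weekly market commentary",
        "work_record": "a 500-word commentary on bond yields",
    }
    print(build_prompt(portrait))  # the resulting string is the prompt sentence
```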
S200: and inputting the prompt sentences into the large language model to be trained, and obtaining the comparison text corresponding to the prompt sentences.
The specific structure and operation of a large language model are well known to those skilled in the art, and for brevity, the specific structure and operation of a large language model will not be described in detail herein.
S300: and screening at least one target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set.
Wherein the training data set comprises all target data to be trained.
After the comparison text is obtained, the specific manner of screening at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text may be any of the following embodiments.
In the first embodiment, comparison keywords that embody the characteristics of the comparison text can be extracted from the comparison text. The data to be trained whose number of contained comparison keywords is larger than a preset number threshold is then screened from the to-be-trained data set as the target to-be-trained data.
The preset number threshold can be set according to actual requirements, and the preset number threshold is not limited here.
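As a rough, non-authoritative illustration of this first embodiment, the sketch below stands in the keyword extraction with a simple most-frequent-words heuristic; the application itself does not prescribe a particular extractor, so the heuristic and the threshold value are assumptions.

```python
# Sketch of keyword-count screening (first embodiment).
# Keyword extraction is simplified to "most frequent words"; this heuristic and the
# threshold are assumptions, not the method fixed by the application.
import re
from collections import Counter

def extract_keywords(comparison_text: str, top_k: int = 10) -> set:
    words = re.findall(r"\w+", comparison_text.lower())
    return {w for w, _ in Counter(words).most_common(top_k)}

def screen_by_keywords(comparison_text: str, dataset: list, count_threshold: int = 3) -> list:
    """Keep candidate samples that contain more than `count_threshold` comparison keywords."""
    keywords = extract_keywords(comparison_text)
    selected = []
    for sample in dataset:
        hits = sum(1 for w in re.findall(r"\w+", sample.lower()) if w in keywords)
        if hits > count_threshold:
            selected.append(sample)
    return selected
```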
In a second embodiment, a specific implementation of screening at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text may be: calculating the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set, and then obtaining the target to-be-trained data based on the data to be trained that satisfies a preset similarity condition.
The similarity between the comparison text and each piece of data to be trained can be calculated with any existing text-similarity method, for example the cosine similarity or the Euclidean distance of the two texts.
In order to facilitate calculation of the similarity between the comparison text and each piece of data to be trained in the data set to be trained, optionally, before calculation of the similarity between the comparison text and each piece of data to be trained in the data set to be trained, each piece of data to be trained in the data set to be trained may be encoded based on the general large language model to obtain a training vector corresponding to each piece of data to be trained.
Accordingly, the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set may then be calculated as follows: encode the comparison text with the general large language model to obtain a comparison vector, and then calculate the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained.
The similarity can be calculated conveniently by converting the comparison text and the data to be trained into the form of vectors. And each data to be trained in the data set to be trained is encoded based on the universal large language model, so that the obtained training vector can accurately reflect the characteristics of the data to be trained. And further improving the accuracy of the similarity obtained by subsequent calculation.
The general large language model may be any existing large language model, for example a model such as LLaMA; this example is only for ease of understanding and should not be taken as a limitation of the present application.
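A minimal sketch of this second embodiment is given below. A sentence-embedding model from the sentence-transformers library stands in for the general large language model encoder, and cosine similarity with a fixed threshold stands in for the preset similarity condition; both substitutions are assumptions made for illustration.

```python
# Sketch of similarity-based screening (second embodiment).
# The encoder model and the 0.7 threshold are illustrative assumptions; the application
# refers to encoding with a general large language model rather than this specific model.
import numpy as np
from sentence_transformers import SentenceTransformer

def screen_by_similarity(comparison_text: str, dataset: list,
                         similarity_threshold: float = 0.7) -> list:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in encoder
    comparison_vec = encoder.encode([comparison_text])[0]
    training_vecs = encoder.encode(dataset)              # one training vector per sample
    selected = []
    for sample, vec in zip(dataset, training_vecs):
        cosine = float(np.dot(comparison_vec, vec) /
                       (np.linalg.norm(comparison_vec) * np.linalg.norm(vec)))
        if cosine >= similarity_threshold:               # assumed form of the condition
            selected.append(sample)
    return selected
```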
In a third embodiment, a specific way of screening the at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text may be: first, number each piece of data to be trained in the to-be-trained data set, where each number uniquely corresponds to one piece of data to be trained; then perform a collaborative filtering calculation on each piece of data to be trained, its corresponding number and the user portrait using a preset collaborative filtering algorithm to obtain an output result, where the output result includes at least one number; and finally, determine the data to be trained corresponding to each number included in the output result as the target to-be-trained data.
The specific implementation and principles of collaborative filtering algorithms are well known to those skilled in the art, and are not described herein for brevity.
For ease of understanding, suppose the to-be-trained data set includes three pieces of data to be trained: 1, 2 and 3. First, data 1 is numbered 001, data 2 is numbered 002 and data 3 is numbered 003. The numbered data 1, 2 and 3 and the user portrait are then taken as input and processed with the collaborative filtering algorithm to obtain an output result, which may include at least one of the numbers 001, 002 and 003. If the output result includes 001, the data 1 corresponding to 001 is determined to be target to-be-trained data. This example is for ease of understanding only and should not be construed as limiting the application.
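The application does not fix a particular collaborative filtering algorithm, so the sketch below only illustrates the numbering step and the shape of the inputs and outputs; the scoring function is a hypothetical placeholder that a real system would replace with an actual collaborative filtering implementation.

```python
# Sketch of the numbering and selection flow of the third embodiment.
# `collaborative_filter` is a placeholder, not a real collaborative filtering algorithm.
def number_samples(dataset: list) -> dict:
    """Assign a unique number ('001', '002', ...) to each candidate sample."""
    return {f"{i + 1:03d}": sample for i, sample in enumerate(dataset)}

def collaborative_filter(numbered: dict, user_portrait: dict, top_k: int = 2) -> list:
    """Placeholder: rank samples by word overlap with the user portrait text."""
    portrait_words = set(" ".join(str(v) for v in user_portrait.values()).lower().split())
    scored = [(number, len(portrait_words & set(text.lower().split())))
              for number, text in numbered.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [number for number, _ in scored[:top_k]]       # output result: at least one number

def select_targets(dataset: list, user_portrait: dict) -> list:
    numbered = number_samples(dataset)
    selected_numbers = collaborative_filter(numbered, user_portrait)
    return [numbered[number] for number in selected_numbers]  # map numbers back to samples
```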
The above three embodiments of screening at least one target to-be-trained data from the pre-acquired to-be-trained data set based on the control text may be implemented separately, or two or three of them may be combined to jointly screen the target to-be-trained data.
Alternatively, the data to be trained that contains more than the preset number threshold of comparison keywords may first be screened from the to-be-trained data set as first intermediate data. The similarity between each piece of first intermediate data and the comparison text can then be calculated, and the target to-be-trained data obtained from the first intermediate data that satisfies the preset similarity condition.
Alternatively, the data to be trained that contains more than the preset number threshold of comparison keywords may first be screened from the to-be-trained data set as first intermediate data. Each piece of first intermediate data is then numbered, and a collaborative filtering calculation is performed on each piece of first intermediate data, its corresponding number and the user portrait using the preset collaborative filtering algorithm to obtain an output result. Finally, the first intermediate data corresponding to each number included in the output result is determined to be target to-be-trained data.
Alternatively, the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set may be calculated first. Each piece of data to be trained that satisfies the preset similarity condition is then numbered, a collaborative filtering calculation is performed on that data, the corresponding numbers and the user portrait using the preset collaborative filtering algorithm to obtain an output result, and finally the data to be trained corresponding to each number included in the output result is determined to be target to-be-trained data.
Alternatively, the data to be trained that contains more than the preset number threshold of comparison keywords may first be screened from the to-be-trained data set as first intermediate data. The similarity between each piece of first intermediate data and the comparison text is then calculated, each piece of first intermediate data that satisfies the preset similarity condition is numbered, and a collaborative filtering calculation is performed on that first intermediate data, the corresponding numbers and the user portrait using the preset collaborative filtering algorithm to obtain an output result. Finally, the first intermediate data corresponding to each number included in the output result is determined to be target to-be-trained data.
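Chaining the helper functions from the sketches above gives one possible form of this last combined variant (keyword screening, then similarity screening, then collaborative filtering); the thresholds are illustrative and the functions are assumed to be in scope.

```python
# One possible chaining of the three screening steps sketched earlier.
# Assumes screen_by_keywords, screen_by_similarity and select_targets are in scope.
def build_training_dataset(comparison_text: str, dataset: list, user_portrait: dict) -> list:
    first_intermediate = screen_by_keywords(comparison_text, dataset, count_threshold=3)
    satisfying = screen_by_similarity(comparison_text, first_intermediate,
                                      similarity_threshold=0.7)
    return select_targets(satisfying, user_portrait)      # the target to-be-trained data
```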
In one embodiment, after at least one target to-be-trained data is screened from a pre-acquired to-be-trained data set based on a comparison text to obtain a training data set, a large language model to be trained can be trained based on the training data set to obtain a trained large language model.
In practical application, the training data set may be obtained from a third party, and at this time, the specific step of the model training method may be to obtain the training data set first, and then train the large language model to be trained based on the training data set, so as to obtain a trained large language model. The obtained training data set is obtained based on the training data screening method.
Because each piece of target to-be-trained data in the obtained training data set conforms to the user portrait, training the large language model on this training data set does not destroy the data distribution. In addition, because the training data set has been screened, compared with the prior art that trains the large language model with all of the acquired data to be trained, the application reduces the amount of data required for training and thereby reduces the training cost.
The specific manner and principles of training a large language model are well known to those skilled in the art, and are not described herein for brevity.
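For completeness, a hedged sketch of fine-tuning a causal language model on the screened training data set is given below, using the Hugging Face transformers and datasets libraries. The base model name, hyperparameters and plain-text data format are assumptions for illustration, not the training setup of the application.

```python
# Hedged sketch: fine-tune a causal language model on the screened training data set.
# Base model, hyperparameters and data format are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

def finetune(training_samples: list, base_model: str = "gpt2",
             output_dir: str = "./screened-llm") -> None:
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained(base_model)

    dataset = Dataset.from_dict({"text": training_samples})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
```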
After the trained large language model is obtained, the trained large language model can be applied to generating text data. Specifically, the text generation method may include the following steps: firstly, acquiring a prompt statement; and then inputting the prompt sentence into the pre-trained large language model to obtain text data output by the large language model.
The prompt sentence used here may or may not take the same concrete form as the prompt sentence generated from the prompt sentence template. It may be input directly by the user, or generated in the manner described above.
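A corresponding sketch of the text generation step follows; the generation parameters and the model directory are illustrative assumptions.

```python
# Sketch: generate text from a prompt sentence with the trained model.
# Generation parameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_text(prompt_sentence: str, model_dir: str = "./screened-llm") -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt_sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```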
Based on the same technical concept, the application also provides a training data screening apparatus, as shown in fig. 2. The training data screening apparatus 100 includes a first processing module 110, a second processing module 120, and a third processing module 130.
The first processing module 110 is configured to obtain a prompt sentence based on a pre-acquired user portrait of the target user and a preset prompt sentence template.
And the second processing module 120 is configured to input the prompt sentence into a large language model to be trained, and obtain a comparison text corresponding to the prompt sentence.
And the third processing module 130 is configured to screen at least one piece of target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, where the training data set includes all of the screened target to-be-trained data.
The third processing module 130 is specifically configured to calculate the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set, and to obtain the target to-be-trained data based on the data to be trained that satisfies the preset similarity condition.
The third processing module 130 is specifically configured to number each data to be trained that meets a preset similarity condition, where each number uniquely corresponds to one data to be trained; carrying out collaborative filtering calculation on the data to be trained meeting the preset similarity condition, the corresponding number and the user portrait by using a preset collaborative filtering algorithm to obtain an output result; wherein the output result includes at least one number; and determining the data to be trained corresponding to each number included in the output result as the target data to be trained.
The third processing module 130 is further configured to encode each piece of data to be trained in the to-be-trained data set based on a general large language model to obtain a training vector corresponding to each piece of data to be trained; to encode the comparison text based on the general large language model to obtain a comparison vector; and to calculate the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained in the to-be-trained data set.
The first processing module 110 is specifically configured to fill the user characteristic data in the user portrait into the corresponding positions in the prompt sentence template to obtain the prompt sentence.
In one embodiment, the user characteristic data includes at least one of user industry, job position, responsibility, and job record.
The training data screening apparatus 100 further includes a training module, configured to, after at least one piece of target to-be-trained data is screened from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, train the large language model to be trained based on that training data set to obtain a trained large language model.
The training data screening apparatus 100 according to the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing training data screening method embodiment, and for brevity, reference may be made to corresponding contents in the foregoing training data screening method embodiment where the apparatus embodiment portion is not mentioned.
Based on the same technical concept, the present application further provides a model training apparatus, as shown in fig. 3, where the model training apparatus 200 includes a first obtaining module 210 and a fourth processing module 220.
The first obtaining module 210 is configured to obtain a training data set, where the training data set is obtained based on the training data screening method described above.
And a fourth processing module 220, configured to train the large language model to be trained based on the training data set, so as to obtain a trained large language model.
The model training device 200 according to the embodiment of the present application has the same implementation principle and the same technical effects as those of the foregoing model training method embodiment, and for brevity, reference may be made to corresponding contents in the foregoing model training method embodiment where the device embodiment is not mentioned.
Based on the same technical concept, the present application also provides a text generating apparatus, and as shown in fig. 4, the text generating apparatus 300 includes a second obtaining module 310 and a fifth processing module 320.
A second obtaining module 310, configured to obtain the prompt sentence.
And a fifth processing module 320, configured to input the prompt sentence into a pre-trained large language model, and obtain text data output by the large language model, where the large language model is obtained by training based on the foregoing model training method.
The text generating device 300 according to the embodiment of the present application has the same implementation principle and technical effects as those of the foregoing text generating method embodiment, and for brevity, reference may be made to corresponding contents in the foregoing text generating method embodiment for a part of the description of the device embodiment that is not mentioned.
Please refer to fig. 5, which illustrates an electronic device 400 according to an embodiment of the present application. The electronic device 400 includes: processor 410, memory 420.
The memory 420 and the processor 410 are electrically connected directly or indirectly to each other to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 420 is used for storing computer programs, such as the software functional modules shown in fig. 2, 3 and 4, that is, the training data screening apparatus 100, the model training apparatus 200 and the text generating apparatus 300. Each apparatus comprises at least one software functional module that may be stored in the memory 420 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 400.
The processor 410 is configured to execute executable modules stored in the memory 420, such as software functional modules or computer programs included in the training data screening apparatus 100. At this time, the processor 410 is configured to obtain a prompt sentence based on a user portrait of the target user and a preset prompt sentence template; inputting the prompt sentences into a large language model to be trained, and obtaining comparison texts corresponding to the prompt sentences; and screening at least one target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all the target to-be-trained data.
The processor 410 is configured to execute executable modules stored in the memory 420, such as software functional modules or computer programs included in the model training apparatus 200. At this time, the processor 410 is configured to obtain a training data set, where the training data set is obtained based on the foregoing training data screening method; and training the large language model to be trained based on the training data set to obtain a trained large language model.
The processor 410 is configured to execute executable modules stored in the memory 420, such as software functional modules or computer programs included in the text generating apparatus 300. At this time, the processor 410 is configured to obtain a prompt sentence, and to input the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is obtained by training based on the model training method.
The memory 420 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The processor 410 may be an integrated circuit chip having signal processing capabilities. It may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor 410 may be any conventional processor or the like.
The electronic device 400 includes, but is not limited to, a personal computer, a server, and the like.
The embodiment of the present application further provides a computer readable storage medium (hereinafter referred to as a storage medium) storing a computer program, where when the computer program is executed by a computer such as the electronic device 400, at least one of the training data screening method, the model training method, and the text generating method described above is executed. The computer-readable storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A training data screening method, comprising:
obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template;
inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence;
and screening at least one target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all the target to-be-trained data.
2. The method of claim 1, wherein screening at least one target to-be-trained data from a pre-acquired to-be-trained data set based on the comparison text comprises:
calculating the similarity between the comparison text and each piece of data to be trained in the data set to be trained;
and obtaining the target data to be trained based on the data to be trained meeting the preset similarity condition.
3. The method according to claim 2, wherein obtaining the target data to be trained based on the data to be trained that satisfies a preset similarity condition comprises:
each piece of data to be trained meeting the preset similarity condition is numbered respectively, wherein each number uniquely corresponds to one piece of data to be trained;
carrying out collaborative filtering calculation on the data to be trained meeting the preset similarity condition, the corresponding number and the user portrait by using a preset collaborative filtering algorithm to obtain an output result; wherein the output result includes at least one number;
and determining the data to be trained corresponding to each number included in the output result as the target data to be trained.
4. The method of claim 2, wherein prior to calculating the similarity between the comparison text and each piece of data to be trained in the data set to be trained, the method further comprises:
coding each piece of data to be trained in the data set to be trained based on a general large language model to obtain a training vector corresponding to each piece of data to be trained;
correspondingly, calculating the similarity between the comparison text and each piece of data to be trained in the data set to be trained comprises the following steps:
coding the comparison text based on a general large language model to obtain a comparison vector;
and calculating the similarity between the comparison vector and each training vector to obtain the similarity between the comparison text and each piece of data to be trained in the data set to be trained.
5. The method of claim 1, wherein obtaining the prompt sentence based on the pre-acquired user portrait of the target user and the preset prompt sentence template comprises:
and filling the user characteristic data in the user portrait into the corresponding position in the prompt sentence template to obtain the prompt sentence.
6. The method of claim 5, wherein the user characteristic data comprises at least one of user industry, job title, responsibility, job record.
7. The method according to any one of claims 1-6, wherein after screening at least one target data to be trained from a pre-acquired set of data to be trained based on the comparison text, the method further comprises:
and training the large language model to be trained based on the training data set to obtain a trained large language model.
8. A method of model training, comprising:
obtaining a training dataset, wherein the training dataset is derived based on the method of any of claims 1-7;
and training the large language model to be trained based on the training data set to obtain a trained large language model.
9. A text generation method, comprising:
acquiring a prompt sentence;
inputting the prompt sentence into a pre-trained large language model to obtain text data output by the large language model, wherein the large language model is trained based on the method as claimed in claim 8.
10. A training data screening apparatus comprising:
the first processing module is used for obtaining a prompt sentence based on a user portrait of a target user and a preset prompt sentence template;
the second processing module is used for inputting the prompt sentence into a large language model to be trained to obtain a comparison text corresponding to the prompt sentence;
and the third processing module is used for screening at least one target to-be-trained data from the pre-acquired to-be-trained data set based on the comparison text to obtain a training data set, wherein the training data set comprises all screened target to-be-trained data.
11. A model training device, comprising:
a first acquisition module for acquiring a training data set, wherein the training data set is obtained based on the method of any one of claims 1-7;
and the fourth processing module is used for training the large language model to be trained based on the training data set to obtain a trained large language model.
12. A text generating apparatus, comprising:
the second acquisition module is used for acquiring the prompt sentence;
and a fifth processing module, configured to input the prompt sentence into a pre-trained large language model, to obtain text data output by the large language model, where the large language model is obtained by training based on the method according to claim 8.
13. An electronic device, comprising: the device comprises a memory and a processor, wherein the memory is connected with the processor;
the memory is used for storing programs;
the processor being adapted to invoke a program stored in the memory for performing the method according to any of claims 1-9.
14. A computer-readable storage medium, on which a computer program is stored, which, when being run by a computer, performs the method according to any one of claims 1-9.
CN202311213090.8A 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method Active CN116955590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311213090.8A CN116955590B (en) 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311213090.8A CN116955590B (en) 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method

Publications (2)

Publication Number Publication Date
CN116955590A true CN116955590A (en) 2023-10-27
CN116955590B CN116955590B (en) 2023-12-08

Family

ID=88451472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311213090.8A Active CN116955590B (en) 2023-09-20 2023-09-20 Training data screening method, model training method and text generation method

Country Status (1)

Country Link
CN (1) CN116955590B (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919661A (en) * 2017-02-13 2017-07-04 腾讯科技(深圳)有限公司 A kind of affective style recognition methods and relevant apparatus
CN110610705A (en) * 2019-09-20 2019-12-24 上海数鸣人工智能科技有限公司 Voice interaction prompter based on artificial intelligence
CN112818082A (en) * 2019-11-15 2021-05-18 北京沃东天骏信息技术有限公司 Evaluation text pushing method and device
CN112560996A (en) * 2020-12-24 2021-03-26 北京百度网讯科技有限公司 User portrait recognition model training method, device, readable storage medium and product
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model
CN115081501A (en) * 2021-03-15 2022-09-20 中国电信股份有限公司 User classification method and device, cascaded user classification model and equipment
CN113378970A (en) * 2021-06-28 2021-09-10 平安普惠企业管理有限公司 Sentence similarity detection method and device, electronic equipment and storage medium
US20230042683A1 (en) * 2021-08-04 2023-02-09 International Business Machines Corporation Identifying and transforming text difficult to understand by user
CN113807809A (en) * 2021-08-24 2021-12-17 姚玲 Method for constructing audit user portrait based on machine learning technology
KR102442435B1 (en) * 2021-11-11 2022-09-14 주식회사 파블로아트컴퍼니 An artificial intelligence system capable of collecting interactive picture data for building artificial intelligence data, evaluating art competency, and predicting academic achievement
CN114757176A (en) * 2022-05-24 2022-07-15 上海弘玑信息技术有限公司 Method for obtaining target intention recognition model and intention recognition method
CN116127049A (en) * 2023-04-17 2023-05-16 平安银行股份有限公司 Model training method, text generation method, terminal device and computer medium
CN116187324A (en) * 2023-04-28 2023-05-30 西湖大学 Method, system and medium for generating cross-language abstract for long text of source language
CN116821287A (en) * 2023-08-28 2023-09-29 湖南创星科技股份有限公司 Knowledge graph and large language model-based user psychological portrait system and method
CN116823410A (en) * 2023-08-29 2023-09-29 阿里巴巴(成都)软件技术有限公司 Data processing method, object processing method, recommending method and computing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Guoqiao et al., "基于TF-IDF的社交电商文本信息分类研究" [Research on Social E-commerce Text Information Classification Based on TF-IDF], 《网络空间安全》 [Cyberspace Security], pages 32-38 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725414A (en) * 2023-12-13 2024-03-19 北京海泰方圆科技股份有限公司 Training content generation model method, device and equipment for determining output content

Also Published As

Publication number Publication date
CN116955590B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
US10311099B2 (en) Method and system for 3D model database retrieval
CN113051371B (en) Chinese machine reading understanding method and device, electronic equipment and storage medium
CN116955590B (en) Training data screening method, model training method and text generation method
CN108897852B (en) Method, device and equipment for judging continuity of conversation content
CN110866391A (en) Title generation method, title generation device, computer readable storage medium and computer equipment
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN115114919A (en) Method and device for presenting prompt information and storage medium
CN111178064A (en) Information pushing method and device based on field word segmentation processing and computer equipment
CN117197271A (en) Image generation method, device, electronic equipment and storage medium
CN114330251A (en) Text generation method, model training method, device and storage medium
CN111523301B (en) Contract document compliance checking method and device
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN117787226A (en) Label generation model training method and device, electronic equipment and storage medium
CN112527967A (en) Text matching method, device, terminal and storage medium
CN112668325A (en) Machine translation enhancing method, system, terminal and storage medium
CN117173269A (en) Face image generation method and device, electronic equipment and storage medium
CN116127049A (en) Model training method, text generation method, terminal device and computer medium
CN113704452B (en) Data recommendation method, device, equipment and medium based on Bert model
CN112528646B (en) Word vector generation method, terminal device and computer-readable storage medium
CN113204629A (en) Text matching method and device, computer equipment and readable storage medium
CN111767395A (en) Abstract generation method and system based on picture
CN114091662B (en) Text image generation method and device and electronic equipment
CN113268997B (en) Text translation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Training data filtering methods, model training methods, and text generation methods

Granted publication date: 20231208

Pledgee: Shanghai Pudong Development Bank Co.,Ltd. Chengdu Branch

Pledgor: CHENGDU MINTO TECHNOLOGY CO.,LTD.

Registration number: Y2024980021746