CN117113083A - Sample data generation method and training method of large language model - Google Patents

Sample data generation method and training method of large language model

Info

Publication number
CN117113083A
CN117113083A (application number CN202311042082.1A)
Authority
CN
China
Prior art keywords
abstract
language model
target text
text
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311042082.1A
Other languages
Chinese (zh)
Inventor
路香菊
闫贺
朱俊敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202311042082.1A priority Critical patent/CN117113083A/en
Publication of CN117113083A publication Critical patent/CN117113083A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
              • G06F 16/34: Browsing; Visualisation therefor
                • G06F 16/345: Summarisation for human users
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/213: Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
                  • G06F 18/2132: Feature extraction based on discrimination criteria, e.g. discriminant analysis
                • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
                  • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
              • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the application relate to a sample data generation method and a large language model training method, which comprise the following steps: acquiring a target text; extracting characters from the target text to obtain extracted characters corresponding to the target text; inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of a text; determining whether the first abstract meets a preset abstract condition; and, if the first abstract meets the preset abstract condition, generating sample data based on the target text and the first abstract, wherein the sample data is used for fine-tuning a third large language model, and the third large language model is used for extracting the abstract of a text. In this way, by extracting characters from a text containing a large number of characters, token expansion is achieved, and the efficiency of obtaining sample data for a large language model used to extract text abstracts can be improved.

Description

Sample data generation method and training method of large language model
Technical Field
The application relates to the technical field of computers, in particular to a sample data generation method and a large language model training method.
Background
Large language models (Large Language Model, LLM) are deep learning models capable of processing large amounts of natural language data, and have shown great potential in many fields such as natural language processing, text generation, and machine translation.
However, in some situations, labeling data is difficult in the prior art. For example, when extracting the abstract of a long text, the long text cannot be used directly as input to a large language model because of the token limit of some large language models, so sample data for a large language model that extracts text abstracts is difficult to obtain, or is obtained inefficiently.
Disclosure of Invention
In view of this, in order to solve some or all of the above-mentioned technical problems, an embodiment of the present application provides a method for generating sample data and a training method for a large language model.
In a first aspect, an embodiment of the present application provides a method for generating sample data, where the method includes:
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of a text;
determining whether the first abstract meets a preset abstract condition;
and generating sample data based on the target text and the first abstract if the first abstract meets the preset abstract condition, wherein the sample data is used for fine-tuning a third large language model, and the third large language model is used for extracting the abstract of a text.
In one possible implementation, inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text includes:
the following abstract extraction steps are performed:
selecting a first large language model which is not selected from a large language model set trained in advance;
inputting the extracted characters into the first large language model to obtain a first abstract of the target text; and
After determining whether the first abstract meets the preset abstract condition, the method further comprises:
executing the abstract extraction step again if the first abstract does not meet the preset abstract condition.
In one possible implementation, extracting characters from the target text to obtain the extracted characters corresponding to the target text includes:
the following character extraction steps are performed:
randomly extracting a second number of characters from the target text to obtain extracted characters corresponding to the target text; and
after determining whether the first abstract meets the preset abstract condition, the method further comprises:
executing the character extraction step again if the first abstract does not meet the preset abstract condition.
In one possible implementation manner, after the target text is obtained, the method further includes:
dividing the target text into a text segment set;
for each text segment in the text segment set, inputting the text segment into a pre-trained second large language model to obtain an abstract of the text segment, wherein the second large language model is used for extracting the abstract of a text segment;
merging the abstracts of the obtained text segments to obtain a second abstract;
sample data is generated based on the target text and the second summary.
In a second aspect, an embodiment of the present application provides a training method for a large language model, where the method includes:
acquiring a training sample set, wherein training samples in the training sample set comprise sample data generated by any of the above sample data generation methods;
and performing instruction fine-tuning training on a pre-trained third large language model using the training sample set to obtain a trained third large language model.
In one possible embodiment of the present application,
the third large language model comprises a Transformer sub-layer, and the input data of the Transformer sub-layer is normalized using an RMSNorm normalization function; and/or
The activation function of the third large language model is a SwiGLU activation function.
In a third aspect, an embodiment of the present application provides a method for extracting a digest, where the method includes:
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained using any of the above training methods for a large language model, and the third large language model is used for extracting the abstract of a text.
In a fourth aspect, an embodiment of the present application provides a generating apparatus for sample data, the apparatus including:
a first obtaining unit, configured to obtain a target text, where the number of characters included in the target text is greater than or equal to a first number;
the first extraction unit is used for extracting characters from the target text to obtain extracted characters corresponding to the target text;
the first input unit is used for inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of a text;
the determining unit is used for determining whether the first abstract meets a preset abstract condition;
the first generation unit is used for generating sample data based on the target text and the first abstract under the condition that the first abstract meets the preset abstract condition, wherein the sample data is used for fine tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
In one possible implementation, inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text includes:
the following abstract extraction steps are performed:
selecting a first large language model which is not selected from a large language model set trained in advance;
inputting the extracted characters into the first large language model to obtain a first abstract of the target text; and
after determining whether the first abstract meets the preset abstract condition, the device further comprises:
and the first execution unit is used for executing the abstract extraction step if the first abstract does not meet the preset abstract condition.
In one possible implementation, extracting characters from the target text to obtain the extracted characters corresponding to the target text includes:
the following character extraction steps are performed:
randomly extracting a second number of characters from the target text to obtain extracted characters corresponding to the target text; and
after determining whether the first abstract meets the preset abstract condition, the device further comprises:
and the second execution unit is used for executing the character extraction step if the first abstract does not meet the preset abstract condition.
In one possible implementation manner, after the target text is obtained, the apparatus further includes:
the dividing unit is used for dividing the target text into a text segment set;
the second input unit is used for inputting, for each text segment in the text segment set, the text segment into a pre-trained second large language model to obtain an abstract of the text segment, wherein the second large language model is used for extracting the abstract of a text segment;
the merging unit is used for merging the abstracts of the obtained text segments to obtain a second abstract;
and the second generation unit is used for generating sample data based on the target text and the second abstract.
In a fifth aspect, an embodiment of the present application provides a training apparatus for a large language model, the apparatus including:
the second acquisition unit is used for acquiring a training sample set, wherein training samples in the training sample set comprise sample data generated by any of the above sample data generation methods;
and the training unit is used for performing instruction fine-tuning training on a pre-trained third large language model using the training sample set to obtain a trained third large language model.
In one possible embodiment of the present application,
the third large language model comprises a transform sub-layer, and input data of the transform sub-layer is normalized by adopting an RMSNorm normalization function; and/or
The activation function of the third large language model is a SwiGLU activation function.
In a sixth aspect, an embodiment of the present application provides a summary extracting apparatus, including:
a third obtaining unit, configured to obtain a target text, where the number of characters included in the target text is greater than or equal to the first number;
the second extraction unit is used for extracting characters from the target text to obtain extracted characters corresponding to the target text;
and the third input unit is used for inputting the extracted characters into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained using any of the above training methods for a large language model, and the third large language model is used for extracting the abstract of a text.
In a seventh aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing a computer program;
a processor, configured to execute a computer program stored in the memory, where the computer program is executed to implement a method according to any embodiment of the method for generating sample data according to the first aspect of the present application, or implement a method according to any embodiment of the training method for a large language model according to the second aspect of the present application, or implement a method according to any embodiment of the method for extracting a digest according to the third aspect of the present application.
In an eighth aspect, an embodiment of the present application provides a computer readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method according to any one of the embodiments of the method for generating sample data according to the first aspect of the present application, or implements a method according to any one of the embodiments of the training method for a large language model according to the second aspect of the present application, or implements a method according to any one of the embodiments of the method for extracting a digest according to the third aspect of the present application.
In a ninth aspect, an embodiment of the present application provides a computer program, where the computer program includes computer readable code, where the computer readable code when executed on a device causes a processor in the device to implement a method according to any embodiment of the method for generating sample data according to the first aspect of the present application, or implement a method according to any embodiment of the training method for a large language model according to the second aspect of the present application, or implement a method according to any embodiment of the method for extracting a digest according to the third aspect of the present application.
According to the sample data generation method provided by the embodiments of the application, a target text whose number of characters is greater than or equal to a first number can be acquired; characters are then extracted from the target text to obtain extracted characters corresponding to the target text; the extracted characters are input into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of a text; whether the first abstract meets a preset abstract condition is then determined; and, if the first abstract meets the preset abstract condition, sample data is generated based on the target text and the first abstract, wherein the sample data is used for fine-tuning a third large language model that extracts the abstract of a text. In this way, by extracting characters from a text containing a large number of characters, token expansion is achieved, and the efficiency of obtaining sample data for a large language model used to extract text abstracts can be improved.
According to the training method for a large language model provided by the embodiments of the application, a training sample set can be acquired, wherein the training samples in the training sample set comprise sample data generated by any of the above sample data generation methods, and the training sample set is then used to perform instruction fine-tuning training on a pre-trained third large language model to obtain a trained third large language model. In this way, a token-expanded large language model can be used to extract the abstract of a long text, so the model's reasoning and semantic-understanding capabilities are preserved, and because industry data is added, the model performs more robustly when applied to a specific scenario.
According to the abstract extraction method provided by the embodiments of the application, a target text whose number of characters is greater than or equal to a first number can be acquired; characters are then extracted from the target text to obtain extracted characters corresponding to the target text; and the extracted characters are input into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained using any of the above training methods for a large language model and is used for extracting the abstract of a text. In this way, a token-expanded large language model can be used to extract the abstract of a long text, which improves the efficiency of extracting abstracts of long texts.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
Fig. 1 is a flow chart of a method for generating sample data according to an embodiment of the present application;
Fig. 2 is a flowchart illustrating another method for generating sample data according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of a training method of a large language model according to an embodiment of the present application;
Fig. 4 is a flowchart of a training method of a large language model according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of a summary extracting method according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a device for generating sample data according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a training device for a large language model according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a summary extracting device according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments of the application will now be described in detail with reference to the accompanying drawings, it being apparent that the described embodiments are some, but not all embodiments of the application. It should be noted that: the relative arrangement of the parts and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
It will be appreciated by those skilled in the art that terms such as "first," "second," and the like in the embodiments of the present application are used merely to distinguish between different steps, devices or modules and the like, and do not represent any particular technical meaning or logical sequence therebetween.
It should also be understood that in this embodiment, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in an embodiment of the application may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in the present application merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In the present application, the character "/" generally indicates that the related objects before and after it are in an "or" relationship.
It should also be understood that the description of the embodiments of the present application emphasizes the differences between the embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. For an understanding of embodiments of the present application, the present application will be described in detail below with reference to the drawings in conjunction with the embodiments. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
To address the problems in the prior art of how to improve the efficiency of acquiring sample data and reduce its acquisition cost, the application provides a sample data generation method that improves the acquisition efficiency of sample data and reduces its acquisition cost.
Fig. 1 is a flow chart of a method for generating sample data according to an embodiment of the present application. The method can be applied to one or more electronic devices such as smart phones, notebook computers, desktop computers, portable computers, and servers. The execution subject of the method may be hardware or software. When the execution subject is hardware, it may be one or more of the above electronic devices; for example, a single electronic device may perform the method, or a plurality of electronic devices may cooperate with one another to perform the method. When the execution subject is software, the method may be implemented as a plurality of pieces of software or software modules, or as a single piece of software or software module. The present application is not particularly limited in this respect.
As shown in fig. 1, the method specifically includes:
and 101, acquiring target texts, wherein the number of characters contained in the target texts is greater than or equal to the first number.
In this embodiment, the target text may be any text. As an example, the target text may be a script, paper, or the like.
The target text may contain a large number of characters.
As an example, the first number may be an integer greater than 10,000. For example, the first number may be in the tens of thousands or hundreds of thousands.
Step 102, extracting characters from the target text to obtain extracted characters corresponding to the target text.
In this embodiment, the extracted character corresponding to the target text may be a character extracted from the target text.
As an example, characters may be randomly extracted from the target text, thereby obtaining extracted characters corresponding to the target text.
As yet another example, characters may also be extracted from the target text according to a policy, so as to obtain extracted characters corresponding to the target text.
As yet another example, a target extraction policy may be selected from a preset extraction policy set, and then a character is extracted from the target text by using the target extraction policy, so as to obtain an extracted character corresponding to the target text.
The target extraction strategy can be determined by the following method:
firstly, determining unselected extraction strategies in a preset extraction strategy set to obtain an extraction strategy subset.
Then, for each extraction strategy in the extraction strategy subset, determine the frequency with which the abstracts obtained by inputting the characters extracted by that strategy into the first large language model meet the preset abstract condition.
Then, the extraction strategy with the highest frequency is determined as the target extraction strategy.
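A minimal sketch of this frequency-based selection is given below; the helper names and the in-memory history of past outcomes are assumptions for illustration, since the embodiment only describes the selection rule itself.

```python
def pick_target_strategy(strategies, already_selected, history):
    """Pick the unselected strategy whose past abstracts most often met the preset condition.

    `history` maps each strategy to a list of booleans (True means the abstract produced
    from that strategy's extracted characters met the preset abstract condition).
    All names here are illustrative placeholders.
    """
    candidates = [s for s in strategies if s not in already_selected]
    if not candidates:
        return None

    def success_rate(strategy):
        outcomes = history.get(strategy, [])
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    return max(candidates, key=success_rate)
```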
Step 103, inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of the text.
In this embodiment, the first abstract may be an abstract of the target text obtained via a first large language model.
The first large language model may be any pre-trained large language model. For example, the first large language model may be a large language model that has been pre-trained but not fine-tuned, or a large language model that has been both pre-trained and fine-tuned.
Here, in some cases, the abstracts extracted by the first large language model (including the large language models in the subsequent large language model set) are not required to have high accuracy. In other words, the accuracy of the abstracts extracted by the first large language model (including the large language models in the subsequent large language model set) may be less than or equal to a preset accuracy threshold.
Step 104, determining whether the first abstract meets a preset abstract condition.
In this embodiment, the preset abstract condition may be a predetermined condition for determining whether the extracted abstract meets expectations. For example, different preset abstract conditions may be set for different target texts. As an example, the preset abstract condition may include at least one of: whether keywords preset for the target text appear, and whether the abstract belongs to a domain preset for the target text.
Here, a discriminative model may be used to determine whether the first abstract meets the preset abstract condition. The discriminative model may include, but is not limited to, the following models: logistic regression, naive Bayes, decision trees, support vector machines, random forests, gradient boosting trees, and the like.
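As an illustration only, such a discriminative check could be trained on simple features of candidate abstracts; the feature design and the scikit-learn classifier below are assumptions, not part of the embodiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def abstract_features(abstract, keywords):
    # Toy features: keyword coverage and a length term; real features would be richer.
    hits = sum(1 for k in keywords if k in abstract)
    return [hits / max(len(keywords), 1), min(len(abstract) / 1000.0, 1.0)]

def fit_condition_checker(abstracts, labels, keywords):
    """abstracts: previously judged abstracts; labels: 1 if an abstract met expectations."""
    X = np.array([abstract_features(a, keywords) for a in abstracts])
    y = np.array(labels)
    return LogisticRegression().fit(X, y)
```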
In some optional implementations of this embodiment, the following manner may be further adopted to determine whether the first summary meets a preset summary condition:
and determining whether the first abstract meets the preset abstract conditions or not through the first large language model.
Here, in addition to outputting an abstract (for example, the first abstract described above), the first large language model may also output a probability that the abstract meets the preset abstract condition. Therefore, whether the first abstract meets the preset abstract condition can be determined through the first large language model.
It can be appreciated that in the above alternative implementation, whether the first abstract meets expectations can be evaluated automatically by the first large language model, which improves the efficiency and accuracy of judging whether the first abstract meets the preset abstract condition.
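One way to realize this self-evaluation is to prompt the model to return both the abstract and a compliance probability; the prompt wording and the generic `chat` callable below are assumptions for illustration.

```python
import json

def summarize_and_score(chat, extracted_chars):
    """`chat` stands for any callable that sends a prompt to the first large language model."""
    prompt = (
        "Summarize the following text, then rate from 0 to 1 how well the summary "
        "keeps the preset keywords and stays in the preset domain. "
        'Answer as JSON with keys "summary" and "probability".\n\n' + extracted_chars
    )
    reply = chat(prompt)
    result = json.loads(reply)
    return result["summary"], float(result["probability"])
```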
Step 105, generating sample data based on the target text and the first abstract under the condition that the first abstract meets the preset abstract condition, wherein the sample data is used for fine-tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
In this embodiment, when the first summary meets the preset summary condition, the target text and the first summary may be determined as sample data; alternatively, the extracted characters corresponding to the target text and the first abstract may be determined as sample data.
In some optional implementations of this embodiment, the extracting characters from the target text may be performed in the following manner, so as to obtain extracted characters corresponding to the target text:
the following character extraction steps are performed: randomly extracting a second number of characters from the target text to obtain extracted characters corresponding to the target text.
Wherein the second number may be smaller than the first number.
In some cases, the number of characters the first large language model accepts as input is less than the first number and greater than or equal to the second number.
On this basis, after determining whether the first digest meets a preset digest condition, the character extraction step may be further performed in a case where the first digest does not meet the preset digest condition.
Further, after the character extraction step described above is performed, the steps 103 to 105 described above may also be performed.
It may be appreciated that in the above alternative implementation manner, in the case that the first abstract does not meet the preset abstract condition, the second number of characters may be randomly extracted from the target text again until the first abstract obtained after inputting the extracted characters into the first large language model trained in advance meets the preset abstract condition.
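Putting steps 102 to 105 together, the resample-until-accepted loop can be sketched as follows; `summarize` and `meets_condition` are placeholders for the first large language model and the preset abstract condition, and the cap on attempts is an added safeguard not stated in the embodiment.

```python
import random

def build_sample(target_text, second_number, summarize, meets_condition, max_tries=10):
    """Randomly re-extract characters until the resulting first abstract is accepted."""
    for _ in range(max_tries):
        # Character extraction step: draw `second_number` character positions at random,
        # keeping their original order in the target text.
        positions = sorted(random.sample(range(len(target_text)), second_number))
        extracted = "".join(target_text[i] for i in positions)
        # Step 103: obtain the first abstract via the first large language model.
        first_abstract = summarize(extracted)
        # Steps 104 and 105: keep the pair only if the abstract meets the preset condition.
        if meets_condition(first_abstract):
            return {"text": target_text, "abstract": first_abstract}
    return None
```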
In some optional implementations of the present embodiment, after the target text is obtained, the following steps may be further performed:
first, the target text is divided into a set of text segments.
As an example, the target text may be divided into a set of text segments by paragraphs. Alternatively, the target text may be divided into a set of text segments by the number of words.
Then, for each text segment in the text segment set, the text segment is input into a pre-trained second large language model to obtain an abstract of the text segment, wherein the second large language model is used for extracting the abstract of a text segment.
The second large language model can be any pre-trained large language model for extracting the abstract of a text segment.
And then merging the abstracts of the obtained text segments to obtain a second abstract.
Here, the second summary may be a summary generated by combining the summaries of the respective obtained text segments.
Here, the abstracts of the obtained text segments may be merged using a fourth large language model to obtain the second abstract. Alternatively, the abstracts of the obtained text segments may be directly concatenated to obtain the second abstract.
Finally, sample data is generated based on the target text and the second summary.
Here, the target text and the second digest may be determined as sample data; alternatively, the extracted characters corresponding to the target text and the second abstract may be determined as sample data.
It can be understood that in the above alternative implementation, a long text can be divided, an abstract can be determined for each resulting text segment, and these abstracts can then be merged to obtain sample data, which improves the efficiency of obtaining sample data.
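A compact sketch of this alternative path is shown below, assuming a fixed character-count split and plain concatenation of the per-segment abstracts (the embodiment also allows merging them with a fourth large language model); `segment_summarizer` is a placeholder for the second large language model.

```python
def summarize_by_segments(target_text, segment_summarizer, segment_len=2000):
    """Split the long text, summarize each segment, and join the partial abstracts."""
    segments = [target_text[i:i + segment_len]
                for i in range(0, len(target_text), segment_len)]
    partial = [segment_summarizer(seg) for seg in segments]  # second large language model
    second_abstract = "".join(partial)                        # direct concatenation variant
    return {"text": target_text, "abstract": second_abstract}
```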
According to the sample data generation method provided by the embodiments of the application, a target text whose number of characters is greater than or equal to a first number can be acquired; characters are then extracted from the target text to obtain extracted characters corresponding to the target text; the extracted characters are input into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of a text; whether the first abstract meets a preset abstract condition is then determined; and, if the first abstract meets the preset abstract condition, sample data is generated based on the target text and the first abstract, wherein the sample data is used for fine-tuning a third large language model that extracts the abstract of a text. In this way, by extracting characters from a text containing a large number of characters, token expansion is achieved, and the efficiency of obtaining sample data for a large language model used to extract text abstracts can be improved.
Fig. 2 is a flowchart of another method for generating sample data according to an embodiment of the present application. As shown in fig. 2, the method specifically includes:
Step 201, acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number. Thereafter, proceed to step 202.
In this embodiment, step 201 is substantially identical to step 101 in the corresponding embodiment of fig. 1, and will not be described herein.
Step 202, extracting characters from the target text to obtain extracted characters corresponding to the target text. Thereafter, proceed to step 203.
In this embodiment, step 202 is substantially identical to step 102 in the corresponding embodiment of fig. 1, and will not be described here again.
Step 203, selecting a first large language model that has not yet been selected from a pre-trained large language model set, wherein the first large language model is used for extracting the abstract of a text. Thereafter, proceed to step 204.
In this embodiment, the large language model set may include a plurality of pre-trained large language models. For example, a large language model in the large language model set may be a large language model that has been pre-trained but not fine-tuned, or one that has been both pre-trained and fine-tuned. The large language model set may include large language models such as GPT-4 (Generative Pre-trained Transformer 4), among others.
Here, when the first large language model is selected from the large language model set for the first time, the selection may be random; when it is not the first selection, a model may be selected at random from among the large language models that have not yet been selected.
In some cases, the first large language model may be selected from the large language model set according to parameters such as the number of training iterations and the training duration of each large language model in the set. For example, the first large language model may be selected in descending order of the number of training iterations. For another example, it may be selected in descending order of training duration. For another example, training duration and the number of training iterations may be considered together when selecting the first large language model from the large language model set.
Here, because a large language model with more training iterations and a longer training duration generally has a higher probability of producing a result that meets the preset condition, selecting models in this way can improve the efficiency of obtaining accurate sample data.
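The rotation over the model set in steps 203 to 205 can be sketched as follows; the ordering of the models and the `summarize_with` and `meets_condition` callables are illustrative assumptions.

```python
def summarize_with_model_set(models, extracted_chars, summarize_with, meets_condition):
    """Try pre-trained models in order until a first abstract meets the preset condition.

    `models` is assumed to be ordered, e.g. by number of training iterations or by
    training duration, so that presumably stronger models are tried first (step 203).
    """
    for model in models:
        first_abstract = summarize_with(model, extracted_chars)  # step 204
        if meets_condition(first_abstract):                       # step 205
            return model, first_abstract
    return None, None  # no model in the set produced an acceptable abstract
```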
Step 204, inputting the extracted characters into the first large language model to obtain a first abstract of the target text. Thereafter, proceed to step 205.
In this embodiment, step 204 is substantially identical to step 103 in the corresponding embodiment of fig. 1, and will not be described herein.
Step 205, determining whether the first abstract meets a preset abstract condition. If yes, proceed to step 206; if not, return to step 203.
In this embodiment, step 205 is substantially identical to step 104 in the corresponding embodiment of fig. 1, and will not be described herein.
Step 206, generating sample data based on the target text and the first abstract, wherein the sample data is used for fine-tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
In this embodiment, step 206 is substantially identical to step 105 in the corresponding embodiment of fig. 1, and will not be described herein.
It should be noted that, in addition to the above descriptions, the present embodiment may further include the corresponding technical features described in the embodiment corresponding to fig. 1, so as to achieve the technical effects of the method for generating sample data shown in fig. 1, and the detailed description with reference to fig. 1 is omitted herein for brevity.
The sample data generation method provided by this embodiment of the application can repeatedly re-determine the first abstract using other large language models when the first abstract does not meet the preset abstract condition, until a first abstract that meets the preset abstract condition is obtained. This overcomes, to a certain extent, the low accuracy of the first abstract caused by the low stability and poor controllability of a single large language model.
Fig. 3 is a schematic flow chart of a training method of a large language model according to an embodiment of the present application.
Specifically, as shown in fig. 3, the method specifically includes:
Step 301, obtaining a training sample set, wherein training samples in the training sample set include sample data generated using any of the sample data generation methods described above.
In this embodiment, in addition to sample data generated by any of the sample data generation methods described above, the training sample set may, in some optional cases, also include sample data generated in other ways.
Each training sample in the training sample set comprises a text and an abstract of that text.
Step 302, performing instruction fine-tuning training on the pre-trained third large language model using the training sample set to obtain the trained third large language model.
In this embodiment, the third large language model may be any large language model that is pre-trained but not fine-tuned.
Here, a machine learning algorithm may be used to perform instruction fine-tuning on the pre-trained third large language model based on the training sample set to obtain the trained third large language model.
In some optional implementations of this embodiment, the third large language model includes a Transformer sub-layer, and the input data of the Transformer sub-layer is normalized using an RMSNorm normalization function.
It can be appreciated that in the above alternative implementation, normalizing the input data of each Transformer sub-layer with the RMSNorm normalization function realizes pre-normalization of the data, thereby improving training stability.
In some optional implementations of this embodiment, the activation function of the third large language model is a SwiGLU activation function.
It will be appreciated that in the alternative implementation described above, the SwiGLU activation function may be introduced instead of the ReLU nonlinear function, which may improve the performance of the third large language model.
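For reference, a minimal PyTorch rendering of these two choices (RMSNorm applied to sub-layer inputs and a SwiGLU feed-forward block) is given below; the hidden sizes are placeholders and this is not the patent's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization applied to each Transformer sub-layer input."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block using the SwiGLU activation in place of ReLU."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```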
It should be noted that, in addition to the above descriptions, the present embodiment may further include the technical features described in the above embodiments, so as to achieve the technical effects of the method for generating sample data shown above, and the detailed description is referred to above, and is omitted herein for brevity.
According to the training method for a large language model provided by the embodiments of the application, a training sample set can be acquired, wherein the training samples in the training sample set comprise sample data generated by any of the above sample data generation methods, and the training sample set is then used to perform instruction fine-tuning training on a pre-trained third large language model to obtain a trained third large language model. In this way, a token-expanded large language model can be used to extract the abstract of a long text, so the model's reasoning and semantic-understanding capabilities are preserved, and because industry data is added, the model performs more robustly when applied to a specific scenario.
The following description is given by way of example of the embodiments of the present application, but it should be noted that the embodiments of the present application may have the features described below, and the following description does not limit the scope of protection of the embodiments of the present application.
In the prior art, the original script outline is often the most fundamental material for evaluating a script. However, because an outline tells a long story, carefully reading it during the actual evaluation of an original script takes a long time and is labor-intensive. How to extract an outline abstract reasonably and with high quality, while preserving the main storyline and the key development of the characters, is therefore both meaningful and challenging. For token-constrained generative large language models, processing such very long text raises new challenges, mainly the following problems:
a) First, an outline is often a very long text, ranging from tens of thousands to hundreds of thousands of words, which exceeds the input length limit of a generative large language model.
b) Second, the subject matter of an outline is unrestricted and belongs to the artistic domain, so there is no fixed paradigm, and outlines of different genres and subjects differ greatly; for example, an outline often runs through multiple story lines across many characters, and the character relationships are complicated.
c) Characters in an outline can be ambiguous: interpreting the text requires more context, and one character may have multiple identities or names.
In addition, the existing generative language big model has the following problems:
1) The number of tokens is limited, so the length of the input text is limited.
2) The output result is unstable, random and uncontrollable.
3) The output results do not match the expected results, the focus is not maintained, and the core events are not delimited clearly and strictly.
To handle overlong text in specific application scenarios, some schemes process the text in sections or compress it step by step, but the many processing stages in such approaches add uncontrollable factors to the generated result, and the processing efficiency is low.
Because of its "generality", a generative large language model is not always accurate or stable, whereas reliable results are usually required for specific users in specific industries or for practical applications under specific conditions. For a generative large language model to have "professional" capability, it must be fine-tuned with stable and reliable professional data, but labeling such data and acquiring high-quality data are difficult problems.
Moreover, judging whether a generated result is reliable is something a large model can do well when given a precise definition, because such a definition is simple and clear to state. Exploiting this characteristic, different large models can be used for cross-validation to acquire a large amount of data quickly, automatically or semi-automatically (with a small amount of manual labeling for verification), and accurately fine-tuning the large model then gives it a good capability for handling specific tasks.
Specifically, referring to fig. 4, fig. 4 is a flow chart of a training method of a large language model according to an embodiment of the application. The method comprises the following steps:
training data preparation stage:
1. Crawl a large number of outlines and abstracts, and evaluate and filter the data using a generative large language model, i.e., preprocess the data.
2. Enhance the data using existing open-source large models (i.e., the large language models in the large language model set), specifically through hierarchical compression and step-by-step generation of abstracts, with self-evaluation of the abstracts.
3. Cross-evaluate all data with open-source large models to avoid the bias that large language models have been shown to reproduce and amplify from training data.
4. Set the maximum number of tokens to 100K and encode the data.
Local language model training phase, i.e., the training phase of the industry model:
1. Data pre-normalization: to improve training stability, the input to each Transformer sub-layer is normalized using the RMSNorm normalization function.
2. The SwiGLU activation function is introduced in place of the ReLU nonlinearity to improve performance.
3. An efficient implementation of causal multi-head attention is used to reduce memory usage and runtime (see the sketch after this list).
4. Based on the crawled and generated ultra-long text data pairs, the open-source 65B LLaMA model is pre-trained with the TencentPretrain framework, fine-tuned on data with a context length of 100K, and a standard optimizer is used to train the large-scale Transformer on a large amount of text data.
5. The model is verified using closed-book question answering.
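Item 3 above, the memory-efficient causal attention, can be expressed in recent PyTorch with `scaled_dot_product_attention`, which dispatches to fused kernels where available; this is a sketch under that assumption, not the framework-specific code used in the embodiment.

```python
import torch
import torch.nn.functional as F

def causal_multi_head_attention(q, k, v):
    """q, k, v have shape (batch, heads, seq_len, head_dim).

    is_causal=True applies the autoregressive mask without materializing the full
    attention matrix, reducing memory usage and runtime on long (e.g. 100K-token) inputs.
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example shapes only; real training would use the model's own projection layers.
q = k = v = torch.randn(1, 8, 1024, 64)
out = causal_multi_head_attention(q, k, v)
```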
It should be noted that, in addition to the above descriptions, the present embodiment may further include the technical features described in the above embodiments, so as to achieve the technical effects of the training method of the large language model shown above, and the detailed description is referred to above, and is omitted herein for brevity.
According to the training method for a large language model provided by this embodiment of the application, the number of input tokens is directly expanded by fine-tuning the large model (i.e., the third large language model) on ultra-long text, so the model's reasoning and semantic-understanding capabilities are preserved, and when applied in a specific scenario the model performs more robustly because industry data has been added.
Fig. 5 is a schematic flow chart of a summary extracting method according to an embodiment of the present application.
Specifically, as shown in fig. 5, the method specifically includes:
Step 401, obtaining a target text, wherein the number of characters contained in the target text is greater than or equal to a first number.
In this embodiment, the target text may be any text. As an example, the target text may be a script, paper, or the like.
More characters may be included in the target text.
As an example, the first number may be an integer greater than 10,000.
Step 402, extracting characters from the target text to obtain extracted characters corresponding to the target text.
In this embodiment, the extracted character corresponding to the target text may be a character extracted from the target text.
As an example, characters may be randomly extracted from the target text, thereby obtaining extracted characters corresponding to the target text.
As yet another example, characters may also be extracted from the target text according to a policy, so as to obtain extracted characters corresponding to the target text.
As yet another example, a target extraction policy may be selected from a preset extraction policy set, and then a character is extracted from the target text by using the target extraction policy, so as to obtain an extracted character corresponding to the target text.
The target extraction strategy can be determined by the following method:
Firstly, determining unselected extraction strategies in a preset extraction strategy set to obtain an extraction strategy subset.
Then, for each extraction strategy in the extraction strategy subset, determine the frequency with which the abstracts obtained by inputting the characters extracted by that strategy into the first large language model meet the preset abstract condition.
Then, the extraction strategy with the highest frequency is determined as the target extraction strategy.
Step 403, inputting the extracted characters into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained using any of the above training methods for a large language model, and the third large language model is used for extracting the abstract of a text.
In this embodiment, since the large language model (including the third large language model) has generalization capability, a more accurate abstract of the target text can be obtained by using the third large language model trained in advance regardless of how the extracted characters are obtained.
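At inference time the trained third large language model is applied in the same extract-then-summarize fashion; a minimal usage sketch with placeholder callables:

```python
def extract_long_text_abstract(target_text, extract_chars, third_model_summarize):
    """`extract_chars` applies any of the extraction strategies described above;
    `third_model_summarize` wraps the instruction-tuned third large language model."""
    extracted = extract_chars(target_text)   # step 402
    return third_model_summarize(extracted)  # step 403
```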
It should be noted that, in addition to the above descriptions, the present embodiment may further include the technical features described in the above embodiments, so as to achieve the technical effects of the method for generating sample data shown above, and the detailed description is referred to above, and is omitted herein for brevity.
According to the abstract extraction method provided by the embodiments of the application, a target text whose number of characters is greater than or equal to a first number can be acquired; characters are then extracted from the target text to obtain extracted characters corresponding to the target text; and the extracted characters are input into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained using any of the above training methods for a large language model and is used for extracting the abstract of a text. In this way, a token-expanded large language model can be used to extract the abstract of a long text, which improves the efficiency of extracting abstracts of long texts.
Fig. 6 is a schematic structural diagram of a device for generating sample data according to an embodiment of the present application. The device specifically comprises:
a first obtaining unit 501, configured to obtain a target text, where the number of characters included in the target text is greater than or equal to a first number;
a first extracting unit 502, configured to extract characters from the target text, and obtain extracted characters corresponding to the target text;
A first input unit 503, configured to input the extracted characters into a first large language model trained in advance, to obtain a first abstract of the target text, where the first large language model is used to extract the abstract of the text;
a determining unit 504, configured to determine whether the first summary meets a preset summary condition;
a first generating unit 505, configured to generate sample data based on the target text and the first abstract if the first abstract meets the preset abstract condition, where the sample data is used to fine tune a third large language model, and the third large language model is used to extract an abstract of the text.
In one possible implementation manner, the inputting the extracted character into a first large language model trained in advance, to obtain a first abstract of the target text, includes:
the following abstract extraction steps are performed:
selecting a first large language model which is not selected from a large language model set trained in advance;
inputting the extracted characters into the first large language model to obtain a first abstract of the target text; and
after the determining whether the first abstract meets the preset abstract condition, the device further comprises:
a first execution unit (not shown in the figure), configured to execute the abstract extraction step if the first abstract does not meet the preset abstract condition.
In one possible implementation manner, the extracting the characters from the target text to obtain the extracted characters corresponding to the target text includes:
the following character extraction steps are performed:
randomly extracting a second number of characters from the target text to obtain extracted characters corresponding to the target text; and
after the determining whether the first abstract meets the preset abstract condition, the device further comprises:
and a second execution unit (not shown in the figure), configured to execute the character extraction step if the first abstract does not meet the preset abstract condition.
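A combined sketch of the two retry branches described above (re-extracting a second number of characters at random, and switching to a not-yet-selected first large language model) might look roughly as follows; the loop bound, the generate interface, and the meets_condition helper are assumptions made for illustration.

```python
import random

def generate_sample(target_text, llm_pool, second_number, meets_condition, max_rounds=5):
    """Hypothetical sketch of the sample-generation loop with both retry branches."""
    unselected = list(llm_pool)  # first large language models not yet selected
    for _ in range(max_rounds):
        # character extraction step: randomly extract a second number of characters
        positions = sorted(random.sample(range(len(target_text)),
                                         min(second_number, len(target_text))))
        extracted = "".join(target_text[i] for i in positions)
        llm = unselected[0] if unselected else llm_pool[0]
        first_abstract = llm.generate(f"Summarize:\n{extracted}")
        if meets_condition(first_abstract, target_text):
            return {"text": target_text, "summary": first_abstract}  # sample data
        if unselected:
            unselected.pop(0)  # next round selects another first large language model
    return None  # no abstract met the preset abstract condition
```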
In one possible implementation manner, after the target text is obtained, the apparatus further includes:
a dividing unit (not shown in the figure) for dividing the target text into a set of text segments;
a second input unit (not shown in the figure), configured to input, for each text segment in the text segment set, the text segment into a pre-trained second large language model to obtain an abstract of the text segment, wherein the second large language model is used for extracting the abstract of a text segment;
A merging unit (not shown in the figure) for merging the abstracts of the obtained text segments to obtain a second abstract;
a second generating unit (not shown in the figure) for generating sample data based on the target text and the second abstract.
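The divide / summarize / merge branch described above could be sketched roughly as follows; the fixed segment length and the second large language model's interface are illustrative assumptions rather than values taken from the disclosure.

```python
# Hypothetical sketch of the divide / summarize / merge branch for long target texts.
def second_abstract_sample(target_text, second_llm, segment_len=1500):
    segments = [target_text[i:i + segment_len]
                for i in range(0, len(target_text), segment_len)]  # set of text segments
    segment_abstracts = [second_llm.generate(f"Summarize:\n{seg}") for seg in segments]
    second_abstract = "".join(segment_abstracts)  # merge the abstracts of the text segments
    return {"text": target_text, "summary": second_abstract}  # sample data
```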
The sample data generating device provided in this embodiment may be a sample data generating device as shown in fig. 6, and may perform all the steps of the above-described method for generating sample data, so as to achieve the technical effects of the above-described method for generating sample data, and specific reference should be made to the above-described related description, which is omitted herein for brevity.
Fig. 7 is a schematic structural diagram of a training device for a large language model according to an embodiment of the present application. The device specifically comprises the following units:
a second obtaining unit 601, configured to obtain a training sample set, where a training sample in the training sample set includes sample data generated by using any one of the sample data generating methods described above;
and the training unit 602 is configured to perform instruction fine tuning training on the pre-trained third large language model by using the training sample set, so as to obtain a trained third large language model.
In one possible embodiment of the present application,
The third large language model comprises a Transformer sub-layer, and the input data of the Transformer sub-layer is normalized using an RMSNorm normalization function; and/or
The activation function of the third large language model is a SwiGLU activation function.
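For reference, RMSNorm and a SwiGLU feed-forward block as commonly defined in the literature can be sketched as below; this is a generic PyTorch rendering under those common definitions, not the patented model's actual implementation, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scales x by the reciprocal of its root mean square, then by a learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SiLU-gated feed-forward (SwiGLU), as used in LLaMA-style Transformer blocks."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```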
The training device for a large language model provided in this embodiment may be a training device for a large language model as shown in fig. 7, and may perform all the steps of each training method of a large language model described above, thereby achieving the technical effects of those training methods; the detailed description is omitted here for brevity.
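How a (target text, abstract) pair produced above might be formatted into an instruction fine-tuning sample is sketched below; the prompt wording and field names are assumptions made for illustration only.

```python
# Hypothetical formatting of sample data into an instruction fine-tuning example.
def to_instruction_sample(sample: dict) -> dict:
    return {
        "instruction": "Please extract an abstract of the following text.",
        "input": sample["text"],      # the target text
        "output": sample["summary"],  # the first (or second) abstract
    }
```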
Fig. 8 is a schematic structural diagram of a summary extracting apparatus according to an embodiment of the present application. The apparatus specifically comprises the following units:
a third obtaining unit 701, configured to obtain a target text, where the number of characters included in the target text is greater than or equal to the first number;
a second extracting unit 702, configured to extract characters from the target text, so as to obtain extracted characters corresponding to the target text;
a third input unit 703, configured to input the extracted characters into a pre-trained third large language model to extract the abstract of the target text, where the third large language model is trained by the training method of any one of the large language models described above, and the third large language model is used to extract the abstract of the text.
The summary extracting device provided in this embodiment may be a summary extracting device as shown in fig. 8, and may perform all the steps of each summary extracting method described above, so as to achieve the technical effects of each summary extracting method described above, and specific reference should be made to the above related description, which is omitted herein for brevity.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and the electronic device 800 shown in fig. 9 includes: at least one processor 801, memory 802, at least one network interface 804, and other user interfaces 803. The various components in the electronic device 800 are coupled together by a bus system 805. It is appreciated that the bus system 805 is used to enable connection and communication between these components. The bus system 805 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 805 in fig. 9.
The user interface 803 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It will be appreciated that the memory 802 in embodiments of the application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 802 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 802 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system 8021 and application programs 8022.
The operating system 8021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application 8022 includes various application programs such as a Media Player (Media Player), a Browser (Browser), and the like for realizing various application services. The program for implementing the method of the embodiment of the present application may be contained in the application program 8022.
In this embodiment, the processor 801 is configured to perform the method steps provided by the method embodiments by calling a program or instructions stored in the memory 802, specifically a program or instructions stored in the application 8022, for example including:
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of the text;
Determining whether the first abstract meets preset abstract conditions or not;
and generating sample data based on the target text and the first abstract under the condition that the first abstract meets the preset abstract condition, wherein the sample data is used for fine tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
Or,
acquiring a training sample set, wherein training samples in the training sample set comprise sample data generated by any one of the sample data generation methods described above;
and adopting the training sample set to perform instruction fine tuning training on the third large language model after the pre-training to obtain the third large language model after the training.
Still alternatively,
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained by the training method of any one of the large language models described above, and the third large language model is used for extracting the abstract of the text.
The method disclosed in the above embodiments of the present application may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware in the processor 801 or by instructions in software. The processor 801 described above may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software units may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory 802, and the processor 801 reads the information in the memory 802 and, in combination with its hardware, performs the steps of the above method.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field-Programmable Gate Arrays (FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described in the application, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be an electronic device as shown in fig. 9, and may perform all the steps of the above-described method for generating sample data, so as to achieve the technical effects of the above-described method for generating sample data; for details, reference is made to the related description above, which is not repeated here for brevity.
The embodiment of the application also provides a storage medium (computer readable storage medium). The storage medium here stores one or more programs. Wherein the storage medium may comprise volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid state disk; the memory may also comprise a combination of the above types of memories.
One or more programs in the storage medium, when executed by one or more processors, implement the above method for generating sample data executed on the electronic device side, or the above training method of the large language model, or the above abstract extraction method.
The processor is configured to execute the sample data generation program, the large language model training program, or the abstract extraction program stored in the memory, so as to implement the steps of the following sample data generation method, training method of a large language model, or abstract extraction method executed on the electronic device side:
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
Extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of the text;
determining whether the first abstract meets preset abstract conditions or not;
and generating sample data based on the target text and the first abstract under the condition that the first abstract meets the preset abstract condition, wherein the sample data is used for fine tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
Or,
acquiring a training sample set, wherein training samples in the training sample set comprise sample data generated by any one of the sample data generation methods described above;
and adopting the training sample set to perform instruction fine tuning training on the third large language model after the pre-training to obtain the third large language model after the training.
Still alternatively,
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
Extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained by the training method of any one of the large language models described above, and the third large language model is used for extracting the abstract of the text.
Those of skill would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, the various illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of generating sample data, the method comprising:
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
extracting characters from the target text to obtain extracted characters corresponding to the target text;
inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of the text;
determining whether the first abstract meets preset abstract conditions or not;
and generating sample data based on the target text and the first abstract under the condition that the first abstract meets the preset abstract condition, wherein the sample data is used for fine tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
2. The method of claim 1, wherein the inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text comprises:
the following abstract extraction steps are performed:
selecting a first large language model which is not selected from a large language model set trained in advance;
Inputting the extracted characters into the first large language model to obtain a first abstract of the target text; and
after the determining whether the first abstract meets the preset abstract condition, the method further comprises the following steps:
and executing the abstract extraction step under the condition that the first abstract does not accord with the preset abstract condition.
3. The method of claim 1, wherein the extracting the characters from the target text to obtain the extracted characters corresponding to the target text comprises:
the following character extraction steps are performed:
randomly extracting a second number of characters from the target text to obtain extracted characters corresponding to the target text; and
after the determining whether the first abstract meets the preset abstract condition, the method further comprises the following steps:
and executing the character extraction step under the condition that the first abstract does not accord with the preset abstract condition.
4. A method according to one of claims 1-3, characterized in that after the acquisition of the target text, the method further comprises:
dividing the target text into a text segment set;
inputting, for each text segment in the text segment set, the text segment into a pre-trained second large language model to obtain an abstract of the text segment, wherein the second large language model is used for extracting the abstract of the text segment;
Merging the abstracts of the obtained text segments to obtain a second abstract;
sample data is generated based on the target text and the second summary.
5. A method for training a large language model, the method comprising:
obtaining a set of training samples, wherein training samples in the set of training samples comprise sample data generated using the method of one of claims 1-4;
and adopting the training sample set to perform instruction fine tuning training on the third large language model after the pre-training to obtain the third large language model after the training.
6. The method according to claim 5, wherein:
the third large language model comprises a Transformer sub-layer, and the input data of the Transformer sub-layer is normalized using an RMSNorm normalization function; and/or
The activation function of the third large language model is a SwiGLU activation function.
7. A method for extracting an abstract, the method comprising:
acquiring a target text, wherein the number of characters contained in the target text is greater than or equal to a first number;
extracting characters from the target text to obtain extracted characters corresponding to the target text;
Inputting the extracted characters into a pre-trained third large language model to extract the abstract of the target text, wherein the third large language model is trained by the method as claimed in claim 5 or 6, and the third large language model is used for extracting the abstract of the text.
8. A sample data generating apparatus, the apparatus comprising:
a first obtaining unit, configured to obtain a target text, where the number of characters included in the target text is greater than or equal to a first number;
the first extraction unit is used for extracting characters from the target text to obtain extracted characters corresponding to the target text;
the first input unit is used for inputting the extracted characters into a pre-trained first large language model to obtain a first abstract of the target text, wherein the first large language model is used for extracting the abstract of the text;
the determining unit is used for determining whether the first abstract meets the preset abstract conditions or not;
the first generation unit is used for generating sample data based on the target text and the first abstract under the condition that the first abstract meets the preset abstract condition, wherein the sample data is used for fine tuning a third large language model, and the third large language model is used for extracting the abstract of the text.
9. A training apparatus for a large language model, the apparatus comprising:
a second obtaining unit, configured to obtain a set of training samples, where training samples in the set of training samples include sample data generated by a method according to one of claims 1-4;
and the training unit is used for carrying out instruction fine tuning training on the third large language model subjected to pre-training by adopting the training sample set to obtain the third large language model subjected to training.
10. An abstract extraction apparatus, the apparatus comprising:
a third obtaining unit, configured to obtain a target text, where the number of characters included in the target text is greater than or equal to the first number;
the second extraction unit is used for extracting characters from the target text to obtain extracted characters corresponding to the target text;
a third input unit, configured to input the extracted character into a pre-trained third large language model to extract the abstract of the target text, where the third large language model is trained by using the method according to claim 5 or 6, and the third large language model is used to extract the abstract of the text.
11. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing a computer program stored in said memory, and which, when executed, implements the method of any of the preceding claims 1-7.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of the preceding claims 1-7.
CN202311042082.1A 2023-08-17 2023-08-17 Sample data generation method and training method of large language model Pending CN117113083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311042082.1A CN117113083A (en) 2023-08-17 2023-08-17 Sample data generation method and training method of large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311042082.1A CN117113083A (en) 2023-08-17 2023-08-17 Sample data generation method and training method of large language model

Publications (1)

Publication Number Publication Date
CN117113083A true CN117113083A (en) 2023-11-24

Family

ID=88799434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311042082.1A Pending CN117113083A (en) 2023-08-17 2023-08-17 Sample data generation method and training method of large language model

Country Status (1)

Country Link
CN (1) CN117113083A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117556263A (en) * 2024-01-10 2024-02-13 阿里云计算有限公司 Sample construction method, code generation method, electronic device, and storage medium
CN117556263B (en) * 2024-01-10 2024-04-23 阿里云计算有限公司 Sample construction method, code generation method, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
WO2021082953A1 (en) Machine reading understanding method and apparatus, storage medium, and device
US20110078152A1 (en) Method and system for processing text
CN110945500A (en) Key value memory network
CN117113083A (en) Sample data generation method and training method of large language model
CN110990274A (en) Data processing method, device and system for generating test case
CN115098556A (en) User demand matching method and device, electronic equipment and storage medium
Rabin et al. Syntax-guided program reduction for understanding neural code intelligence models
Kursa Praznik: High performance information-based feature selection
KR20200068775A (en) Automated Smart Contract Tagging System based on Tag Recommendation Model
CN115099233A (en) Semantic analysis model construction method and device, electronic equipment and storage medium
CN111460452A (en) Android malicious software detection method based on frequency fingerprint extraction
Zhou et al. Effective approaches to combining lexical and syntactical information for code summarization
US20200257764A1 (en) Contextualized text description
US20100114924A1 (en) Searching The Internet For Common Elements In A Document In Order To Detect Plagiarism
Elsayed et al. Reverse engineering approach for improving the quality of mobile applications
US11755958B1 (en) Systems and methods for detecting cryptocurrency wallet artifacts in a file system
CN110413284B (en) Lexical analysis method, lexical analysis device, computer equipment and storage medium
CN111563212A (en) Inner chain adding method and device
CN116204692A (en) Webpage data extraction method and device, electronic equipment and storage medium
CN115495556A (en) Document processing method and device
CN113076089B (en) API (application program interface) completion method based on object type
Limaylla-Lunarejo et al. Requirements classification using FastText and BETO in Spanish documents
CN104408198A (en) Method and device for acquiring webpage contents
Hwang et al. System for extracting domain topic using link analysis and searching for relevant features
CN112394984B (en) Firmware code analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination