CN117787257A - Training method, text reasoning method and system for large language model of OTA scene - Google Patents

Training method, text reasoning method and system for large language model of OTA scene

Info

Publication number
CN117787257A
CN117787257A (application number CN202311814467.5A)
Authority
CN
China
Prior art keywords
data
text
training
ota
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311814467.5A
Other languages
Chinese (zh)
Inventor
屈垠岑
江小林
罗超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Network Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Network Technology Shanghai Co Ltd filed Critical Ctrip Travel Network Technology Shanghai Co Ltd
Priority to CN202311814467.5A priority Critical patent/CN117787257A/en
Publication of CN117787257A publication Critical patent/CN117787257A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a training method, a text reasoning method and a system for a large language model of an OTA scene. The training method comprises the following steps: acquiring a pre-training sample set and a multi-task instruction data set; performing screening processing and deduplication processing on the field data and the general data to obtain a first sample set, and pre-training an initial large model according to the first sample set to generate a vertical field large model; and taking the setting instruction data and the task input text data as input and the task output text data as output, fine-tuning the vertical field large model to obtain a fine-tuning large language model. According to the method, the vertical field large model is generated by training on preprocessed OTA field professional data, which enhances the accuracy and generalized understanding of text reasoning over field professional vocabulary; the vertical field large model is then trained on the multi-task instruction data set to generate the fine-tuning large language model, which processes multiple text tasks simultaneously, solves the problem of high deployment cost caused by deploying multiple small models, and improves prediction accuracy.

Description

Training method, text reasoning method and system for large language model of OTA scene
Technical Field
The invention relates to the technical field of machine learning, in particular to a training method, a text reasoning method and a system for a large language model of an OTA scene.
Background
In OTA (Online Travel Agency) scenarios, multiple small models are used to handle the various related types of online tasks. An independent small model must be deployed as a service for each type of online task, so when several different types of online tasks run at the same time, several different small models have to be deployed, which makes deployment costly. Moreover, small models are usually not trained on OTA field data and do not know the professional knowledge of the OTA field, while related tasks in the OTA field often involve strong business logic and require such expertise; small models also have weak natural language understanding and reasoning capability, yet classification and scoring tasks in the OTA field require strong reasoning capability, so the accuracy of their text predictions is low. For example, in a customer service dialogue in which the customer states that they "arrived but did not stay", the language model should infer "did not check in at the hotel", but a small model is only good at recognizing keywords and is weak in generalization and reasoning.
Disclosure of Invention
The invention aims to overcome the prior-art defects of high deployment cost and low prediction accuracy when small models are used for text prediction in the OTA field, and provides a training method, a text reasoning method and a system for a large language model of an OTA scene.
The invention solves the technical problems by the following technical scheme:
in a first aspect, the present invention provides a training method for a large language model of an OTA scene, where the training method includes:
acquiring a pre-training sample set and a multi-task instruction data set; the pre-training sample set comprises field data of the OTA field and general data, and the multi-task instruction data set comprises setting instruction data, task input text data and task output text data corresponding to a plurality of different related tasks in the OTA field;
performing screening processing and deduplication processing on the field data and the general data to obtain a first sample set, and pre-training an initial large model according to the first sample set to generate a vertical field large model;
and taking the setting instruction data and the task input text data as input and the task output text data as output, fine-tuning the vertical field large model to obtain a fine-tuning large language model; the fine-tuning large language model is used for simultaneously serving a plurality of text reasoning request tasks according to a plurality of setting instruction data.
Preferably, the screening processing includes text validity processing and heuristic rule processing, and the deduplication processing includes at least one of exact deduplication processing, quality deduplication processing and fuzzy deduplication processing;
the text validity processing is used for screening out invalid text data in the pre-training sample set in which the total number of symbols is greater than a first preset threshold;
the heuristic rule processing is used for screening out invalid text data in the pre-training sample set containing a set first number of consecutive non-Chinese character segments;
the exact deduplication processing is used for screening out invalid text data in the pre-training sample set whose repetition length is greater than a set second number;
the quality deduplication processing is used for screening out poor-quality invalid text data in the pre-training sample set;
and the fuzzy deduplication processing is used for screening out invalid text data in the pre-training sample set whose similarity to other text data is greater than a second preset threshold.
Preferably, the field data comprises at least one of OTA training data, travel guide data, hotel and scenic spot profile data, hotel and scenic spot review data and customer service dialogue data; the general data comprises at least one of encyclopedia data material, book material, web blog material and news material; and the setting instruction data comprises at least one of an extraction instruction, a classification instruction, a summary instruction and an emotion scoring instruction.
In a second aspect, the present invention provides a text reasoning method, the text reasoning method comprising:
training a fine tuning large language model by using the training method of the large language model of the OTA scene according to the first aspect;
acquiring target instruction data of a plurality of tasks and corresponding task original text;
and simultaneously inputting the target instruction data and the corresponding task original text into the fine-tuning large language model, and reasoning to obtain the corresponding task target text.
In a third aspect, the present invention provides a training system for a large language model of an OTA scene, the training system comprising:
the acquisition module is used for acquiring a pre-training sample set and a multi-task instruction data set; the pre-training sample set comprises field data of the OTA field and general data, and the multi-task instruction data set comprises setting instruction data, task input text data and task output text data corresponding to a plurality of different related tasks in the OTA field;
the processing module is used for performing screening processing and deduplication processing on the field data and the general data to obtain a first sample set, and pre-training an initial large model according to the first sample set to generate a vertical field large model;
and the fine tuning module is used for taking the setting instruction data and the task input text data as input and the task output text data as output, and fine-tuning the vertical field large model to obtain a fine-tuning large language model; the fine-tuning large language model is used for simultaneously serving a plurality of text reasoning request tasks according to a plurality of setting instruction data.
Preferably, the screening processing includes text validity processing and heuristic rule processing, and the deduplication processing includes at least one of exact deduplication processing, quality deduplication processing and fuzzy deduplication processing;
the text validity processing is used for screening out invalid text data in the pre-training sample set in which the total number of symbols is greater than a first preset threshold;
the heuristic rule processing is used for screening out invalid text data in the pre-training sample set containing a set first number of consecutive non-Chinese character segments;
the exact deduplication processing is used for screening out invalid text data in the pre-training sample set whose repetition length is greater than a set second number;
the quality deduplication processing is used for screening out poor-quality invalid text data in the pre-training sample set;
and the fuzzy deduplication processing is used for screening out invalid text data in the pre-training sample set whose similarity to other text data is greater than a second preset threshold.
Preferably, the field data comprises at least one of OTA training data, travel guide data, hotel and scenic spot profile data, hotel and scenic spot review data and customer service dialogue data; the general data comprises at least one of encyclopedia data material, book material, web blog material and news material; and the setting instruction data comprises at least one of extraction task data, classification task data, summary task data and scoring task data.
In a fourth aspect, the present invention provides a text reasoning system, comprising:
the training system for a large language model of an OTA scene according to the third aspect, configured to train a fine-tuning large language model;
the text acquisition module is used for acquiring target instruction data of a plurality of tasks and corresponding task original texts;
and the input module is used for inputting the target instruction data and the corresponding task original text into the fine-tuning large language model at the same time, and reasoning to obtain the corresponding task target text.
In a fifth aspect, the present invention further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing a training method for a large language model of an OTA scene as described in any one of the above, or performing a text reasoning method as described in the above.
In a sixth aspect, the present invention further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a large language model of an OTA scene as described in any of the above, or performs a method of text reasoning as described above.
The invention has the following positive effects: the training method generates a vertical field large model by training on preprocessed OTA field professional data, which enhances the accuracy and generalized understanding of text reasoning over OTA field professional vocabulary; the vertical field large model is then trained on the multi-task instruction data set to generate a fine-tuning large language model, which processes multiple text tasks in the OTA field simultaneously, solves the problem of high deployment cost caused by deploying multiple small models, and improves prediction accuracy.
Drawings
Fig. 1 is a flowchart of a training method of a large language model of an OTA scenario in embodiment 1 of the present invention.
Fig. 2 is a schematic block diagram of a training system for a large language model of an OTA scenario according to embodiment 2 of the present invention.
Fig. 3 is a flowchart of a text reasoning method of embodiment 3 of the present invention.
Fig. 4 is a schematic block diagram of a text reasoning system in embodiment 4 of the present invention.
Fig. 5 is a schematic hardware structure of an electronic device according to embodiment 5 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
The training method of the large language model of the OTA scene of the embodiment, as shown in fig. 1, includes:
s11, acquiring a pre-training sample set and a multi-task instruction data set; the pre-training sample set comprises field data and general data in the OTA field, and the multi-task instruction data set comprises setting instruction data, task input text data and task output text data corresponding to a plurality of different related tasks in the OTA field;
s12, screening and de-duplication processing are carried out on the field data and the general data to obtain a first sample set, and the initial large model is pre-trained according to the first sample set to generate a vertical field large model;
s13, taking the set instruction data and the task input text data as input, and taking the task output text data as output fine tuning training vertical field large model to obtain a fine tuning large language model; the fine-tuning large language model is used for simultaneously serving a plurality of text-reasoning request tasks according to a plurality of setting instruction data.
For step S11, the number of corpus samples contained in the pre-training sample set may be set according to actual requirements. For example, the field data may include a large amount of customer service dialogue data and a small amount of OTA training data, or a large amount of travel guide data and a small amount of hotel and scenic spot review data; after the initial large model is trained on such a pre-training sample set, it has a stronger understanding of the professional knowledge of the OTA field. The corpus sample settings of the general data are similar to those of the field data and are not described in detail here. The multi-task instruction data set may contain the setting instruction data, task input text data and task output text data corresponding to a plurality of different but related tasks.
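For illustration only, one record of such a multi-task instruction data set might be stored as sketched below; the field names ("instruction", "input", "output") and the example texts are assumptions and are not taken from the patent itself.

```python
# Hypothetical illustration of multi-task instruction records for the OTA field.
# Field names and sample texts are assumptions, not details given in the patent.
ota_instruction_samples = [
    {
        "instruction": "Classify the customer's intent in the following dialogue.",
        "input": "Customer: I arrived at the hotel but there were no rooms left.",
        "output": "complaint - overbooking / unable to check in",
    },
    {
        "instruction": "Extract the hotel facilities mentioned in the review.",
        "input": "The pool was clean and the gym was well equipped.",
        "output": "pool; gym",
    },
    {
        "instruction": "Give an emotion score between 1 and 5 for this review.",
        "input": "Terrible experience, the room smelled of smoke.",
        "output": "1",
    },
]
```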
For step S12, in this example the Baichuan model is used as the initial large model. The first sample set, formed from the field data and the general data after the screening processing and the deduplication processing, is used as the input data of the Baichuan model, and each word of the text is used as a training label for the Baichuan model (i.e., the model learns to predict the next word), so that the vertical field large model is generated. The vertical field large model obtained through this training can recognize both the professional knowledge of the OTA field and general encyclopedic knowledge, which improves the model's understanding of OTA field corpus information.
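The following minimal sketch illustrates how such a pre-training sample could be prepared, assuming a standard causal-language-modelling setup in which each token also serves as its own label; the tokenizer call and the commented checkpoint name are assumptions, not details given in the patent.

```python
# Minimal sketch of preparing a causal-language-modelling pre-training sample, where
# each token of the cleaned text also serves as its own training label (the one-step
# shift is applied inside the model).  The checkpoint name below is an assumption.
def build_pretraining_example(text: str, tokenizer, max_len: int = 2048) -> dict:
    enc = tokenizer(text, truncation=True, max_length=max_len)
    # For causal LM pre-training the labels are simply the input ids; the model
    # predicts token t+1 from tokens 0..t.
    return {"input_ids": enc["input_ids"], "labels": list(enc["input_ids"])}

# Example usage (requires the `transformers` package):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
# sample = build_pretraining_example("上海某酒店距离外滩约1.5公里，提供免费早餐。", tokenizer)
```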
For step S13, instruction fine-tuning training is performed on the vertical field large model based on the multiple different parallel setting instruction data and the corresponding task input text data in the multi-task instruction data set, so that the fine-tuning large language model can understand and respond to the different setting instruction data entered by a user and learn the OTA field knowledge correspondences between different instructions, thereby processing multiple text tasks simultaneously and improving model prediction accuracy.
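A hypothetical sketch of how one instruction record could be turned into a supervised fine-tuning pair is shown below; the prompt template is an assumption, since the patent only states that the setting instruction data and the task input text data form the input and the task output text data forms the target.

```python
# Hypothetical sketch of turning one multi-task instruction record into a supervised
# fine-tuning pair.  The prompt template is an assumption for illustration only.
def build_sft_pair(record: dict) -> tuple[str, str]:
    prompt = f"Instruction: {record['instruction']}\nInput: {record['input']}\nAnswer:"
    target = record["output"]
    return prompt, target

prompt, target = build_sft_pair({
    "instruction": "Summarize the core content of the hotel review.",
    "input": "The lobby was renovated last year, check-in took five minutes, staff were friendly.",
    "output": "Fast check-in, friendly staff, recently renovated lobby.",
})
```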
In one embodiment, the screening processing includes text validity processing and heuristic rule processing, and the deduplication processing includes at least one of exact deduplication processing, quality deduplication processing and fuzzy deduplication processing;
the text validity processing is used for screening out invalid text data in the pre-training sample set in which the total number of symbols is greater than a first preset threshold;
the heuristic rule processing is used for screening out invalid text data in the pre-training sample set containing a set first number of consecutive non-Chinese character segments;
the exact deduplication processing is used for screening out invalid text data in the pre-training sample set whose repetition length is greater than a set second number;
the quality deduplication processing is used for screening out poor-quality invalid text data in the pre-training sample set;
and the fuzzy deduplication processing is used for screening out invalid text data in the pre-training sample set whose similarity to other text data is greater than a second preset threshold.
Specifically, preprocessing of the pre-training sample set comprises, in order, text validity processing, heuristic rule processing, exact deduplication processing, quality deduplication processing and fuzzy deduplication processing.
In the text validity processing, the pre-training sample set is first screened according to the number of dialogue turns (more than 3) and the dialogue length (more than 50 words); then, the proportion of symbol characters in each piece of text is calculated, and invalid text data whose symbol proportion is greater than 0.3 is filtered out. In the heuristic rule processing, invalid text data containing segments of more than ten consecutive non-Chinese characters is removed, and invalid text data that does not end with a terminal punctuation mark (i.e., a period, exclamation mark, question mark or closing quotation mark) is removed.
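For illustration, a minimal sketch of the symbol-ratio and non-Chinese-run filters described above might look as follows; the 0.3 ratio, the run of more than ten non-Chinese characters and the terminal-punctuation rule follow the description, while the exact character classes are assumptions.

```python
import re

# Illustrative sketch of the text validity and heuristic rule filters.
SYMBOL = re.compile(r"[^\w\s\u4e00-\u9fff]")             # punctuation / symbol characters (assumed class)
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]{11,}")   # more than ten consecutive non-Chinese characters
TERMINATORS = ("。", "！", "？", "”", ".", "!", "?")       # assumed set of terminal punctuation marks

def keep_text(text: str) -> bool:
    if not text:
        return False
    if len(SYMBOL.findall(text)) / len(text) > 0.3:      # text-validity rule: symbol ratio too high
        return False
    if NON_CHINESE_RUN.search(text):                     # heuristic rule: long non-Chinese fragment
        return False
    if not text.rstrip().endswith(TERMINATORS):          # heuristic rule: must end with a terminator
        return False
    return True
```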
In the exact deduplication processing, invalid text data with a repetition length greater than 100 words is removed using a suffix array algorithm. In the quality deduplication processing, low-quality invalid text data that does not read fluently is filtered out. In the fuzzy deduplication processing, invalid text data with a similarity greater than 0.8 is removed using an LSH (locality-sensitive hashing) algorithm.
Illustratively, the specific steps of the exact deduplication process using a suffix array algorithm to remove invalid text data having a repetition length greater than 100 words may include, but are not limited to: constructing a suffix array, traversing the suffix array, comparing the suffixes, and removing the repeated text.
Constructing the suffix array means sorting the suffixes of all text data in dictionary order to build a suffix array. The suffix array stores the starting position of each suffix; because the suffixes are sorted, suffixes of the string can be searched and compared quickly. Traversing the suffix array means starting from the first suffix in the suffix array and visiting the suffixes one by one. Comparing suffixes means that each suffix is compared with the suffix that follows it, and it is judged whether repeated text data exists; whether two suffixes repeat can be determined by comparing their prefixes. Removing the repeated text means that, if repeated text data is found and its length is greater than 100 words, its starting position and length are recorded and the repeated text is removed.
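A simplified sketch of this suffix-array based exact deduplication is given below; a production system would use an O(n log n) suffix array construction, whereas this naive version is for illustration only.

```python
# Naive sketch of suffix-array based detection of long repeated spans.
def find_long_repeats(text: str, min_len: int = 100) -> list[tuple[int, int]]:
    n = len(text)
    suffix_array = sorted(range(n), key=lambda i: text[i:])   # 1. build suffix array (naive sort)
    repeats = []
    for a, b in zip(suffix_array, suffix_array[1:]):          # 2. traverse adjacent suffixes
        lcp = 0
        while a + lcp < n and b + lcp < n and text[a + lcp] == text[b + lcp]:
            lcp += 1                                          # 3. compare common prefixes of suffixes
        if lcp >= min_len:
            repeats.append((min(a, b), lcp))                  # 4. record start position and length
    return repeats
```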
Illustratively, the specific steps of using the LSH algorithm in the fuzzy deduplication processing to remove invalid text data with a similarity greater than 0.8 may include, but are not limited to: word segmentation and feature extraction, MinHash signature construction, LSH bucket construction, data mapping, similarity calculation, and removal of similar data.
Word segmentation and feature extraction means segmenting the text data into words and extracting key features. MinHash signature construction means that, for each feature set, a MinHash signature is generated using multiple hash functions; MinHash is a method for approximating the similarity of sets, in which the elements of a set are randomly permuted and the first element after permutation is taken as the MinHash value. LSH bucket construction means building a corresponding number of LSH buckets according to the number of MinHash signatures, each bucket being used to store similar data. Data mapping means computing the MinHash signature of each piece of text data and assigning the data to the corresponding LSH bucket according to its signature. Similarity calculation means computing, for the data in each bucket, its similarity to the other data in the bucket; a common similarity measure such as cosine similarity may be used. Removal of similar data means removing data whose similarity is greater than 0.8.
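The following sketch illustrates the MinHash and LSH bucketing steps described above; the shingle size, number of hash functions and banding scheme are assumptions, and only the 0.8 similarity threshold comes from the description.

```python
import hashlib
import re
from collections import defaultdict

def shingles(text: str, k: int = 3) -> set[str]:
    # Crude tokenisation into single Chinese characters and alphanumeric words (assumed).
    tokens = re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)
    return {"".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def minhash_signature(features: set[str], num_hashes: int = 64) -> list[int]:
    # One seeded hash per position; the minimum hash over all features is the MinHash value.
    return [
        min(int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16) for f in features)
        for seed in range(num_hashes)
    ]

def lsh_buckets(signatures: dict[int, list[int]], bands: int = 16) -> dict[tuple, list[int]]:
    # Split each signature into bands; documents sharing any band land in the same bucket.
    buckets = defaultdict(list)
    rows = len(next(iter(signatures.values()))) // bands
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return buckets

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)
```

Documents that land in the same bucket would then be compared pairwise (for example with Jaccard or cosine similarity), and those whose similarity exceeds 0.8 would be removed.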
In one embodiment, the field data includes at least one of OTA training data, travel guide data, hotel and scenic spot profile data, hotel and scenic spot review data, and customer service dialogue data.
In one embodiment, the general data includes at least one of encyclopedia data material, book material, web blog material, and news material.
Training on both the field data and the general data enhances the vertical field large model's understanding of natural language and improves its generalization capability and adaptability.
In one embodiment, the setting instruction data includes at least one of an extraction instruction, a classification instruction, a summary instruction, and an emotion scoring instruction.
Specifically, the classification instruction represents classifying the task input text data, the extraction instruction represents extracting keywords from the task input text data, the summary instruction represents summarizing the core content of the task input text data, and the emotion scoring instruction represents scoring the emotional state of the user expressed in the task input text data.
In this embodiment, a training method for a large language model of an OTA scene is provided. A vertical field large model is generated by training on preprocessed OTA field professional data, which enhances the accuracy and generalized understanding of text reasoning over OTA field professional vocabulary; the vertical field large model is then trained on the multi-task instruction data set to generate a fine-tuning large language model, which processes multiple text tasks in the OTA field simultaneously, solves the problem of high deployment cost caused by deploying multiple small models, and improves prediction accuracy.
Example 2
The training system of the large language model of the OTA scene of the present embodiment, as shown in fig. 2, includes: an acquisition module 210, a processing module 220, and a fine tuning module 230.
Wherein the acquisition module 210 is configured to acquire a pre-training sample set and a multi-task instruction data set; the pre-training sample set comprises field data of the OTA field and general data, and the multi-task instruction data set comprises setting instruction data, task input text data and task output text data corresponding to a plurality of different related tasks in the OTA field;
the processing module 220 is configured to perform screening processing and deduplication processing on the field data and the general data to obtain a first sample set, and to pre-train the initial large model according to the first sample set to generate a vertical field large model;
the fine tuning module 230 is configured to take the setting instruction data and the task input text data as input and the task output text data as output, and to fine-tune the vertical field large model to obtain a fine-tuning large language model; the fine-tuning large language model is used for simultaneously serving a plurality of text reasoning request tasks according to a plurality of setting instruction data.
The number of corpus samples contained in the pre-training sample set acquired by the acquisition module 210 may be set according to actual requirements. For example, the field data may include a large amount of customer service dialogue data and a small amount of OTA training data, or a large amount of travel guide data and a small amount of hotel and scenic spot review data; after the initial large model is trained on such a pre-training sample set, it has a stronger understanding of the professional knowledge of the OTA field. The corpus sample settings of the general data are similar to those of the field data and are not described in detail here. The multi-task instruction data set may contain the setting instruction data, task input text data and task output text data corresponding to a plurality of different but related tasks.
In this example the Baichuan model is used as the initial large model. The processing module 220 uses the first sample set, formed from the field data and the general data after the screening processing and the deduplication processing, as the input data of the Baichuan model, and each word of the text is used as a training label for the Baichuan model (i.e., the model learns to predict the next word), so that the vertical field large model is generated. The vertical field large model obtained through this training can recognize both the professional knowledge of the OTA field and general encyclopedic knowledge, which improves the model's understanding of OTA field corpus information.
The fine tuning module 230 performs instruction fine-tuning training on the vertical field large model based on the multiple different parallel setting instruction data and the corresponding task input text data in the multi-task instruction data set, so that the fine-tuning large language model can understand and respond to the different setting instruction data entered by a user and learn the OTA field knowledge correspondences between different instructions, thereby processing multiple text tasks simultaneously and improving model prediction accuracy.
In one embodiment, the screening processing includes text validity processing and heuristic rule processing, and the deduplication processing includes at least one of exact deduplication processing, quality deduplication processing and fuzzy deduplication processing;
the text validity processing is used for screening out invalid text data in the pre-training sample set in which the total number of symbols is greater than a first preset threshold;
the heuristic rule processing is used for screening out invalid text data in the pre-training sample set containing a set first number of consecutive non-Chinese character segments;
the exact deduplication processing is used for screening out invalid text data in the pre-training sample set whose repetition length is greater than a set second number;
the quality deduplication processing is used for screening out poor-quality invalid text data in the pre-training sample set;
and the fuzzy deduplication processing is used for screening out invalid text data in the pre-training sample set whose similarity to other text data is greater than a second preset threshold.
Specifically, preprocessing of the pre-training sample set comprises, in order, text validity processing, heuristic rule processing, exact deduplication processing, quality deduplication processing and fuzzy deduplication processing.
In the text validity processing, the pre-training sample set is first screened according to the number of dialogue turns (more than 3) and the dialogue length (more than 50 words); then, the proportion of symbol characters in each piece of text is calculated, and invalid text data whose symbol proportion is greater than 0.3 is filtered out. In the heuristic rule processing, invalid text data containing segments of more than ten consecutive non-Chinese characters is removed, and invalid text data that does not end with a terminal punctuation mark (i.e., a period, exclamation mark, question mark or closing quotation mark) is removed.
In the exact deduplication processing, invalid text data with a repetition length greater than 100 words is removed using a suffix array algorithm. In the quality deduplication processing, low-quality invalid text data that does not read fluently is filtered out. In the fuzzy deduplication processing, invalid text data with a similarity greater than 0.8 is removed using an LSH (locality-sensitive hashing) algorithm.
Illustratively, the specific steps of the exact deduplication process using a suffix array algorithm to remove invalid text data having a repetition length greater than 100 words may include, but are not limited to: constructing a suffix array, traversing the suffix array, comparing the suffixes, and removing the repeated text.
Constructing the suffix array means sorting the suffixes of all text data in dictionary order to build a suffix array. The suffix array stores the starting position of each suffix; because the suffixes are sorted, suffixes of the string can be searched and compared quickly. Traversing the suffix array means starting from the first suffix in the suffix array and visiting the suffixes one by one. Comparing suffixes means that each suffix is compared with the suffix that follows it, and it is judged whether repeated text data exists; whether two suffixes repeat can be determined by comparing their prefixes. Removing the repeated text means that, if repeated text data is found and its length is greater than 100 words, its starting position and length are recorded and the repeated text is removed.
Illustratively, the specific steps of using the LSH algorithm in the fuzzy deduplication processing to remove invalid text data with a similarity greater than 0.8 may include, but are not limited to: word segmentation and feature extraction, MinHash signature construction, LSH bucket construction, data mapping, similarity calculation, and removal of similar data.
Word segmentation and feature extraction means segmenting the text data into words and extracting key features. MinHash signature construction means that, for each feature set, a MinHash signature is generated using multiple hash functions; MinHash is a method for approximating the similarity of sets, in which the elements of a set are randomly permuted and the first element after permutation is taken as the MinHash value. LSH bucket construction means building a corresponding number of LSH buckets according to the number of MinHash signatures, each bucket being used to store similar data. Data mapping means computing the MinHash signature of each piece of text data and assigning the data to the corresponding LSH bucket according to its signature. Similarity calculation means computing, for the data in each bucket, its similarity to the other data in the bucket; a common similarity measure such as cosine similarity may be used. Removal of similar data means removing data whose similarity is greater than 0.8.
In one embodiment, the field data includes at least one of OTA training data, travel guide data, hotel and scenic spot profile data, hotel and scenic spot review data, and customer service dialogue data.
In one embodiment, the general data includes at least one of encyclopedia data material, book material, web blog material, and news material.
Training on both the field data and the general data enhances the vertical field large model's understanding of natural language and improves its generalization capability and adaptability.
In one embodiment, the setting instruction data includes at least one of an extraction instruction, a classification instruction, a summary instruction, and an emotion scoring instruction.
Specifically, the classification instruction represents classifying the task input text data, the extraction instruction represents extracting keywords from the task input text data, the summary instruction represents summarizing the core content of the task input text data, and the emotion scoring instruction represents scoring the emotional state of the user expressed in the task input text data.
In this embodiment, a training system for a large language model of an OTA scene is provided. The processing module generates a vertical field large model by training on preprocessed OTA field professional data, which enhances the accuracy and generalized understanding of text reasoning over OTA field professional vocabulary; the fine tuning module then trains the vertical field large model on the multi-task instruction data set to generate a fine-tuning large language model, so that multiple text tasks in the OTA field are processed simultaneously, the problem of high deployment cost caused by deploying multiple small models is solved, and prediction accuracy is improved.
Example 3
As shown in fig. 3, the text reasoning method of this embodiment includes:
s31, training a fine tuning large language model by using the training method of the large language model of the OTA scene as in the embodiment 1;
s32, acquiring target instruction data of a plurality of tasks and corresponding task original texts;
s33, inputting the target instruction data and the corresponding task original text into a fine-tuning large language model at the same time, and reasoning to obtain the corresponding task target text.
Specifically, the target instruction data includes at least one of an extraction instruction, a classification instruction, a summary instruction and an emotion scoring instruction in the OTA scene; the multiple tasks are correlated within the OTA field while also differing from one another to some degree. The target instruction data of the multiple tasks and the corresponding task original texts are input to the fine-tuning large language model obtained through the two stages of domain pre-training and multi-task fine-tuning, so that the fine-tuning large language model can simultaneously serve multiple different, strongly correlated tasks in the OTA field.
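A hedged sketch of serving several such tasks with one fine-tuned model is given below; the generation calls follow the Hugging Face `transformers` convention, and the checkpoint path and prompt template are assumptions rather than details from the patent.

```python
# Sketch of multi-task inference with a single fine-tuned model (assumed prompt format).
def infer_tasks(model, tokenizer, requests: list, max_new_tokens: int = 128) -> list:
    answers = []
    for req in requests:
        prompt = f"Instruction: {req['instruction']}\nInput: {req['text']}\nAnswer:"
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        gen = model.generate(ids, max_new_tokens=max_new_tokens)
        # Strip the prompt tokens and decode only the newly generated answer.
        answers.append(tokenizer.decode(gen[0][ids.shape[1]:], skip_special_tokens=True))
    return answers

# Example usage (assumed checkpoint path, requires `transformers`):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("path/to/ota-finetuned-model", trust_remote_code=True)
# model = AutoModelForCausalLM.from_pretrained("path/to/ota-finetuned-model", trust_remote_code=True)
# requests = [
#     {"instruction": "Classify the customer's intent.", "text": "I arrived but could not check in."},
#     {"instruction": "Summarize the review.", "text": "Great location, slow elevator, decent breakfast."},
# ]
# print(infer_tasks(model, tokenizer, requests))
```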
In this embodiment, a text reasoning method is provided: the target instruction data and the corresponding task original text are input into the fine-tuning large language model, and the fine-tuning large language model performs reasoning to obtain the corresponding task target text. In this way, the same large model is adapted to multiple different tasks in the OTA field at the same time, which improves the accuracy of text reasoning in the OTA field and reduces model deployment cost.
Example 4
The text reasoning system of the present embodiment, as shown in fig. 4, includes:
the training system for a large language model of an OTA scene in the above embodiment 3 is configured to train a fine-tuned large language model;
the acquisition module 410 is configured to acquire target instruction data of a plurality of tasks and corresponding task original text;
and the input module 420 is used for inputting the target instruction data and the corresponding task original text into the fine-tuning large language model at the same time, and reasoning to obtain the corresponding task target text.
Specifically, the target instruction data includes at least one of an extraction instruction, a classification instruction, a summary instruction and an emotion scoring instruction in the OTA scene; the multiple tasks are correlated within the OTA field while also differing from one another to some degree. The target instruction data of the multiple tasks and the corresponding task original texts are input to the fine-tuning large language model obtained through the two stages of domain pre-training and multi-task fine-tuning, so that the fine-tuning large language model can simultaneously serve multiple different, strongly correlated tasks in the OTA field.
In this embodiment, a text reasoning system is provided: the acquisition module 410 acquires the target instruction data and the corresponding task original text, the input module 420 inputs them into the fine-tuning large language model, and the fine-tuning large language model performs reasoning to obtain the corresponding task target text. In this way, the same large model is adapted to multiple different tasks in the OTA field at the same time, which improves the accuracy of text reasoning in the OTA field and reduces model deployment cost.
Example 5
Fig. 5 is a schematic structural diagram of an electronic device according to the present embodiment. The electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the steps of the training method of the large language model of the OTA scene of embodiment 1 or of the text reasoning method of embodiment 3. The electronic device 90 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 5, the electronic device 90 may be embodied in the form of a general-purpose computing device, which may be, for example, a server device. Components of the electronic device 90 may include, but are not limited to: at least one processor 91, at least one memory 92, and a bus 93 connecting the different system components (including the memory 92 and the processor 91).
The bus 93 includes a data bus, an address bus, and a control bus.
The memory 92 may include volatile memory such as Random Access Memory (RAM) 921 and/or cache memory 922, and may further include Read Only Memory (ROM) 923.
Memory 92 may also include a program/utility 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 91 executes various functional applications and data processing, such as the training method of the large language model of the OTA scene of embodiment 1 of the present invention or the text reasoning method of embodiment 3, by running a computer program stored in the memory 92.
The electronic device 90 may also communicate with one or more external devices 94 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through an input/output (I/O) interface 95. The electronic device 90 may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, via a network adapter 96. As shown in fig. 5, the network adapter 96 communicates with the other modules of the electronic device 90 via the bus 93. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with the electronic device 90, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of an electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present invention. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Example 6
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method of the large language model of the OTA scene of embodiment 1 or the text reasoning method of embodiment 3.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to carry out the steps of the training method of the large language model of the OTA scene of embodiment 1 or of the text reasoning method of embodiment 3.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (10)

1. A training method for a large language model of an OTA scene, the training method comprising:
acquiring a pre-training sample set and a multi-task instruction data set; the pre-training sample set comprises field data of the OTA field and general data, and the multi-task instruction data set comprises setting instruction data, task input text data and task output text data corresponding to a plurality of different related tasks in the OTA field;
performing screening processing and deduplication processing on the field data and the general data to obtain a first sample set, and pre-training an initial large model according to the first sample set to generate a vertical field large model;
and taking the setting instruction data and the task input text data as input and the task output text data as output, fine-tuning the vertical field large model to obtain a fine-tuning large language model; the fine-tuning large language model is used for simultaneously serving a plurality of text reasoning request tasks according to a plurality of setting instruction data.
2. The training method for a large language model of an OTA scene of claim 1, wherein the screening processing includes text validity processing and heuristic rule processing, and the deduplication processing includes at least one of exact deduplication processing, quality deduplication processing and fuzzy deduplication processing;
the text validity processing is used for screening out invalid text data in the pre-training sample set in which the total number of symbols is greater than a first preset threshold;
the heuristic rule processing is used for screening out invalid text data in the pre-training sample set containing a set first number of consecutive non-Chinese character segments;
the exact deduplication processing is used for screening out invalid text data in the pre-training sample set whose repetition length is greater than a set second number;
the quality deduplication processing is used for screening out poor-quality invalid text data in the pre-training sample set;
and the fuzzy deduplication processing is used for screening out invalid text data in the pre-training sample set whose similarity to other text data is greater than a second preset threshold.
3. The training method for a large language model of an OTA scene of claim 1, wherein the field data comprises at least one of OTA training data, travel guide data, hotel and scenic spot profile data, hotel and scenic spot review data and customer service dialogue data; the general data comprises at least one of encyclopedia data material, book material, web blog material and news material; and the setting instruction data comprises at least one of an extraction instruction, a classification instruction, a summary instruction and an emotion scoring instruction.
4. A text reasoning method, characterized in that the text reasoning method comprises:
training a fine-tuning large language model by using the training method of the large language model of an OTA scene according to any one of claims 1-3;
acquiring target instruction data of a plurality of tasks and corresponding task original text;
and simultaneously inputting the target instruction data and the corresponding task original text into the fine-tuning large language model, and reasoning to obtain the corresponding task target text.
5. A training system for a large language model of an OTA scene, the training system comprising:
the acquisition module is used for acquiring a pre-training sample set and a multi-task instruction data set; the pre-training sample set comprises field data of the OTA field and general data, and the multi-task instruction data set comprises setting instruction data, task input text data and task output text data corresponding to a plurality of different related tasks in the OTA field;
the processing module is used for performing screening processing and deduplication processing on the field data and the general data to obtain a first sample set, and pre-training an initial large model according to the first sample set to generate a vertical field large model;
and the fine tuning module is used for taking the setting instruction data and the task input text data as input and the task output text data as output, and fine-tuning the vertical field large model to obtain a fine-tuning large language model; the fine-tuning large language model is used for simultaneously serving a plurality of text reasoning request tasks according to a plurality of setting instruction data.
6. The training system for a large language model of an OTA scene of claim 5, wherein the screening processing includes text validity processing and heuristic rule processing, and the deduplication processing includes at least one of exact deduplication processing, quality deduplication processing and fuzzy deduplication processing;
the text validity processing is used for screening out invalid text data in the pre-training sample set in which the total number of symbols is greater than a first preset threshold;
the heuristic rule processing is used for screening out invalid text data in the pre-training sample set containing a set first number of consecutive non-Chinese character segments;
the exact deduplication processing is used for screening out invalid text data in the pre-training sample set whose repetition length is greater than a set second number;
the quality deduplication processing is used for screening out poor-quality invalid text data in the pre-training sample set;
and the fuzzy deduplication processing is used for screening out invalid text data in the pre-training sample set whose similarity to other text data is greater than a second preset threshold.
7. The training system for a large language model of an OTA scene of claim 5, wherein the field data comprises at least one of OTA training data, travel guide data, hotel and scenic spot profile data, hotel and scenic spot review data and customer service dialogue data; the general data comprises at least one of encyclopedia data material, book material, web blog material and news material; and the setting instruction data comprises at least one of extraction task data, classification task data, summary task data and scoring task data.
8. A text reasoning system, the text reasoning system comprising:
the training system of a large language model of an OTA scene of any one of claims 5-7 for training a fine-tuned large language model;
the text acquisition module is used for acquiring target instruction data of a plurality of tasks and corresponding task original texts;
and the input module is used for inputting the target instruction data and the corresponding task original text into the fine-tuning large language model at the same time, and reasoning to obtain the corresponding task target text.
9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, which when executed by the processor implements a method of training a large language model of an OTA scene as claimed in any one of claims 1-3 or performs a method of text reasoning as claimed in claim 4.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a large language model of an OTA scene according to any one of claims 1-3 or performs a method of text reasoning according to claim 4.
CN202311814467.5A 2023-12-26 2023-12-26 Training method, text reasoning method and system for large language model of OTA scene Pending CN117787257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311814467.5A CN117787257A (en) 2023-12-26 2023-12-26 Training method, text reasoning method and system for large language model of OTA scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311814467.5A CN117787257A (en) 2023-12-26 2023-12-26 Training method, text reasoning method and system for large language model of OTA scene

Publications (1)

Publication Number Publication Date
CN117787257A true CN117787257A (en) 2024-03-29

Family

ID=90401415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311814467.5A Pending CN117787257A (en) 2023-12-26 2023-12-26 Training method, text reasoning method and system for large language model of OTA scene

Country Status (1)

Country Link
CN (1) CN117787257A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118228183A (en) * 2024-05-23 2024-06-21 航天科工集团科技保障中心有限公司 Supply chain logistics data monitoring method and system based on edge data acquisition



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination