CN117272052B - Large language model training method, device, equipment and storage medium - Google Patents

Large language model training method, device, equipment and storage medium

Info

Publication number
CN117272052B
Authority
CN
China
Prior art keywords
language model
large language
data
data set
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311557959.0A
Other languages
Chinese (zh)
Other versions
CN117272052A (en)
Inventor
张程剀
刘泽恩
刘晓华
陈小梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyong Technology Co ltd
Original Assignee
Beijing Yiyong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyong Technology Co ltd filed Critical Beijing Yiyong Technology Co ltd
Priority to CN202311557959.0A priority Critical patent/CN117272052B/en
Publication of CN117272052A publication Critical patent/CN117272052A/en
Application granted granted Critical
Publication of CN117272052B publication Critical patent/CN117272052B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present disclosure provides a large language model training method, apparatus, device, and storage medium. The large language model training method based on loss similarity comprises the following steps: a data preprocessing step of constructing a high-precision data set and a low-precision data set; a data classification step of classifying the low-precision data set into n sub-data sets and classifying the high-precision data set into a development data set and a test data set; a data analysis step of screening one or more sub-data sets based on the loss similarity between the development data set and the n sub-data sets; a model updating step, including replacing the basic large language model with the better-performing candidate large language model as an updated basic large language model; and a training step of repeating the data analysis and model updating and outputting the updated basic large language model as the final large language model. The method can optimize a large language model while reducing the need for large amounts of manually annotated, accurate efficacy evaluation data.

Description

Large language model training method, device, equipment and storage medium
Technical Field
The present disclosure relates generally to the field of data processing, and in particular, to a large language model training method based on loss similarity.
Background
Evaluation of therapeutic efficacy after tumor treatment is the basis for adjusting a treatment scheme and is also an objective index for comparing the effects of different treatment schemes. The main efficacy evaluation criteria for solid tumors are the WHO criteria and the Response Evaluation Criteria In Solid Tumors (RECIST). At present, RECIST has replaced the originally adopted WHO criteria and has become the generally accepted efficacy evaluation standard in international oncology.
At present, intelligent healthcare still faces pain points such as limited high-quality medical resources. How to use artificial intelligence to relieve the working pressure of doctors (especially the heavy workload of oncologists), so that medical resources are allocated more reasonably and efficiently and the shortage of medical resources is alleviated, is an important research direction.
A large language model is an artificial intelligence model that aims to understand and generate human language. Such models are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and so on. The emergent capabilities of current large language models, including in-context learning, instruction following, and step-by-step reasoning, make them ideal tools for tumor efficacy evaluation.
Because oncologists are in short supply, problems such as limited medical resources, heavy workload and high pressure, the multidimensional nature of tumor efficacy evaluation, and the large volume and high dimensionality of medical texts/data easily lead to missing or improper decisions in tumor efficacy evaluation.
The context understanding and reasoning capabilities of large language models on medical text can support automated tumor efficacy evaluation, improving the accuracy of the evaluation while reducing the workload and pressure of oncologists. However, because a basic large language model is not optimized for tumor efficacy evaluation, using it directly does not provide reliable efficacy assessments. Meanwhile, task-specific optimization of a large language model requires a large amount of high-quality efficacy evaluation data, and preparing such data increases the burden and pressure on oncologists. How to ensure the optimization effect of a large language model while reducing the need for large amounts of manually annotated, accurate efficacy evaluation data has therefore become an important research direction.
Disclosure of Invention
Tumor efficacy evaluation currently has relatively mature evaluation criteria. However, because of the multidimensional nature of tumor efficacy evaluation, the large volume and high dimensionality of medical texts/data, and the heavy workload of oncologists, missing or improper efficacy evaluation decisions easily occur. Large language models have strong summarization, logical understanding, and reasoning capabilities, making them ideal tools for automatically assisting efficacy evaluation and reducing the workload of oncologists.
In order to better customize a large language model for tumor efficacy evaluation, the present disclosure provides a large language model training method based on loss similarity: one or more sub-data sets are screened out by calculating the loss similarity between a development data set and a low-precision data set; the low-precision data set and the screened sub-data set(s) are used to adjust a basic large language model to obtain candidate large language models; the basic large language model is then replaced by the better-performing candidate; and the above procedure is repeated until the large language model that performs best in the tumor efficacy evaluation scenario is finally obtained.
According to an aspect of the present disclosure, there is provided a large language model training method based on loss similarity, the method including: a data preprocessing step of partially sampling full-sample medical data to obtain partially sampled medical data and labeling the partially sampled medical data to construct a high-precision data set, the full-sample medical data other than the partially sampled medical data being constructed as a low-precision data set, wherein the labeling includes labeling an efficacy evaluation result; a data classification step of classifying the low-precision data set into n sub-data sets and classifying the high-precision data set into a development data set and a test data set; a data analysis step of analyzing the n sub-data sets to screen one or more sub-data sets from the n sub-data sets based on the loss similarity between the development data set and the n sub-data sets; a model updating step, comprising: adjusting the basic large language model using the low-precision data set to obtain a first candidate large language model, and adjusting the basic large language model using the screened one or more sub-data sets to obtain a second candidate large language model; testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set; and replacing the basic large language model with the better-performing candidate large language model as an updated basic large language model; and a training step of repeating the data analysis step and the model updating step until the updated basic large language model performs better than both the first candidate large language model and the second candidate large language model, and outputting the updated basic large language model as the final large language model.
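The overall flow can be illustrated with the following minimal Python sketch. All helper names (split_into_sub_datasets, screen_sub_datasets, fine_tune, score_on_test_set, and so on) are placeholders introduced here for illustration and are not part of the disclosure; they stand for the data classification, data analysis, adjustment, and testing operations described above.

```python
# Minimal sketch of the training loop described above. Every helper used here
# is an illustrative placeholder to be filled in with a concrete implementation.

def train_large_language_model(base_model, low_precision_set, high_precision_set):
    # Data classification step: n sub-data sets plus development/test sets.
    sub_datasets = split_into_sub_datasets(low_precision_set)
    dev_set, test_set = split_development_and_test(high_precision_set)

    while True:
        # Data analysis step: keep sub-data sets whose loss falls in a
        # direction similar to that of the development set.
        screened = screen_sub_datasets(base_model, sub_datasets, dev_set)

        # Model updating step: two candidates, compared on the test set.
        candidate_1 = fine_tune(base_model, low_precision_set)
        candidate_2 = fine_tune(base_model, screened)
        scored = [(score_on_test_set(m, test_set), m)
                  for m in (base_model, candidate_1, candidate_2)]
        best_score, best_model = max(scored, key=lambda pair: pair[0])

        # Training step: stop once the current base model beats both candidates.
        if best_model is base_model:
            return base_model
        base_model = best_model
        sub_datasets = screened  # retained sub-data sets replace the n sets
```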
In accordance with an embodiment of the present disclosure, in the data analysis step, screening the one or more sub-data sets includes: calculating a first loss value on each of the n sub-data sets and a second loss value on the development data set; calculating a loss similarity from the first loss value and the second loss value on each sub-data set; and screening, based on the loss similarity, the one or more sub-data sets of the n sub-data sets whose loss descent direction is similar to that of the development data set.
According to an embodiment of the present disclosure, in the model updating step, testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set includes: for the medical samples corresponding to each tumor type in the test data set, comparing whether the efficacy evaluation results output by the basic large language model, the first candidate large language model, and the second candidate large language model are consistent with the efficacy evaluation results labeled in the test data set; scoring a model whose efficacy evaluation result is consistent and not scoring a model whose efficacy evaluation result is inconsistent; and calculating, over all tumor types in the test data set, a total score for the basic large language model, the first candidate large language model, and the second candidate large language model, wherein the total score is used to indicate the performance of the model.
According to an embodiment of the present disclosure, in the data analysis step, the first loss value after z rounds of iteration for each sub-data set is calculated by formula (1):
C_n = L(θ_z, x_n, y_n), with θ_{z+1} = θ_z − α·∇_θ L(θ_z, x_n, y_n), (1)
wherein L(θ, x_n, y_n) denotes the loss value of the model for the inputs x_n and y_n, θ is the parameter of the basic model, α is the learning rate, x_n is the nth sub-data set, and y_n is the efficacy evaluation result corresponding to the nth sub-data set.
According to the embodiment of the present disclosure, the learning rate has a value ranging from 0.0001 to 0.1.
In accordance with an embodiment of the present disclosure, in the data analysis step, the second loss value for the development data set in each round of iteration is calculated by formula (2):
C_high-precision = L(θ_z, x_high-precision, y_high-precision), (2)
wherein x_high-precision is the development data set in the high-precision data set and y_high-precision is the efficacy evaluation result corresponding to the development data set.
According to an embodiment of the present disclosure, in the data analysis step, the loss similarity between the first loss value and the second loss value is calculated using a cosine function, Cosine(C_high-precision, C_n).
According to an embodiment of the present disclosure, screening one or more sub-data sets includes: when the loss similarity is greater than 0, indicating that the loss descent direction of the nth sub-data set is similar to that of the development data set, the nth sub-data set is retained.
According to an embodiment of the present disclosure, screening one or more sub-data sets includes: when the loss similarity is less than or equal to 0, indicating that the loss descent direction of the nth sub-data set is not similar to that of the development data set, the nth sub-data set is discarded.
According to an embodiment of the present disclosure, in the training step, the retained sub-data sets are used instead of the n sub-data sets when the data analysis step is repeated.
According to an embodiment of the present disclosure, in the data preprocessing step, constructing the low-precision data set includes: acquiring evaluation baseline information from the full-sample medical data, wherein the evaluation baseline information includes the tumor type, the medical examination for the tumor type, the treatment scheme for the tumor type, and the efficacy evaluation; determining an efficacy evaluation result from the efficacy evaluation based on the Response Evaluation Criteria In Solid Tumors (RECIST); and combining the tumor type, medical examination, treatment scheme, and efficacy evaluation result to construct the low-precision data set.
According to an embodiment of the present disclosure, in the data preprocessing step, constructing the high-precision data set includes: partially sampling the full-sample medical data so that a medical sample exists for each tumor type and each corresponding evaluation criterion type, the evaluation criterion types including the four efficacy evaluation results in RECIST: complete remission (CR), partial remission (PR), stable disease (SD), and disease progression (PD); labeling the partially sampled medical data according to RECIST, wherein the labeling includes labeling one of CR, PR, SD, and PD as the efficacy evaluation result; and combining the tumor type, the medical examination, the treatment scheme, and the labeled efficacy evaluation result to construct the high-precision data set.
According to another aspect of the present disclosure, there is provided a large language model training apparatus based on loss similarity, the apparatus comprising: a data preprocessing module configured to partially sample full-sample medical data to obtain partially sampled medical data and to label the partially sampled medical data to construct a high-precision data set, the full-sample medical data other than the partially sampled medical data being constructed as a low-precision data set, wherein the labeling includes labeling an efficacy evaluation result; a data classification module configured to classify the low-precision data set into n sub-data sets and to classify the high-precision data set into a development data set and a test data set; a data analysis module configured to analyze the n sub-data sets to screen one or more sub-data sets from the n sub-data sets based on the loss similarity between the development data set and the n sub-data sets; a model update module configured to: adjust the basic large language model using the low-precision data set to obtain a first candidate large language model, and adjust the basic large language model using the screened one or more sub-data sets to obtain a second candidate large language model; test the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set; and replace the basic large language model with the better-performing candidate large language model as an updated basic large language model; and a training module configured to repeat the operations of the data analysis module and the model update module until the updated basic large language model performs better than both the first candidate large language model and the second candidate large language model, and to output the updated basic large language model as the final large language model.
According to yet another aspect of the present disclosure, there is provided a computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, implement the above-described large language model training method based on loss similarity.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-readable instructions that, when executed by a processor, implement the above-described large language model training method based on loss similarity.
Thus, according to the large language model training method, apparatus, device, and storage medium based on loss similarity of the embodiments of the present disclosure, a high-precision data set is obtained by partially sampling desensitized patient information and adding physician-reviewed labels, and is further classified into a development data set and a test data set; a low-precision data set is constructed from the full-sample medical data other than the sampled medical data; a screened data set is obtained by calculating the loss similarity between the low-precision data set and the development data set; the large language model is adjusted with the low-precision data set and with the screened data set respectively to obtain two candidate large language models, and a new large language model is selected using the test data set; the above flow is repeated, finally yielding the large language model that performs best in the tumor efficacy evaluation scenario.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. It should be apparent that the drawings in the following description are only some exemplary embodiments of the present disclosure, and that other drawings may be obtained from these drawings by those of ordinary skill in the art without undue effort.
FIG. 1 illustrates a flow chart of a large language model training method based on loss similarity according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic block diagram of training a large language model based on loss similarity in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of constructing a high-precision data set and a low-precision data set according to an embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of a large language model training apparatus based on loss similarity according to an embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of a method of obtaining efficacy assessment results based on a large language model, according to an embodiment of the present disclosure;
FIG. 6 illustrates a block diagram of an apparatus for obtaining efficacy assessment results based on a large language model, according to an embodiment of the present disclosure; and
Fig. 7 illustrates a block diagram of a computer device, according to some embodiments of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
A flowchart is used in this disclosure to describe the steps of a method according to an embodiment of the present disclosure. It should be understood that the preceding or following steps are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
In the description and drawings of the present disclosure, elements are described in the singular or plural form according to an embodiment. However, the singular and plural forms are properly selected for the proposed case only for convenience of explanation and are not intended to limit the present disclosure thereto. Accordingly, the singular may include the plural and the plural may include the singular unless the context clearly indicates otherwise.
The method, apparatus, device and storage medium for training a large language model based on loss of similarity provided in the present disclosure will be described in detail with reference to the accompanying drawings.
First embodiment
Fig. 1 shows a flowchart S100 of a large language model training method based on loss similarity according to a first embodiment of the present disclosure. FIG. 2 shows a schematic block diagram of training a large language model based on loss similarity according to an embodiment of the present disclosure. The steps of the large language model training method based on loss similarity according to the embodiment of the present disclosure will be described in detail with reference to fig. 1 and 2.
First, as shown in fig. 1, the large language model training method based on loss similarity includes a data preprocessing step S102, a data classification step S104, a data analysis step S106, a model updating step S108, and a training step S110.
In the data preprocessing step S102 of fig. 1, the full-sample medical data may be partially sampled to obtain partially sampled medical data, and the partially sampled medical data may be labeled to construct a high-precision data set, and the full-sample medical data other than the partially sampled medical data may be constructed as a low-precision data set, wherein labeling includes labeling the efficacy evaluation result.
In one example, the full-sample medical data may be desensitized patient information obtained by desensitizing patient medical record data.
As shown in fig. 2, the desensitized patient information is used as the full-sample medical data and is partially sampled; a physician then labels the partially sampled desensitized patient information with the efficacy evaluation result, and the physician-labeled, partially sampled desensitized patient information can be used to construct a high-precision data set. In addition, the full-sample medical data other than the partially sampled desensitized patient information may be used to construct a low-precision data set.
In accordance with an embodiment of the present disclosure, in the data preprocessing step S102, constructing the low-precision data set may include: acquiring evaluation baseline information from the full-sample medical data, wherein the evaluation baseline information may include the tumor type, the medical examination for the tumor type, the treatment scheme for the tumor type, and the efficacy evaluation; determining an efficacy evaluation result from the efficacy evaluation based on the Response Evaluation Criteria In Solid Tumors (RECIST); and combining the tumor type, medical examination, treatment scheme, and efficacy evaluation result to construct the low-precision data set. In addition, in the data preprocessing step S102, constructing the high-precision data set may include: partially sampling the full-sample medical data so that a medical sample exists for each tumor type and each corresponding evaluation criterion type, the evaluation criterion types including the four efficacy evaluation results in RECIST: complete remission (CR), partial remission (PR), stable disease (SD), and disease progression (PD); labeling the partially sampled medical data according to RECIST, wherein the labeling includes labeling one of CR, PR, SD, and PD as the efficacy evaluation result; and combining the tumor type, the medical examination, the treatment scheme, and the labeled efficacy evaluation result to construct the high-precision data set.
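For illustration only, the two data sets can be thought of as collections of records sharing the same fields. The following Python sketch uses field names chosen for this example; they are assumptions, not names fixed by the disclosure:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative record layout for both data sets; the field names are
# assumptions made for this sketch, not names fixed by the disclosure.
@dataclass
class EfficacyRecord:
    tumor_type: str        # e.g. "primary lung cancer"
    medical_exam: str      # e.g. "CT examination"
    treatment_plan: str    # e.g. "targeted drug"
    efficacy_result: str   # one of "CR", "PR", "SD", "PD"
    reasoning: Optional[str] = None  # physician's evidence/reasoning (high-precision only)

def build_low_precision_record(baseline: dict) -> EfficacyRecord:
    # Low-precision records are derived automatically from evaluation baseline
    # information, so no physician reasoning is attached.
    return EfficacyRecord(
        tumor_type=baseline["tumor_type"],
        medical_exam=baseline["medical_exam"],
        treatment_plan=baseline["treatment_plan"],
        efficacy_result=baseline["efficacy_result"],
    )

def build_high_precision_record(baseline: dict, annotation: dict) -> EfficacyRecord:
    # High-precision records carry the oncologist's labeled result and reasoning.
    return EfficacyRecord(
        tumor_type=baseline["tumor_type"],
        medical_exam=baseline["medical_exam"],
        treatment_plan=baseline["treatment_plan"],
        efficacy_result=annotation["efficacy_result"],
        reasoning=annotation.get("reasoning"),
    )
```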
This construction is described below in connection with fig. 3, which shows a flow chart of constructing a high-precision data set and a low-precision data set according to an embodiment of the present disclosure.
As shown in fig. 3, the desensitized patient information may include imaging examinations, pathology examinations, surgical treatments, radiation treatments, and the like. Imaging examinations typically include X-ray imaging, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound imaging (US), and nuclear medicine imaging (ECT). Pathology examinations generally include conventional pathomorphological examinations, such as exfoliative cytology and biopsy, as well as newer examination methods including immunohistochemistry, electron microscopy, flow cytometry, image analysis techniques, and molecular biology techniques.
Evaluation baseline information in the full-sample medical data may be obtained from the desensitized patient information and may include the tumor type, the medical examination for the tumor type, the treatment scheme for the tumor type, and the efficacy evaluation. For example, consider a segment of a desensitized patient medical record: "On September 8, 2022, the patient was first seen at our hospital because of repeated cough and chest pain. A CT examination of the lung showed a solid tumor with a maximum diameter of 5.8 cm. After detailed analysis and multidisciplinary discussion, targeted drug therapy was chosen. After two months of treatment, the patient was re-examined on November 8, 2022, and CT showed that the maximum tumor diameter had decreased to 3.5 cm." Evaluation baseline information can be obtained from this record and may include: primary lung cancer, CT examination, targeted drug, and tumor reduction.
At present, RECIST is the generally accepted efficacy evaluation standard in international oncology. RECIST is mainly used to record changes in tumor size and divides the response into four categories: complete remission (CR), partial remission (PR), stable disease (SD), and disease progression (PD). Based on RECIST, the "tumor reduction" in the above efficacy evaluation can be determined as "partial remission (PR)". Thus, for the example medical record described above, "primary lung cancer", "CT examination", "targeted drug", and "partial remission (PR)" can be combined to construct a low-precision data set entry.
Referring to fig. 2 and 3, the desensitized patient information is first sampled so that a medical sample exists for each tumor type and each corresponding evaluation criterion type. The oncologist then labels the sampled samples according to RECIST with: 1) the evaluation criterion type; and 2) the supporting evidence and logical reasoning process for judging that evaluation criterion type. For example, consider a segment of a desensitized patient medical record: "On March 1, 2023, the patient was first seen at our hospital because of abdominal discomfort. A CT examination showed a solid tumor of the liver with a maximum diameter of 6.0 cm. After detailed analysis and multidisciplinary discussion, chemotherapy regimen A was chosen. After two months of treatment, the patient was re-examined on May 1, 2023, and CT showed that the maximum tumor diameter had decreased to 4.5 cm." The oncologist annotates this record as follows. Efficacy evaluation conclusion: the patient's therapeutic response can be classified as "partial remission (PR)" according to the RECIST criteria. Evidence and reasoning process: first, we refer to the results of the CT examinations before and after treatment. The examination before the start of treatment showed that the maximum diameter of the solid liver tumor was 6.0 cm. After two months of chemotherapy regimen A, the re-examination on May 1, 2023 showed that the maximum tumor diameter had decreased to 4.5 cm. Next, we apply the RECIST criteria for efficacy evaluation. According to these criteria, if the overall diameter of the target lesion decreases by 30% or more, the response may be defined as "partial remission (PR)". In this case, the maximum diameter of the tumor decreased from 6.0 cm to 4.5 cm, a reduction of 25%, and the "partial remission (PR)" criterion was met. Thus, for the example medical record described above, "liver cancer", "CT examination", "chemotherapy regimen A", and the physician-labeled "partial remission (PR)" can be combined to construct a high-precision data set entry.
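The diameter-based RECIST determination illustrated above can be sketched as a small Python function. This is a simplified, single-lesion illustration; RECIST 1.1 is actually defined on the sum of diameters of all target lesions and includes further conditions (for example, new lesions and a minimum absolute increase for progression) that are omitted here, so the thresholds other than the 30% reduction quoted above are assumptions taken from the commonly used RECIST 1.1 definitions rather than from this disclosure:

```python
def recist_category(baseline_diam_cm: float, current_diam_cm: float,
                    lesion_disappeared: bool = False) -> str:
    """Map a change in target-lesion diameter to a RECIST category (simplified)."""
    if lesion_disappeared:
        return "CR"  # complete remission: all target lesions disappear
    change = (current_diam_cm - baseline_diam_cm) / baseline_diam_cm
    if change <= -0.30:
        return "PR"  # partial remission: diameter reduced by 30% or more
    if change >= 0.20:
        return "PD"  # disease progression: diameter increased by 20% or more
    return "SD"      # stable disease: neither PR nor PD

# Example from the first medical record above: 5.8 cm shrinking to 3.5 cm (~40% decrease)
print(recist_category(5.8, 3.5))  # -> "PR"
```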
From the above procedure of constructing a low-precision data set and a high-precision data set, it can be found that: the number of samples of the low-precision data set is large, and the accuracy of the curative effect evaluation is low; the high-precision data set is obtained by labeling part of sampled samples by a tumor specialist, so that the number of the samples is small, and the accuracy of the curative effect evaluation is high.
In the data classifying step S104 of fig. 1, the low-precision data set may be classified into n sub-data sets, and the high-precision data set may be classified into a development data set and a test data set.
As shown in fig. 2, the low-precision data set is numbered and classified into a sub-data set 1, sub-data set 2, … sub-data set n. In addition, the high-precision data set is classified into a development data set for efficacy evaluation development and a test data set for efficacy evaluation test.
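A possible way to perform this split is sketched below; the number of sub-data sets n and the development/test proportion are illustrative choices, since the disclosure does not fix particular values:

```python
import random

def classify_data(low_precision, high_precision, n=10, dev_fraction=0.5, seed=0):
    """Number the low-precision records into n sub-data sets and split the
    high-precision records into a development set and a test set.
    (n and dev_fraction are illustrative choices, not values from the disclosure.)"""
    rng = random.Random(seed)
    shuffled = list(low_precision)
    rng.shuffle(shuffled)
    sub_datasets = [shuffled[i::n] for i in range(n)]  # sub-data set 1 .. n

    high = list(high_precision)
    rng.shuffle(high)
    cut = int(len(high) * dev_fraction)
    dev_set, test_set = high[:cut], high[cut:]
    return sub_datasets, dev_set, test_set
```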
In the data analysis step S106 of fig. 1, the n sub-data sets may be analyzed to screen one or more sub-data sets from the n sub-data sets based on the loss similarity between the development data set and the n sub-data sets.
In accordance with an embodiment of the present disclosure, in the data analysis step S106, screening one or more sub-data sets may include: calculating a first loss value on each of the n sub-data sets and a second loss value on the development data set; calculating a loss similarity from the first loss value and the second loss value on each sub-data set; and screening, based on the loss similarity, the one or more sub-data sets of the n sub-data sets whose loss descent direction is similar to that of the development data set.
For example, the first loss value after z rounds of iteration for each sub-data set may be calculated by formula (1):
C_n = L(θ_z, x_n, y_n), with θ_{z+1} = θ_z − α·∇_θ L(θ_z, x_n, y_n), (1)
wherein L(θ, x_n, y_n) denotes the loss value of the model for the inputs x_n and y_n, θ is the parameter of the basic model, α is the learning rate with a value ranging from 0.0001 to 0.1, x_n is the nth sub-data set, and y_n is the efficacy evaluation result corresponding to the nth sub-data set.
Then, the second loss value for the development data set in each round of iteration is calculated by formula (2):
C_high-precision = L(θ_z, x_high-precision, y_high-precision), (2)
wherein x_high-precision is the development data set in the high-precision data set and y_high-precision is the efficacy evaluation result corresponding to the development data set.
According to embodiments of the present disclosure, the cosine function may be used to calculate the loss similarity between the first loss value and the second loss value, Cosine(C_high-precision, C_n).
In other embodiments, the loss similarity may also be calculated by algorithms other than the cosine function. It should be appreciated by those skilled in the art that the algorithm for calculating the loss similarity may include cosine similarity, adjusted cosine similarity, the Pearson correlation coefficient, the Jaccard similarity coefficient, the Tanimoto coefficient, log-likelihood similarity, and the like.
According to an embodiment of the present disclosure, screening one or more sub-data sets may include: when the loss similarity is greater than 0, indicating that the loss descent direction of the nth sub-data set is similar to that of the development data set, the nth sub-data set is retained; when the loss similarity is less than or equal to 0, indicating that the loss descent direction of the nth sub-data set is not similar to that of the development data set, the nth sub-data set is discarded.
As shown in fig. 2, a loss similarity analysis may be performed by calculating the loss similarity from the first loss value on each sub-data set and the second loss value on the development data set, and screening out the one or more sub-data sets whose loss descent direction is similar to that of the development data set based on the loss similarity analysis.
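The screening can be sketched in Python as follows. The disclosure's formulas give the per-iteration loss values; this sketch additionally assumes that their per-round changes are what is compared as the "loss descent direction", and that each data set is iterated on with a caller-supplied update rule, so the pairing of the two trajectories and the helper signatures are assumptions made for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom > 0.0 else 0.0

def screen_sub_datasets(loss_fn, update_fn, theta, sub_datasets, dev_set, z=5):
    """Retain the sub-data sets whose loss descends in a direction similar to
    the development set.

    loss_fn(theta, data) returns a scalar loss; update_fn(theta, data) returns
    the parameters after one gradient step with learning rate alpha.  Both are
    supplied by the caller, so no concrete model is fixed here.
    """
    def loss_changes(data):
        params, losses = theta, [loss_fn(theta, data)]
        for _ in range(z):
            params = update_fn(params, data)        # theta_{z+1} = theta_z - alpha * grad
            losses.append(loss_fn(params, data))    # loss value after this round
        return np.diff(np.asarray(losses))          # per-round change in loss

    dev_changes = loss_changes(dev_set)              # second loss values (development set)
    retained = []
    for sub in sub_datasets:                         # first loss values (each sub-data set)
        if cosine_similarity(loss_changes(sub), dev_changes) > 0:
            retained.append(sub)                     # similar descent direction: keep
        # similarity <= 0: descent directions differ, so the sub-data set is discarded
    return retained
```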
The model updating step S108 of fig. 1 may include: adjusting the basic large language model using the low-precision data set to obtain a first candidate large language model, and adjusting the basic large language model using the screened one or more sub-data sets to obtain a second candidate large language model; testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set; and replacing the basic large language model with the better-performing candidate large language model as an updated basic large language model.
According to an embodiment of the present disclosure, in the model updating step S108, testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set may include: for the medical samples corresponding to each tumor type in the test data set, comparing whether the efficacy evaluation results output by the basic large language model, the first candidate large language model, and the second candidate large language model are consistent with the efficacy evaluation results labeled in the test data set; scoring a model whose efficacy evaluation result is consistent and not scoring a model whose efficacy evaluation result is inconsistent; and calculating, over all tumor types in the test data set, a total score for the basic large language model, the first candidate large language model, and the second candidate large language model, wherein the total score is used to indicate the performance of the model.
For example, the desensitized patient information may contain multiple course-of-disease records. For the first course record, the physician-labeled test data set shows the efficacy evaluation result "partial remission (PR)"; the efficacy evaluation result output by the basic large language model is "none"; the efficacy evaluation result output by the first candidate large language model is "disease progression (PD)"; and the efficacy evaluation result output by the second candidate large language model is "partial remission (PR)". Thus, the basic large language model and the first candidate large language model each score 0 points, and the second candidate large language model scores 1 point. Because the desensitized patient information contains multiple course records, total scores of the basic large language model, the first candidate large language model, and the second candidate large language model are calculated over all course records, and the candidate large language model with the highest total score then replaces the basic large language model as the updated basic large language model.
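This scoring scheme can be sketched as follows, reusing the illustrative EfficacyRecord fields from the earlier sketch; each model is represented by a prediction callable that returns "CR", "PR", "SD", "PD", or "none" for a record (these interfaces are assumptions made for illustration):

```python
def score_model(model_predict, test_set) -> int:
    # One point for every test record whose predicted efficacy evaluation result
    # matches the physician-labeled result; no points otherwise.
    return sum(1 for record in test_set
               if model_predict(record) == record.efficacy_result)

def pick_updated_base_model(candidate_predictors, test_set):
    # Return the prediction callable with the highest total score on the test
    # set; the corresponding model replaces the basic large language model.
    return max(candidate_predictors,
               key=lambda predict: score_model(predict, test_set))
```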
In the training step S110 of fig. 1, the data analysis step S106 and the model updating step S108 may be repeated until the updated basic large language model performs better than both the first candidate large language model and the second candidate large language model, and the updated basic large language model is output as the final large language model. In repeating the data analysis step S106, the n sub-data sets may be replaced with the retained sub-data set.
For example, for a cold-start version, a basic large language model is used. After the basic large language model has been trained with the loss-similarity-based training method, it is replaced by the best-performing second candidate large language model, and the n sub-data sets are replaced by the m screened sub-data sets for the next iteration, where m is less than n.
The large language model training method based on loss similarity has been described in detail above with reference to figs. 1 to 3: one or more sub-data sets are screened out by calculating the loss similarity between the development data set and the low-precision data set; the basic large language model is adjusted with the low-precision data set and with the screened one or more sub-data sets to obtain candidate large language models; the basic large language model is then replaced by the better-performing candidate; and the above procedure is repeated until the large language model that performs best in the tumor efficacy evaluation scenario is finally obtained.
Second embodiment
In addition to the large language model training method based on loss similarity described above, the present disclosure provides a large language model training apparatus based on loss similarity, which will be described in detail below with reference to fig. 4.
Fig. 4 shows a block diagram of a large language model training apparatus 400 based on loss similarity according to a second embodiment of the present disclosure.
As shown in fig. 4, the large language model training apparatus 400 based on loss similarity includes a data preprocessing module 402, a data classification module 404, a data analysis module 406, a model update module 408, and a training module 410.
In the data preprocessing module 402 of fig. 4, the full-sample medical data may be partially sampled to obtain partially sampled medical data, and the partially sampled medical data may be labeled to construct a high-precision dataset, and the full-sample medical data other than the partially sampled medical data may be constructed as a low-precision dataset, wherein labeling includes labeling the efficacy assessment results.
In accordance with an embodiment of the present disclosure, in the data preprocessing module 402, constructing the low-precision data set may include: acquiring evaluation baseline information from the full-sample medical data, wherein the evaluation baseline information may include the tumor type, the medical examination for the tumor type, the treatment scheme for the tumor type, and the efficacy evaluation; determining an efficacy evaluation result from the efficacy evaluation based on the Response Evaluation Criteria In Solid Tumors (RECIST); and combining the tumor type, medical examination, treatment scheme, and efficacy evaluation result to construct the low-precision data set. In addition, in the data preprocessing module 402, constructing the high-precision data set may include: partially sampling the full-sample medical data so that a medical sample exists for each tumor type and each corresponding evaluation criterion type, the evaluation criterion types including the four efficacy evaluation results in RECIST: complete remission (CR), partial remission (PR), stable disease (SD), and disease progression (PD); labeling the partially sampled medical data according to RECIST, wherein the labeling includes labeling one of CR, PR, SD, and PD as the efficacy evaluation result; and combining the tumor type, the medical examination, the treatment scheme, and the labeled efficacy evaluation result to construct the high-precision data set.
In the data classification module 404 of fig. 4, the low-precision data set may be classified into n sub-data sets, and the high-precision data set may be classified into a development data set and a test data set.
In the data analysis module 406 of fig. 4, the n sub-data sets may be analyzed to screen one or more sub-data sets from the n sub-data sets based on the loss similarity between the development data set and the n sub-data sets.
In accordance with an embodiment of the present disclosure, in the data analysis module 406, screening one or more sub-data sets may include: calculating a first loss value on each of the n sub-data sets and a second loss value on the development data set; calculating a loss similarity from the first loss value and the second loss value on each sub-data set; and screening, based on the loss similarity, the one or more sub-data sets of the n sub-data sets whose loss descent direction is similar to that of the development data set.
For example, the first loss value after z rounds of iteration for each sub-data set may be calculated by formula (1):
C_n = L(θ_z, x_n, y_n), with θ_{z+1} = θ_z − α·∇_θ L(θ_z, x_n, y_n), (1)
wherein L(θ, x_n, y_n) denotes the loss value of the model for the inputs x_n and y_n, θ is the parameter of the basic model, α is the learning rate with a value ranging from 0.0001 to 0.1, x_n is the nth sub-data set, and y_n is the efficacy evaluation result corresponding to the nth sub-data set.
Then, the second loss value for the development data set in each round of iteration is calculated by formula (2):
C_high-precision = L(θ_z, x_high-precision, y_high-precision), (2)
wherein x_high-precision is the development data set in the high-precision data set and y_high-precision is the efficacy evaluation result corresponding to the development data set.
According to embodiments of the present disclosure, a cosine function may be utilized to calculate the loss similarity between the first loss value and the second loss value, Cosine(C_high-precision, C_n).
According to an embodiment of the present disclosure, screening one or more sub-data sets may include: when the loss similarity is greater than 0, indicating that the loss descent direction of the nth sub-data set is similar to that of the development data set, the nth sub-data set is retained; when the loss similarity is less than or equal to 0, indicating that the loss descent direction of the nth sub-data set is not similar to that of the development data set, the nth sub-data set is discarded.
The model update module 408 of fig. 4 may be configured to: adjust the basic large language model using the low-precision data set to obtain a first candidate large language model, and adjust the basic large language model using the screened one or more sub-data sets to obtain a second candidate large language model; test the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set; and replace the basic large language model with the better-performing candidate large language model as an updated basic large language model.
According to an embodiment of the present disclosure, in the model update module 408, testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set may include: for the medical samples corresponding to each tumor type in the test data set, comparing whether the efficacy evaluation results output by the basic large language model, the first candidate large language model, and the second candidate large language model are consistent with the efficacy evaluation results labeled in the test data set; scoring a model whose efficacy evaluation result is consistent and not scoring a model whose efficacy evaluation result is inconsistent; and calculating, over all tumor types in the test data set, a total score for the basic large language model, the first candidate large language model, and the second candidate large language model, wherein the total score is used to indicate the performance of the model.
In the training module 410 of fig. 4, the operations of the data analysis module 406 and the model update module 408 may be repeated until the updated basic large language model performs better than both the first candidate large language model and the second candidate large language model, and the updated basic large language model is output as the final large language model. The retained sub-data sets may be used instead of the n sub-data sets when the operations of the data analysis module 406 are repeated.
Fig. 5 illustrates a flowchart S500 of a method of obtaining efficacy assessment results based on a large language model according to an embodiment of the present disclosure.
As shown in fig. 5, the method of obtaining the efficacy evaluation result based on the loss similarity includes a data preprocessing step S102, a data classifying step S104, a data analyzing step S106, a model updating step S108, a training step S110, and an efficacy evaluation step S512. Steps S102 to S110 in fig. 5 are the same as or similar to steps S102 to S110 in fig. 1, and are not described here again.
In the efficacy evaluation step S512, the medical text may be input to the trained large language model to obtain an efficacy evaluation result corresponding to the medical text.
Fig. 6 illustrates a block diagram of an apparatus 600 for obtaining efficacy assessment results based on a large language model, according to an embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 for obtaining efficacy evaluation results based on a large language model includes a data preprocessing module 402, a data classification module 404, a data analysis module 406, a model update module 408, a training module 410, and an efficacy evaluation module 612. The modules 402 to 410 in fig. 6 are identical or similar to the modules 402 to 410 of fig. 4, and are not described in detail herein.
In the efficacy evaluation module 612, the medical text can be input to a trained large language model to obtain efficacy evaluation results corresponding to the medical text.
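At inference time, the interaction with the trained model can be sketched as follows; the prompt wording and the generate interface are illustrative assumptions, not text or an API defined by the disclosure:

```python
PROMPT_TEMPLATE = (
    "Below is a desensitized medical record. According to the RECIST criteria, "
    "give the efficacy evaluation result (CR, PR, SD, or PD) together with the "
    "supporting evidence.\n\nMedical record:\n{medical_text}\n\nEfficacy evaluation:"
)

def evaluate_efficacy(generate, medical_text: str) -> str:
    # `generate` is any text-generation callable (prompt -> completion) wrapping
    # the trained large language model.
    return generate(PROMPT_TEMPLATE.format(medical_text=medical_text))
```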
Fig. 7 shows a block diagram of a computer device according to an embodiment of the present disclosure.
Referring to fig. 7, a computer device 700 may include a processor 701 and a memory 702. The processor 701 and the memory 702 may be connected by a bus 703. The computer device 700 may be any type of portable device (e.g., smart camera, smart phone, tablet, etc.) or any type of stationary device (e.g., desktop computer, server, etc.).
The processor 701 may perform various actions and processes in accordance with programs stored in the memory 702. In particular, the processor 701 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
The memory 702 stores computer-executable instructions that, when executed by the processor 701, implement the above-described large language model training method based on loss similarity. The memory 702 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct memory bus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Further, the large language model training method based on loss similarity according to the present disclosure may be stored in a computer-readable storage medium. In particular, in accordance with the present disclosure, a computer-readable storage medium may be provided having stored thereon computer-readable instructions which, when executed by a processor, may cause the processor to perform the large language model training method based on loss similarity as described above.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.

Claims (16)

1. A large language model training method based on loss similarity, wherein the method comprises:
a data preprocessing step of partially sampling full-sample medical data to obtain partially sampled medical data, and labeling the partially sampled medical data to construct a high-precision data set, the full-sample medical data other than the partially sampled medical data being constructed as a low-precision data set, wherein the labeling includes labeling an efficacy evaluation result;
A data classification step of classifying the low-precision data set into n sub-data sets, and classifying the high-precision data set into a development data set and a test data set;
a data analysis step of analyzing the n sub-data sets to screen one or more sub-data sets from the n sub-data sets based on a loss similarity between the development data set and the n sub-data sets;
a model updating step, comprising:
adjusting the basic large language model using the low-precision data set to obtain a first candidate large language model, and adjusting the basic large language model using the screened one or more sub-data sets to obtain a second candidate large language model;
testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set; and
replacing the basic large language model with the better-performing candidate large language model to serve as an updated basic large language model; and
a training step of repeating the data analysis step and the model updating step until the updated basic large language model performs better than both the first candidate large language model and the second candidate large language model, and outputting the updated basic large language model as a final large language model; wherein in the data analysis step, screening the one or more sub-data sets comprises:
Calculating a first loss value on each of the n sub-data sets and a second loss value on the development data set;
calculating a loss similarity from the first and second loss values on each sub-dataset, and
screening, based on the loss similarity, the one or more of the n sub-data sets whose loss descent direction is similar to that of the development data set.
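(Illustrative note, not part of the claims.) The data analysis, model updating, and training steps above together form a screening-and-replacement loop. A minimal Python sketch follows, assuming hypothetical helpers fine_tune, evaluate, and loss_similarity that stand in for the adjustment, testing, and loss-similarity analysis recited in claim 1:

def train_with_loss_similarity(base_model, sub_datasets, dev_set, test_set,
                               fine_tune, evaluate, loss_similarity):
    """Illustrative loop over the data analysis, model updating, and training steps."""
    while True:
        # Data analysis step: retain sub-data sets whose loss descent direction
        # is similar to that of the development data set (similarity > 0).
        kept = [d for d in sub_datasets
                if loss_similarity(base_model, d, dev_set) > 0]

        # Model updating step: the first candidate is adjusted on the whole
        # low-precision data set, the second only on the screened sub-data sets.
        all_data = [sample for d in sub_datasets for sample in d]
        kept_data = [sample for d in kept for sample in d]
        candidate_1 = fine_tune(base_model, all_data)
        candidate_2 = fine_tune(base_model, kept_data)

        base_score = evaluate(base_model, test_set)
        score_1 = evaluate(candidate_1, test_set)
        score_2 = evaluate(candidate_2, test_set)

        # Training step: stop once the current base model outperforms both candidates.
        if base_score >= score_1 and base_score >= score_2:
            return base_model

        # Otherwise replace the base model with the better-performing candidate
        # and repeat with the retained sub-data sets (cf. claim 9).
        base_model = candidate_1 if score_1 >= score_2 else candidate_2
        sub_datasets = kept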
2. The method of claim 1, wherein in the model updating step, testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set comprises:
comparing, for the medical samples corresponding to each tumor type in the test data set, whether the efficacy evaluation results output by the basic large language model, the first candidate large language model, and the second candidate large language model are consistent with the efficacy evaluation results annotated in the test data set;
scoring a model when its efficacy evaluation result is consistent and not scoring it when its result is inconsistent; and
calculating, over all tumor types in the test data set, a total score for the basic large language model, the first candidate large language model, and the second candidate large language model, wherein the total score indicates the performance of a model.
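As an illustrative sketch of the scoring in claim 2 (the sample fields 'tumor_type', 'text', and 'label' are hypothetical), the per-tumor-type comparison reduces to counting test samples whose predicted efficacy evaluation result is consistent with the annotated one:

def score_model(predict, test_set):
    """Claim 2 sketch: one point per test sample whose predicted efficacy
    evaluation result is consistent with the annotated result.

    Assumes each sample is a dict with hypothetical keys 'tumor_type', 'text',
    and 'label' (one of 'CR', 'PR', 'SD', 'PD'); `predict` maps a medical
    text to a predicted label. Summing over all samples covers all tumor types.
    """
    total_score = 0
    for sample in test_set:
        if predict(sample["text"]) == sample["label"]:
            total_score += 1  # consistent result: score the model
        # inconsistent result: no score
    return total_score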
3. The method of claim 1, wherein in the data analysis step, the first loss value for each sub-data set after z rounds of iteration is calculated by the following formula:
wherein L(θ, x_n, y_n) denotes the loss value of the model given inputs x_n and y_n, θ is the basic model parameter, α is the learning rate, x_n is the nth sub-data set, and y_n is the efficacy evaluation result corresponding to the nth sub-data set.
4. A method according to claim 3, wherein the learning rate has a value in the range of 0.0001 to 0.1.
5. A method according to claim 3, wherein in the data analysis step, the second loss value for the development data set in each iteration is calculated by the following formula:
wherein x_high-precision is the development data set within the high-precision data set, and y_high-precision is the efficacy evaluation result corresponding to the development data set.
6. The method of claim 5, wherein in the data analysis step, the loss similarity between the first loss value and the second loss value is calculated by a cosine function Cosine(C_high-precision, C_n).
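Claims 3 to 6 can be read together as computing two loss trajectories and measuring the similarity of their descent directions. The sketch below is one possible interpretation, assuming the first and second loss values are collected over the z iterations and compared with a cosine function; grad and loss are hypothetical callables for the base model's gradient and loss on a data batch:

import numpy as np

def loss_similarity_for_subset(theta, subset, dev_set, grad, loss, alpha=0.01, z=5):
    """Sketch of claims 3-6: run z gradient-descent iterations on the nth
    sub-data set, record the loss on that sub-data set (first loss value, C_n)
    and on the development data set (second loss value, C_high-precision),
    then return the cosine similarity between the two trajectories.
    """
    c_n, c_high = [], []
    for _ in range(z):
        theta = theta - alpha * grad(theta, subset)  # parameter update on (x_n, y_n)
        c_n.append(loss(theta, subset))              # first loss value (claim 3)
        c_high.append(loss(theta, dev_set))          # second loss value (claim 5)
    c_n, c_high = np.array(c_n), np.array(c_high)
    denom = np.linalg.norm(c_n) * np.linalg.norm(c_high) + 1e-12
    return float(np.dot(c_n, c_high) / denom)        # loss similarity (claim 6)

The default learning rate of 0.01 used here lies within the 0.0001 to 0.1 range of claim 4.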
7. The method of any of claims 3 to 6, wherein screening the one or more sub-data sets comprises:
when the loss similarity is greater than 0, indicating that the loss descent direction of the nth sub-data set is similar to the loss descent direction of the development data set, retaining the nth sub-data set.
8. The method of any of claims 3 to 6, wherein screening the one or more sub-data sets comprises:
when the loss similarity is less than or equal to 0, indicating that the loss descent direction of the nth sub-data set is not similar to the loss descent direction of the development data set, discarding the nth sub-data set.
9. The method of claim 7, wherein in the training step, the n sub-data sets are replaced with the retained sub-data sets while repeating the data analysis step.
10. The method of claim 1, wherein in the data preprocessing step, constructing the low-precision data set comprises:
acquiring evaluation baseline information from the full-sample medical data, wherein the evaluation baseline information comprises a tumor type, a medical examination for the tumor type, a treatment regimen for the tumor type, and an efficacy evaluation;
determining the efficacy evaluation result from the efficacy evaluation based on the Response Evaluation Criteria in Solid Tumors (RECIST); and
combining the tumor type, the medical examination, the treatment regimen, and the efficacy evaluation result to construct the low-precision data set.
11. The method of claim 10, wherein in the data preprocessing step, constructing the high-precision data set comprises:
partially sampling the full-sample medical data such that a medical sample exists for each tumor type and each corresponding evaluation result type, the RECIST defining four types of efficacy evaluation results: complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD);
labeling the partially sampled medical data according to the RECIST, wherein the labeling comprises labeling one of CR, PR, SD, and PD as the efficacy evaluation result; and
combining the tumor type, the medical examination, the treatment regimen, and the labeled efficacy evaluation result to construct the high-precision data set.
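An informal sketch of the data construction in claims 10 and 11, with all field names hypothetical: each record combines the baseline information with one of the four RECIST results, and the high-precision subset is sampled so that every (tumor type, result) pair is represented before manual annotation:

RECIST_RESULTS = ("CR", "PR", "SD", "PD")  # complete response, partial response,
                                           # stable disease, progressive disease

def build_record(sample):
    """Claim 10 sketch: combine the baseline information into one data record."""
    return {
        "tumor_type": sample["tumor_type"],
        "examination": sample["examination"],
        "treatment": sample["treatment"],
        "efficacy_result": sample["recist_result"],  # determined per RECIST
    }

def sample_high_precision(full_data, per_bucket=1):
    """Claim 11 sketch: partially sample so that every (tumor type, RECIST
    result) pair is represented by at least `per_bucket` samples, which are
    then manually annotated to form the high-precision data set."""
    buckets = {}
    for sample in full_data:
        key = (sample["tumor_type"], sample["recist_result"])
        buckets.setdefault(key, []).append(sample)
    return [s for group in buckets.values() for s in group[:per_bucket]]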
12. A large language model training apparatus based on loss similarity, wherein the apparatus comprises:
a data preprocessing module configured to partially sample full-sample medical data to obtain partially sampled medical data, and to annotate the partially sampled medical data to construct a high-precision data set, the full-sample medical data other than the partially sampled medical data being constructed as a low-precision data set, wherein the annotating includes annotating an efficacy evaluation result;
A data classification module configured to classify the low-precision data set into n sub-data sets and the high-precision data set into a development data set and a test data set;
a data analysis module configured to analyze the n sub-data sets to screen one or more sub-data sets from the n sub-data sets based on a loss similarity between the development data set and the n sub-data sets;
a model update module configured to perform the following operations:
adjusting the basic large language model using the low-precision data set to obtain a first candidate large language model, and adjusting the basic large language model using the screened one or more sub-data sets to obtain a second candidate large language model;
testing the performance of the basic large language model, the first candidate large language model, and the second candidate large language model based on the test data set; and
replacing the basic large language model with the better-performing candidate large language model to serve as an updated basic large language model; and
a training module configured to repeat the operations of the data analysis module and the model update module until the updated basic large language model performs better than both the first candidate large language model and the second candidate large language model, and to output the updated basic large language model as a final large language model;
Wherein in the data analysis module, screening the one or more sub-data sets comprises:
calculating a first loss value on each of the n sub-data sets and a second loss value on the development data set;
calculating a loss similarity from the first and second loss values on each sub-dataset, and
screening, based on the loss similarity, the one or more of the n sub-data sets whose loss descent direction is similar to that of the development data set.
13. A method for obtaining efficacy evaluation results based on a large language model, wherein the method comprises:
an efficacy evaluation step of inputting a medical text into a trained large language model to obtain an efficacy evaluation result corresponding to the medical text,
wherein the trained large language model is obtained based on the method of any one of claims 1 to 11.
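A minimal sketch of the efficacy evaluation step in claim 13, assuming a hypothetical generate callable that wraps the final large language model's text generation interface:

def evaluate_efficacy(generate, medical_text):
    """Claim 13 sketch: obtain an efficacy evaluation result for a medical text."""
    prompt = ("Based on the following medical record, output the RECIST efficacy "
              "evaluation result (CR, PR, SD or PD):\n" + medical_text)
    return generate(prompt)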
14. An apparatus for obtaining efficacy evaluation results based on a large language model, wherein the apparatus comprises:
an efficacy evaluation module configured to input a medical text into a trained large language model to obtain an efficacy evaluation result corresponding to the medical text,
Wherein the trained large language model is obtained based on the method of any one of claims 1 to 11.
15. A computer device comprising a memory and a processor, wherein the memory has stored therein computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 11.
16. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the method according to any of claims 1 to 11.
CN202311557959.0A 2023-11-22 2023-11-22 Large language model training method, device, equipment and storage medium Active CN117272052B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311557959.0A CN117272052B (en) 2023-11-22 2023-11-22 Large language model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311557959.0A CN117272052B (en) 2023-11-22 2023-11-22 Large language model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117272052A CN117272052A (en) 2023-12-22
CN117272052B true CN117272052B (en) 2024-02-09

Family

ID=89218207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311557959.0A Active CN117272052B (en) 2023-11-22 2023-11-22 Large language model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117272052B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118173215A (en) * 2024-05-14 2024-06-11 北京壹永科技有限公司 Small model training method, method for treating tumor clinical record data and device thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595151A (en) * 2023-06-25 2023-08-15 杭州电子科技大学 Priori knowledge-based image reasoning question-answering method for inspiring large language model
CN116860922A (en) * 2023-04-28 2023-10-10 广州新华学院 Instruction-guided large language model-based self-correction intelligent teaching auxiliary method
CN116932733A (en) * 2023-04-07 2023-10-24 北京百度网讯科技有限公司 Information recommendation method and related device based on large language model
CN116975241A (en) * 2023-09-20 2023-10-31 广东技术师范大学 Liver cancer auxiliary diagnosis and question-answering method, system and medium based on large language model
CN116993421A (en) * 2023-06-29 2023-11-03 上海诊瑞医疗科技有限公司 Patient evaluation system based on large language model
CN117033608A (en) * 2023-09-28 2023-11-10 中国电子科技集团公司第十研究所 Knowledge graph generation type question-answering method and system based on large language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220383126A1 (en) * 2021-05-19 2022-12-01 Microsoft Technology Licensing, Llc Low-Rank Adaptation of Neural Network Models

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932733A (en) * 2023-04-07 2023-10-24 北京百度网讯科技有限公司 Information recommendation method and related device based on large language model
CN116860922A (en) * 2023-04-28 2023-10-10 广州新华学院 Instruction-guided large language model-based self-correction intelligent teaching auxiliary method
CN116595151A (en) * 2023-06-25 2023-08-15 杭州电子科技大学 Priori knowledge-based image reasoning question-answering method for inspiring large language model
CN116993421A (en) * 2023-06-29 2023-11-03 上海诊瑞医疗科技有限公司 Patient evaluation system based on large language model
CN116975241A (en) * 2023-09-20 2023-10-31 广东技术师范大学 Liver cancer auxiliary diagnosis and question-answering method, system and medium based on large language model
CN117033608A (en) * 2023-09-28 2023-11-10 中国电子科技集团公司第十研究所 Knowledge graph generation type question-answering method and system based on large language model

Also Published As

Publication number Publication date
CN117272052A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
Jaskari et al. Deep learning method for mandibular canal segmentation in dental cone beam computed tomography volumes
Hamm et al. Deep learning for liver tumor diagnosis part I: development of a convolutional neural network classifier for multi-phasic MRI
Kleppe et al. Designing deep learning studies in cancer diagnostics
US20230230680A1 (en) Computer vision technologies for rapid detection
CN117272052B (en) Large language model training method, device, equipment and storage medium
CN110797101B (en) Medical data processing method, medical data processing device, readable storage medium and computer equipment
DE112020004049T5 (en) DISEASE DETECTION FROM SPACIOUSLY ANNOTAIZED VOLUMETRIC MEDICAL IMAGERY USING CONVOLUTIONAL LONG SHORT TERM MEMORY
JP2023544466A (en) Training method and device for diagnostic model of lung adenocarcinoma and squamous cell carcinoma based on PET/CT
Thian et al. Deep learning systems for pneumothorax detection on chest radiographs: a multicenter external validation study
Snider et al. An image classification deep-learning algorithm for shrapnel detection from ultrasound images
CN111028945B (en) Classification prediction method and device based on data fusion and storage medium
Cheema et al. Liver extraction using residual convolution neural networks from low-dose CT images
Warin et al. Maxillofacial fracture detection and classification in computed tomography images using convolutional neural network-based models
CN116228787A (en) Image sketching method, device, computer equipment and storage medium
Pandey et al. Tumorous kidney segmentation in abdominal CT images using active contour and 3D-UNet
Kloenne et al. Domain-specific cues improve robustness of deep learning-based segmentation of CT volumes
Tang et al. Lesion segmentation and RECIST diameter prediction via click-driven attention and dual-path connection
Abdikerimova et al. Detection of chest pathologies using autocorrelation functions
EP4033497A1 (en) Training procedure and system for artificial intelligence intended for the analysis of mammographic data for the identification or exclusion of the presence of breast cancer
Kim et al. Bone age assessment using artificial intelligence in Korean pediatric population: a comparison of deep-learning models trained with healthy chronological and Greulich-Pyle ages as labels
Pal et al. A fully connected reproducible SE-UResNet for multiorgan chest radiographs segmentation
US11439341B2 (en) Diffusion imaging in Parkinson&#39;s disease and Parkinsonism
CN116091412A (en) Method for segmenting tumor from PET/CT image
WO2023274599A1 (en) Methods and systems for automated follow-up reading of medical image data
US20220293243A1 (en) Method and systems for the automated detection of free fluid using artificial intelligence for the focused assessment sonography for trauma (fast) examination for trauma care

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant