CN118155860A - Method, equipment and medium for aligning traditional Chinese medicine large model preference - Google Patents

Method, equipment and medium for aligning traditional Chinese medicine large model preference Download PDF

Info

Publication number
CN118155860A
CN118155860A CN202410437148.5A CN202410437148A CN118155860A CN 118155860 A CN118155860 A CN 118155860A CN 202410437148 A CN202410437148 A CN 202410437148A CN 118155860 A CN118155860 A CN 118155860A
Authority
CN
China
Prior art keywords
model
sequence
chinese medicine
preference
traditional chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410437148.5A
Other languages
Chinese (zh)
Inventor
张明川
柴龙飞
朱军龙
吴庆涛
王琳
刘牧华
李美雯
冯嘉美
葛又铭
夏丽晔
尚智伟
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Science and Technology
Original Assignee
Henan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Science and Technology filed Critical Henan University of Science and Technology
Priority to CN202410437148.5A priority Critical patent/CN118155860A/en
Publication of CN118155860A publication Critical patent/CN118155860A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/043Architecture, e.g. interconnection topology based on fuzzy logic, fuzzy membership or fuzzy inference, e.g. adaptive neuro-fuzzy inference systems [ANFIS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/90ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to alternative medicines, e.g. homeopathy or oriental medicines
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Biology (AREA)
  • Alternative & Traditional Medicine (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Fuzzy Systems (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method, equipment and medium for aligning traditional Chinese medicine large model preference, and relates to the technical field of natural language processing. The method comprises the following steps: constructing a standardized corpus, and training a first pre-training language model on the standardized corpus by adopting a self-supervision learning strategy and a supervised learning strategy to obtain a preliminarily aligned Chinese medicine large model; constructing a data partial order pair, and training a second pre-training language model based on the data partial order pair by adopting a reinforcement learning technology to obtain a trained rewarding model; according to the preliminarily aligned Chinese medicine big model and the trained reward model, carrying out preference alignment on the Chinese medicine big model based on reinforcement learning to obtain a Chinese medicine big model subjected to preference alignment; and according to the traditional Chinese medicine large model aligned by preference, performing model feedback optimization based on a neural network to obtain a final optimized traditional Chinese medicine large model. The invention can realize the personalized preference alignment of the large traditional Chinese medicine model, so that the model can generate answers more consistent with human preferences.

Description

Method, equipment and medium for aligning traditional Chinese medicine large model preference
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method, equipment and medium for aligning Chinese medicine large model preference.
Background
Along with the advancement of the modern progress of traditional Chinese medicine and the rapid development of natural language processing technology, the development of a language model capable of understanding and applying traditional Chinese medicine knowledge becomes a research hotspot. However, the theory system of traditional Chinese medicine is unique, and contains a great deal of knowledge such as terms and philosophy, which is clearly a serious challenge for understanding and applying the knowledge accurately for a large model of traditional Chinese medicine driven by data. Thus, the problem of alignment of the preferences of the large model of traditional Chinese medicine becomes particularly critical.
The core goal of preference alignment is to make the machine learning model output results consistent with human preferences, which means that for large models of traditional Chinese medicine, the results generated by the model when doing the task of question-answering in traditional Chinese medicine should meet the standards of traditional Chinese medicine theory and clinical practice. Some students try to achieve this by using a method of supervised learning, which trains a model by a large amount of labeling data of traditional Chinese medicine so that the model can learn a mode of human preference, but this method requires the use of a large amount of labeling data, which is relatively expensive to obtain, and which may contain inaccurate knowledge of traditional Chinese medicine or information inconsistent with the practice of traditional Chinese medicine, resulting in that the model may absorb incorrect preference in the learning process. In addition, some students try to align the model with preferences using rule-based methods, which ensure that the model output meets specific preferences by encoding medical expert knowledge or rules into the model, but such methods have difficulty capturing duplicate or fuzzy preferences, and are poor in scalability and subjectivity, and cannot meet the personalized needs of different users. Still other scholars have attempted to solve this problem using reinforcement learning by interacting a large model of chinese medicine with the environment to obtain feedback and thus adjust the output of the model to conform to the doctor's preference, but such preference alignment methods often lack an effective feedback mechanism, and the model aligned by such methods may have limited generalization ability when it performs well in a specific doctor or in a specific scenario, meaning that the model may have difficulty maintaining the same performance in other doctors or in different chinese medicine scenarios.
Disclosure of Invention
The invention aims to provide a method, equipment and medium for aligning the preference of a large traditional Chinese medicine model, so as to realize the individual preference alignment of the large traditional Chinese medicine model, and the finally optimized large traditional Chinese medicine model can generate an answer sequence meeting the standards of the traditional Chinese medicine theory and clinical practice according to a question sequence input by a user.
In order to achieve the above object, the present invention provides the following solutions:
a method of alignment of chinese medicine large model preferences, comprising:
Constructing a standardized corpus, and training a first pre-training language model on the standardized corpus by adopting a self-supervision learning strategy and a supervised learning strategy to obtain a preliminarily aligned Chinese medicine large model;
constructing a data partial order pair, and training a second pre-training language model based on the data partial order pair by adopting a reinforcement learning technology to obtain a trained rewarding model;
according to the preliminarily aligned Chinese medicine big model and the trained reward model, carrying out preference alignment on the Chinese medicine big model based on reinforcement learning to obtain a Chinese medicine big model subjected to preference alignment;
according to the traditional Chinese medicine large model aligned by preference, performing model feedback optimization based on a neural network to obtain a final optimized traditional Chinese medicine large model; the final optimized Chinese medicine large model is used for generating an answer sequence which accords with the standards of Chinese medicine theory and clinical practice according to the question sequence input by the user.
Optionally, a standardized corpus is constructed, and a first pre-training language model is trained on the standardized corpus by adopting a self-supervision learning strategy and a supervised learning strategy to obtain a preliminarily aligned Chinese medicine large model, which specifically comprises the following steps:
Obtaining traditional Chinese medical knowledge data and constructing a standardized corpus; the standardized corpus comprises a plurality of sequences, and each sequence comprises a plurality of Chinese characters;
defining a first loss function; the first loss function represents the negative log likelihood loss of the next Chinese character through model prediction;
Minimizing the first loss function by adopting a gradient descent algorithm, and updating trainable parameters of the first pre-training language model to obtain a pre-trained traditional Chinese medicine large model;
extracting a question sequence and a corresponding answer sequence from the standardized corpus according to a question-answer scene, and merging the question sequence and the corresponding answer sequence into a combined sequence;
Defining a second loss function; the second loss function converts the probability problem of maximizing the combined sequence into a minimization problem;
and minimizing the second loss function by adopting a gradient descent algorithm, and updating the trainable parameters of the pre-trained traditional Chinese medicine large model to obtain a preliminarily aligned traditional Chinese medicine large model.
Optionally, constructing a data partial order pair, and training a second pre-training language model based on the data partial order pair by adopting a reinforcement learning technology to obtain a trained rewarding model, which specifically comprises the following steps:
extracting a question sequence and a corresponding answer sequence from the standardized corpus according to a question-answer scene;
Aiming at each question sequence, a plurality of different medical large models are adopted to respectively generate different answer sequences, and a standard answer sequence and a corresponding extracted answer sequence are combined to generate partial sequence pairs according to the matching degree sequence;
replacing the embedded layer of the output text of the second pre-training language model with a projection layer of the output scalar to obtain a replaced second pre-training language model;
assigning scores to all answer sequences of each question sequence by adopting the replaced second pre-training language model to obtain score sequences;
Defining a third loss function; the third loss function is used for enabling the score difference between the high-quality answer sequence and the low-quality answer sequence to be larger;
And aiming at minimizing the third loss function, and back-propagating and updating the trainable parameters of the replaced second pre-training language model to obtain a trained rewarding model.
Optionally, performing bias alignment of the large traditional Chinese medicine model based on reinforcement learning according to the preliminarily aligned large traditional Chinese medicine model and the trained reward model to obtain the large traditional Chinese medicine model with aligned bias, which specifically comprises the following steps:
Extracting a problem sequence from the standardized corpus according to a question-answer scene to form a problem data set;
Based on a supervised fine tuning strategy, performing token sampling on each problem sequence in the problem data set by adopting the preliminarily aligned traditional Chinese medicine large model to obtain a corresponding response sequence;
Splicing and combining each problem sequence in the problem data set with a corresponding response sequence to obtain a spliced sequence;
Calculating a reward score by adopting the trained reward model according to the splicing sequence based on a reward optimization preference strategy, calculating a dominance score by adopting a generalized dominance function GAE in reinforcement learning, combining and carrying out normalization and shearing treatment to obtain an optimized reward score;
introducing a mean square error of KL divergence between the reward optimization preference strategy and the supervised fine tuning strategy as a penalty term to control the difference size of the reward optimization preference strategy and the supervised fine tuning strategy;
Defining a first Markov decision process to form a first reinforcement learning track; the state space of the first Markov decision process represents an input problem sequence, the action space represents a corresponding response sequence, and the reward function represents a scoring strategy after the reward model is optimized; the first reinforcement learning track comprises a question sequence, a corresponding response sequence and a reward score which are input at different time steps;
calculating the total return of the first reinforcement learning track according to the optimized reward score and the punishment item;
training the preliminarily aligned Chinese medicine large model by taking the maximum total return as a target to obtain the Chinese medicine large model after preference alignment.
Optionally, according to the traditional Chinese medicine large model aligned by preference, performing model feedback optimization based on a neural network to obtain a final optimized traditional Chinese medicine large model, which specifically comprises:
Constructing a fuzzy neural network; the fuzzy neural network comprises an input layer, a fuzzy layer, an reasoning layer and an output layer; the input layer is used for inputting weights of the evaluation text and the evaluation index; the evaluation text comprises an input question sequence and a corresponding response sequence; the fuzzy layer is used for processing the evaluation text to obtain a membership function of the evaluation index; the reasoning layer is used for dividing the grades of the fuzzy rules and calculating the excitation density of the fuzzy rules according to the membership function of the evaluation index; the output layer is used for calculating a preference alignment quality assessment result according to the excitation density of the fuzzy rule and the weight of the assessment index;
acquiring feedback information of a user; the feedback information includes: the input problem sequence, the corresponding response sequence and the feedback content;
The fuzzy neural network is adopted to evaluate the input problem sequence and the corresponding response sequence in the feedback information, so as to obtain a preference alignment quality evaluation result and map the preference alignment quality evaluation result to a corresponding preference alignment quality evaluation grade; the preference alignment quality rating comprises: excellent, good, medium, general and poor;
If the preference alignment quality evaluation level is medium, general or poor, or the preference alignment quality evaluation level is excellent or good, but the input problem sequence and the corresponding response sequence do not accord with the standards of the traditional Chinese medicine theory and the clinical practice, extracting the depth characteristics of the feedback information by adopting a convolutional neural network to obtain a depth characteristic expression sequence;
The feedback information is coded by adopting sparse self-coding, so that a weight matrix is obtained;
mapping the depth characteristic representation sequence to a plurality of characteristic nodes by adopting width learning to obtain a first characteristic sequence of feedback information;
Mapping the first characteristic sequence of the feedback information to a plurality of enhancement nodes by adopting an activation function to obtain a second characteristic sequence of the feedback information;
According to the weight matrix, fusing the feedback information first characteristic sequence and the feedback information second characteristic sequence to obtain a feedback information fusion characteristic matrix;
Defining a second Markov decision process to form a second reinforcement learning track; the state space of the second Markov decision process represents feedback information of the user, the action space represents corresponding preference returned according to the feedback information of the user, and the reward function represents return obtained by making preference alignment quality evaluation on the evaluation text by combining the feedback information; the second reinforcement learning track comprises feedback information acquired at different time steps, corresponding preference and obtained return;
calculating an average expected cumulative discount prize of the second reinforcement learning track according to the feedback information fusion feature matrix and the preference alignment quality assessment result;
And training the preference-aligned traditional Chinese medicine large model by taking the maximum average expected cumulative discount prize as a target to obtain a final optimized traditional Chinese medicine large model.
Optionally, the first pre-trained language model is a Qwen-14B model; the second pre-trained language model is Qwen-7B model.
Optionally, the calculation formula of the total return is:
wherein R (X i,RPi) is total return, R (X i,RPi) is the optimized reward score, eta is KL divergence coefficient, As penalty term,/>Optimizing preference policies for rewards, pi SFT(RPi|Xi) is a supervised fine tuning policy, X i is an input problem sequence, RP i is a corresponding response sequence.
Optionally, the calculation formula of the average desired cumulative discount prize is:
Where J π (τ ') is the average desired cumulative discount prize, E [ ] is the desire, t is the time step, γ ' t is the discount factor at the t-th time step, r ' t is the return obtained by combining feedback information to make a preferred alignment quality assessment on the assessment text at the t-th time step, s ' is the state, and s ' 0 is the initial state.
A computer device, comprising: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the traditional Chinese medicine large model preference alignment method.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the chinese medicine large model preference alignment method.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
According to the method for aligning the preference of the large traditional Chinese medicine model, firstly, a standardized corpus is built by integrating traditional Chinese medicine data, and a model is trained on the corpus by applying self-supervision learning and supervised learning strategies, so that the model is primarily aligned with tasks in the traditional Chinese medicine field; secondly, training a reward model by combining reinforcement learning technology with preference ordering data, wherein the reward model can calculate a reward value according to input and output information of a traditional Chinese medicine large model, and can evaluate consistency of the traditional Chinese medicine large model and human preference to a certain extent; thirdly, training the output preference of the alignment model by adopting a reward model optimization strategy based on reinforcement learning, so that the model can output an answer more consistent with the human preference; finally, evaluating the text alignment quality through a fuzzy neural network, and establishing a feedback loop optimization flow, thereby realizing personalized preference alignment of the traditional Chinese medicine large model. The invention can realize the personalized preference alignment of the large traditional Chinese medicine model so as to further improve the humanization degree and the professional accuracy of the response of the model, and the finally optimized large traditional Chinese medicine model can generate an answer sequence which accords with the standards of the traditional Chinese medicine theory and the clinical practice according to the question sequence input by the user.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for aligning the preferences of a large model of traditional Chinese medicine provided by the invention;
FIG. 2 is a specific flow chart of the method for aligning the preferences of the large model of the traditional Chinese medicine;
FIG. 3 is a framework diagram of a large model learning link of traditional Chinese medicine based on medical knowledge;
FIG. 4 is a flow chart of a large model learning link of traditional Chinese medicine based on medical knowledge;
FIG. 5 is a frame diagram of a bonus model training link based on a partial order pair provided by the invention;
FIG. 6 is a flowchart of a reward model training procedure based on a partial order pair provided by the invention;
FIG. 7 is a framework diagram of a big model preference alignment link of traditional Chinese medicine based on reinforcement learning;
FIG. 8 is a flow chart of a big model preference alignment link of traditional Chinese medicine based on reinforcement learning provided by the invention;
FIG. 9 is a framework diagram of a neural network-based model feedback optimization link provided by the invention;
fig. 10 is a flowchart of a model feedback optimization link based on a neural network.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a method, equipment and medium for aligning the preference of a large traditional Chinese medicine model, so as to realize the individual preference alignment of the large traditional Chinese medicine model, and the finally optimized large traditional Chinese medicine model can generate an answer sequence meeting the standards of the traditional Chinese medicine theory and clinical practice according to a question sequence input by a user.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1 and fig. 2, the method for aligning the preference of the large traditional Chinese medicine model provided by the invention comprises the following steps:
Step S1: constructing a standardized corpus, and training a first pre-training language model on the standardized corpus by adopting a self-supervision learning strategy and a supervised learning strategy to obtain a preliminarily aligned Chinese medicine large model; namely a large model learning link of traditional Chinese medicine based on medical knowledge. Preferably, the first pre-trained language model is a Qwen-14B model.
Step S2: constructing a data partial order pair, and training a second pre-training language model based on the data partial order pair by adopting a reinforcement learning technology to obtain a trained rewarding model; namely a reward model training link based on the partial order pair. Preferably, the second pre-trained language model is Qwen-7B model.
Step S3: according to the preliminarily aligned Chinese medicine big model and the trained reward model, carrying out preference alignment on the Chinese medicine big model based on reinforcement learning to obtain a Chinese medicine big model subjected to preference alignment; namely, the traditional Chinese medicine large model preference alignment link based on reinforcement learning.
Step S4: according to the traditional Chinese medicine large model aligned by preference, performing model feedback optimization based on a neural network to obtain a final optimized traditional Chinese medicine large model; namely, a model feedback optimization link based on a neural network. The final optimized Chinese medicine large model is used for generating an answer sequence which accords with the standards of Chinese medicine theory and clinical practice according to the question sequence input by the user.
The steps of the above links will be described in detail with reference to fig. 3 to 10.
Medical knowledge-based large model learning link of traditional Chinese medicine
Firstly, collecting and processing traditional Chinese medical knowledge data to construct a standardized corpus, then training Qwen-14B models by adopting a self-supervision learning strategy to complete the expansion of the models on the knowledge of the medical field, and then carrying out supervision fine tuning on Qwen-14B models according to specific field tasks and respectively combining corresponding instruction prompts to enable the models to be primarily aligned with question-answer tasks of the field. The framework and the flow of the large traditional Chinese medicine model learning link based on medical knowledge are shown in fig. 3 and 4.
Specifically, step S1 includes: obtaining traditional Chinese medical knowledge data and constructing a standardized corpus; the standardized corpus comprises a plurality of sequences, and each sequence comprises a plurality of Chinese characters; defining a first loss function; the first loss function represents the negative log likelihood loss of the next Chinese character through model prediction; minimizing the first loss function by adopting a gradient descent algorithm, and updating trainable parameters of the first pre-training language model to obtain a pre-trained traditional Chinese medicine large model; extracting a question sequence and a corresponding answer sequence from the standardized corpus according to a question-answer scene, and merging the question sequence and the corresponding answer sequence into a combined sequence; defining a second loss function; the second loss function converts the probability problem of maximizing the combined sequence into a minimization problem; and minimizing the second loss function by adopting a gradient descent algorithm, and updating the trainable parameters of the pre-trained traditional Chinese medicine large model to obtain a preliminarily aligned traditional Chinese medicine large model. The detailed steps of the whole flow are as follows:
(1) Collect and organize the relevant data of traditional Chinese medicine, construct data set TCM-D, data set contains N sequences, such as [ [ whatis the manifestation of damp-heavy-heat light syndrome? The pattern of light damp-heat refers to the condition of excessive damp-evil and depression of middle energizer and qi system, manifested as fever, heavy body, thirst, polydipsia, difficult urination, red tongue with yellow and greasy coating, slippery and rapid pulse. It is common in damp-warm diseases. The main symptoms of \n are fever, chest distress, epigastric fullness, abdominal distension, anorexia, nausea and vomiting, no thirst or no desire to drink or thirst with preference for hot drink, loose stool, turbid urine, white and greasy coating and soft pulse. Common complications of \n are: wet temperature. "]" (what is thought to hurt the spleen? The syndrome of impairment of spleen due to excessive anxiety refers to the condition of spleen qi stagnation and failure of transportation and transformation. Is commonly found in depression and other diseases. The main symptoms of \n are suspicion, dizziness, listlessness, palpitation, timidity, insomnia, amnesia, anorexia, lusterless complexion, pale tongue with thin and white coating, and thready pulse. Common complications of \n are: depression. "] [ whatis the turbid phlegm pattern? The syndrome of turbid phlegm refers to the condition of spleen failing to transport and gather dampness to form phlegm and phlegm qi blocking, and is manifested as cough and asthma, excessive phlegm, vomiting and dizziness, or smooth tumor in local area with wiry and slippery coating pulse. It is commonly found in headache and other diseases. The main symptoms of \n are cough and asthma, excessive phlegm, nausea and dizziness, or smooth bumps in the local area, wiry and slippery coating. N common complications … … "], … … ]. The ith sequence X i=[xi,1,xi,2,…,xi,m contains m Chinese characters, and the Chinese characters together form the relevant knowledge of the traditional Chinese medicine.
(2) A probability function p (X i)=p(xi,1,xi,2,…,xi,m) is defined, representing the probability that the ith sequence X i meets human medical preferences, p.gtoreq.0. If the criterion p=0 is not met.
(3) For each sequence X i in the dataset TCM-D, a probability p (X i) is calculated that it meets the human medical preference. According to the chain law, the probability p is expanded as: Where p (x i,j|xi,1,…,xi,j-1) represents the probability that the Qwen-14B model predicts the next chinese given the preceding chinese sequence.
(4) Defining a first loss functionRepresenting the negative log likelihood penalty for predicting the next Chinese character by the model, where P (X i,j+1∣xi,1...j1) represents the probability of predicting the Chinese character at position X i,j+1 given the first j Chinese character sequences X i,1...j in the ith sequence X i, and θ 1 represents the trainable parameters of the Qwen-14B model (i.e., the first pre-trained language model).
(5) The trainable parameter θ 1 for the Qwen-14B model is updated using a gradient descent algorithm to minimize the loss function LX. The probability that the model predicts the next Chinese character to accord with the medical preference of human is improved in the medical field, so that the aim of expanding Qwen-14B model medical knowledge is fulfilled.
(6) For different traditional Chinese medicine question-answering scenes, a certain amount of data are selected from the data set TCM-D respectively by combining with prompt instructions, and then M arrays containing a question sequence X i and a corresponding answer sequence Y i are combined. The prompt instruction aims at different question-answer scenes, and prompts of the question-answer scenes are spliced at the forefront of each piece of training data. For example, a question-answer scene introduced by traditional Chinese medicine symptoms can be formed by splicing an instruction on the front surface of each piece of training data under the scene, and a task introduced by traditional Chinese medicine symptoms is described. The questions are answered correctly using knowledge of traditional Chinese medicine. In the sentence of n/n# # ", each training data form after splicing is as follows: the following is an instruction describing the task of the introduction of the symptoms of traditional Chinese medicine. The questions are answered correctly using knowledge of traditional Chinese medicine. What are the evil invagination pericardium syndrome? The pattern of pathogenic factors invading the pericardium refers to the pattern of invasion of pathogenic heat, blockage and envelope, manifested as hyperpyrexia, coma, delirium, stiff tongue, cold limbs, convulsions and convulsions, and a dark-red tongue without tongue coating. It is common in wind-warm diseases. The main symptoms of \n are high fever, coma, delirium, stiff tongue, cold limbs, convulsion and convulsion, and a deep-red tongue without tongue coating. Common complications of \n are: and (5) wind temperature. "].
(7) Determining whether the answer sequence Y i corresponding to the question sequence X i (i.e., the second element in the previous array, such as "what is the pericardium invasion syndrome"), i.e., the third element in the previous array, such as "the pericardium invasion syndrome refers to the invagination of pathogenic warm evil … …"), substantially corresponds to the preference of human, if so, proceeding to step (9), and if not, proceeding to step (8). Wherein, the approximate agreement with human preference means that the information contained in the answer sequence Y i should be able to solve the problems in the problem sequence X i, and accord with the standards of Chinese medicine theory and clinical practice.
(8) The data that answer sequence Y i does not meet the human preference requires the domain expert to modify and supplement it according to X i and then go to step (9).
(9) The data processed in the steps (7) and (8) are spliced and combined to form a combined sequence H i,Hi which contains all elements of the question sequence and the answer sequence, and H i=[xi,1,xi,2,…,…yi,1,yi,2 and …, for example [ "the following is an instruction, and the task of introducing Chinese medical symptoms is described. The questions are answered correctly using knowledge of traditional Chinese medicine. What are the evil invagination pericardium syndrome? The pattern of pathogenic factors invading the pericardium refers to the pattern of invasion of pathogenic heat, blockage and envelope, manifested as hyperpyrexia, coma, delirium, stiff tongue, cold limbs, convulsions and convulsions, and a dark-red tongue without tongue coating. It is common in wind-warm diseases. The main symptoms of \n are high fever, coma, delirium, stiff tongue, cold limbs, convulsion and convulsion, and a deep-red tongue without tongue coating. Common complications of \n are: and (5) wind temperature. "].
(10) The goal of the training is to maximize the probability of H i, i.e., to maximize the combined probability of the question sequence and the answer sequence. Converting targets to maximize using chain law Where T represents the total length of the answer sequences in the combined sequence H i, P (y i,j+1∣Xi,yi,1...j2) represents the probability of predicting the next position y i,j+1 of the answer sequence in the case of the known question sequence X i and the first j answer sequences y i,1...j, and θ 2 represents the trainable parameters of the Qwen-14B model (i.e., the pre-trained Chinese medicine big model) after medical knowledge expansion.
(11) To reduce the computational complexity, the above problem is translated into minimizing the second loss function G [ X ].The goal is to maximize the probability of sequence H i, where M represents the total number of combined sequences, thereby increasing the probability of answers meeting human preferences when the model is tasked with question-answering.
(12) A gradient descent algorithm is used to minimize the function gx to obtain the optimal model parameters θ 2. The model can more accurately understand question-answering tasks under different scenes, and the model preference alignment effect is improved.
(II) reward model training link based on partial order pair
In the training link of the rewarding model, the invention aims to explore a method for enabling the model to get rid of expensive labeling data and realizing self iteration. First, a data partial order pair based on manual annotation ordering is constructed, and then a scalar fraction partial order pair based on the data partial order pair is established. And then constructing an evaluation rewarding model by using a maximum loss function based on partial order pairs aiming at the output result of the traditional Chinese medicine big model, and preparing for the following traditional Chinese medicine big model preference alignment link. The framework and flow of the bonus model training links based on the partial order pairs are shown in fig. 5 and 6.
Specifically, step S2 includes: extracting a question sequence and a corresponding answer sequence from the standardized corpus according to a question-answer scene; aiming at each question sequence, a plurality of different medical large models are adopted to respectively generate different answer sequences, and a standard answer sequence and a corresponding extracted answer sequence are combined to generate partial sequence pairs according to the matching degree sequence; replacing the embedded layer of the output text of the second pre-training language model with a projection layer of the output scalar to obtain a replaced second pre-training language model; assigning scores to all answer sequences of each question sequence by adopting the replaced second pre-training language model to obtain score sequences; defining a third loss function; the third loss function is used for enabling the score difference between the high-quality answer sequence and the low-quality answer sequence to be larger; and aiming at minimizing the third loss function, and back-propagating and updating the trainable parameters of the replaced second pre-training language model to obtain a trained rewarding model. The detailed steps of the whole flow are as follows:
(1) For different medical task question-answering scenes, a certain amount of data are selected from the data set TCM-D constructed in the prior art respectively by combining with prompt instructions, and then M arrays containing a question sequence X i and a corresponding answer sequence Y i are combined.
(2) For each question sequence X i, a different medical large model (such as HuatuoGPT, qiZhenGPT, chatMed, shenNong-TCM-LLM, medicalGPT, DISC-MedLLM, huangDI) is used for generating seven answer sequences Y i-1…7 *, a field expert is requested to write a standard answer sequence Y i * according to the questions, and nine answer sequences { Y i-1,Yi-2,Yi-3,...,Yi-9 } corresponding to X i are obtained by combining the original answer sequences in the data set TCM-D.
(3) Please the domain expert sort the nine answer sequences Y i-1…9 according to the matching degree of Y i-1…9 and X i to obtain a partial sequence pair
(4) Replacing the embedding layer of the Qwen-7B model output text with the projection layer of the output scalar, and using the model to score nine answer sequences Y i-1…9 of each question sequence X i to obtain a score sequenceΘ 3 represents the parameters of the Qwen-7B model at this point (i.e., the second pre-trained language model after replacement), and these scores are used to measure the preferred alignment of the answer sequence Y i-1…9 corresponding to X i, the higher the score, the higher the alignment. And by using a labeling ordering method, the model gets rid of absolute tasks, and is further converted into relative tasks, so that difficulty and cost for distinguishing the preference alignment quality are reduced. Since the Qwen-7B model is similar in architecture to the Qwen-14B model, it is easier to use Qwen-7B as the basis for the rewards model and Qwen-14B model has similar characteristics in model representation. In addition, the Qwen-7B model is faster in training speed, requires less computing resources and is not easy to overfit compared with the Qwen-14B model.
(5) Defining a third loss function Wherein θ 3 represents the trainable parameters of the Qwen-7B model after the replacement embedding layer is the projection layer; DT is a data sequence set marked by field expert sequencing; /(I)The number of answer sequences corresponding to each question is 9,In order to arrange and combine operations, the function value of loss is not too high due to too many answer sequences; /(I)The sum of all the differences is accumulated to obtain the expectation; (X, Y w,Yl) to DT represent one sample extracted from the dataset DT, each sample comprising one question sequence and two corresponding answer sequences, X in the sample representing the question sequence, Y w being a high quality answer sequence and Y l being a low quality answer sequence; sigma represents a sigmoid function, maps values between (0, 1),Representing the difference/>, for each termApplying a sigmoid function to obtain a normalized difference value; the goal is to score/>, a high quality answer sequenceAnd low quality answer sequence score/>Is larger.
(6) The update parameter θ 3 is back-propagated by a minimization operation on the loss function, so that the optimization model makes the score difference between the high quality answer sequence and the low quality answer sequence greater.
(7) When the loss function is reduced and the mean mu and the variance sigma of the answer sequence differences are increased, the model is provided with the capability of distinguishing the preference alignment quality, and the model parameter theta 3 at the moment is stored to obtain the rewarding model.
Third, traditional Chinese medicine large model preference alignment link based on reinforcement learning
In the traditional Chinese medicine large model preference alignment link, firstly, a reward score optimization strategy related to a reward model is established. Then, aiming at the problem of difference of the traditional Chinese medicine large model in output in the optimization process and before optimization, establishing strategy constraint on the reward model optimization strategy. And finally, constructing a traditional Chinese medicine large model preference alignment optimization strategy based on reinforcement learning. The framework and flow of the traditional Chinese medicine large model preference alignment link based on reinforcement learning are shown in fig. 7 and 8.
Specifically, step S3 includes: extracting a problem sequence from the standardized corpus according to a question-answer scene to form a problem data set; based on a supervised fine tuning strategy, performing token sampling on each problem sequence in the problem data set by adopting the preliminarily aligned traditional Chinese medicine large model to obtain a corresponding response sequence; splicing and combining each problem sequence in the problem data set with a corresponding response sequence to obtain a spliced sequence; calculating a reward score by adopting the trained reward model according to the splicing sequence based on a reward optimization preference strategy, calculating a dominance score by adopting a generalized dominance function GAE in reinforcement learning, combining and carrying out normalization and shearing treatment to obtain an optimized reward score; introducing a mean square error of KL divergence between the reward optimization preference strategy and the supervised fine tuning strategy as a penalty term to control the difference size of the reward optimization preference strategy and the supervised fine tuning strategy; defining a first Markov decision process to form a first reinforcement learning track; the state space of the first Markov decision process represents an input problem sequence, the action space represents a corresponding response sequence, and the reward function represents a scoring strategy after the reward model is optimized; the first reinforcement learning track comprises a question sequence, a corresponding response sequence and a reward score which are input at different time steps; calculating the total return of the first reinforcement learning track according to the optimized reward score and the punishment item; training the preliminarily aligned Chinese medicine large model by taking the maximum total return as a target to obtain the Chinese medicine large model after preference alignment. The detailed steps of the whole flow are as follows:
(1) Aiming at different medical task question-answering scenes, a certain amount of classical medical problem data are selected from the data sets TCM-D constructed in the prior art respectively by combining with prompt instructions, and then the data are combined into a new data set TCM-HQD containing K pieces of high-quality data.
(2) And obtaining a large primary aligned traditional Chinese medicine model from the first link, and obtaining a trained rewarding model from the second link.
(3) The data set TCM-HQD is traversed, and each piece of data in the TCM-HQD contains a problem sequence X i=[xi,1,xi,2,xi,3 …. And performing token sampling on each problem sequence X i by using the preliminarily aligned traditional Chinese medicine large model to obtain a response sequence Rp i=[rpi,1,rpi,2,…rpi,n, …, and generating the corresponding logarithmic probability of each token in the response sequence.
(4) And splicing and combining each traversed problem sequence X i with a corresponding response sequence RP i to obtain a new spliced sequence RH i=[xi,1,xi,2,xi,3,…,rpi,1,rpi,2,…,rpi,n, ….
(5) The preferred bonus score a n(Xi,RPi,n for the nth round rp i,n is calculated from RH i using the bonus model, and the dominance score b (X i,RPi,n) for the nth round rp i,n is calculated from the generalized dominance function GAE in reinforcement learning. Combining the bonus score and the dominance score to obtain a new bonus score r *(Xi,RPi,n), accumulating the new bonus score of each round to obtain a bonus scoreWhere n represents the current training round and B represents the total training round.
(6) To prevent large fluctuations in the training process, the current score r *(Xi,RPi) is normalized and clipped by the bonus scaling to obtain an optimized bonus score: where δ represents the clipping region, σ (r *(Xi,RPi)) and/> R *(Xi,RPi) are shown separately.
(7) In order to ensure that the preference strategy of the model in the optimization process does not deviate too much from the initial supervised fine tuning strategy, the preference strategy is optimized in rewardsAnd a supervised fine tuning strategy pi SFT(RPi|Xi) to control the magnitude of the difference of the two strategies.
(8) Mean square error using KL divergence, i.e Where KL (·, ·) represents the policy/>, at each position of the response sequence RP i And KL divergence between pi SFT. The model parameters and the fine tuning process are adjusted by introducing the minimized KL divergence, so that the difference of the two strategies at the RP i position is as small as possible, and the stability and the continuity in the model preference alignment process are maintained.
(9) Defining a Markov decision process M= < S, A, R, P and gamma >, regarding the process of continuously and dynamically adjusting the scores of response sequences RP in an optimization strategy as a Markov decision process, wherein a state space S represents a possibly input problem sequence X, an action space A represents a response sequence RP corresponding to the problem sequence X, a reward function R represents a scoring strategy R (X, RP) after the reward model optimization, a state transition P represents state transitions X' to P (|X, RP) of current input information, and a discount factor gamma is used for controlling the importance of future rewards.
(10) For the input problem sequence X i, the current decision strategy is adoptedUpdating it to generate response sequence RP i, by which a reinforcement-learned trajectory τ is formed, containing an alignment sequence between the question and the response :τ=(Xi,RPi,r2,X'i,RP'i,r3,X"i,RP"i,r4,…).
(11) Rewarding calculation by using rewarding scoring strategy and strategy constraint method, and calculating total return for input problem sequence X i and response sequence RP i obtained by decision model Wherein R (X i,RPi) is total return, R (X i,RPi) is the optimized reward score, eta is KL divergence coefficient, used for controlling the intensity of KL penalty,As penalty term,/>Optimizing preference policies for rewards, pi SFT(RPi|Xi) is a supervised fine tuning policy, X i is an input problem sequence, RP i is a corresponding response sequence.
(12) And searching an optimal decision strategy pi * by maximizing the total return R (X i,RPi) to obtain a large traditional Chinese medicine model aligned by reinforcement learning preference.
Model feedback optimization link based on neural network
In the model feedback optimization link, firstly, a medical preference alignment quality assessment method based on a fuzzy neural network is established. And then, extracting the characteristics of the feedback information of the user based on the width learning, and understanding the main trend of the feedback problem of the user. And finally, constructing a model preference alignment optimization strategy based on reinforcement learning by using the feedback information. And deploying the optimized model, repeating feedback loop optimization, and realizing feedback-based model preference alignment. The framework and flow of the neural network-based model feedback optimization procedure are shown in fig. 9 and 10.
Specifically, step S4 includes the following. A fuzzy neural network is constructed, comprising an input layer, a fuzzy layer, a reasoning (inference) layer and an output layer. The input layer is used for inputting the evaluation text and the weights of the evaluation indices, the evaluation text comprising an input question sequence and the corresponding response sequence; the fuzzy layer is used for processing the evaluation text to obtain the membership functions of the evaluation indices; the inference layer is used for dividing the fuzzy rules into grades and calculating the excitation density of each fuzzy rule according to the membership functions of the evaluation indices; and the output layer is used for calculating a preference alignment quality evaluation result according to the excitation densities of the fuzzy rules and the weights of the evaluation indices.

Feedback information of a user is then acquired; the feedback information includes the input question sequence, the corresponding response sequence and the feedback content. The fuzzy neural network evaluates the input question sequence and the corresponding response sequence in the feedback information to obtain a preference alignment quality evaluation result, which is mapped to a corresponding preference alignment quality evaluation grade; the grades comprise excellent, good, medium, general and poor.

If the preference alignment quality evaluation grade is medium, general or poor, or if the grade is excellent or good but the input question sequence and the corresponding response sequence do not accord with the standards of traditional Chinese medicine theory and clinical practice, the following processing is performed: a convolutional neural network extracts depth features of the feedback information to obtain a depth feature representation sequence; the feedback information is encoded by sparse self-encoding to obtain a weight matrix; width learning (broad learning) maps the depth feature representation sequence to a plurality of feature nodes to obtain a first feedback-information feature sequence; an activation function maps the first feature sequence to a plurality of enhancement nodes to obtain a second feedback-information feature sequence; and, according to the weight matrix, the first and second feature sequences are fused to obtain a feedback-information fusion feature matrix.

A second Markov decision process is then defined to form a second reinforcement learning trajectory. Its state space represents the feedback information of the user, its action space represents the corresponding preference returned according to the user feedback, and its reward function represents the return obtained by making a preference alignment quality evaluation on the evaluation text in combination with the feedback information; the second reinforcement learning trajectory comprises the feedback information acquired at different time steps, the corresponding preferences and the obtained returns. The average expected cumulative discounted reward of the second reinforcement learning trajectory is calculated according to the feedback-information fusion feature matrix and the preference alignment quality evaluation results, and the preference-aligned traditional Chinese medicine large model is trained with the goal of maximizing this average expected cumulative discounted reward, to obtain the final optimized traditional Chinese medicine large model. The detailed steps of the whole flow are as follows:
(1) A set of indices C = {c_1, c_2, …, c_e} for evaluating the quality of model responses is defined, covering, for example, text quality, relevance and medical accuracy. The weights W_index of the e evaluation indices are then determined with an improved analytic hierarchy process (an illustrative weighting sketch is given after step (16)).
(2) A four-layer fuzzy neural network is constructed, comprising an input layer O1, a fuzzy layer O2, an inference layer O3 and an output layer O4.
(3) The evaluation text X_tre and the evaluation index weights W_index are input into the input layer O1, and the fuzzy layer O2 processes the evaluation text to obtain the membership function μ_kq of each evaluation index, namely $\mu_{kq} = \exp\big(-(x_{tre,kq} - a_{kq})^2 / \sigma_{kq}^2\big)$, where x_tre,kq represents the evaluation of the k-th text on the q-th index, a_kq represents the center of the membership function, and σ_kq represents the width of the membership function.
(4) The inference layer O3 divides the fuzzy rules into five grades (excellent, good, medium, general and poor) and calculates the excitation density of each fuzzy rule as $p_u = \prod_{q=1}^{e} \mu_{qu}$, where μ_qu represents the degree of membership of the q-th evaluation index to the u-th fuzzy rule grade, e represents the number of evaluation indices, and u is the fuzzy rule grade index with 1 ≤ u ≤ 5.
(5) The output layer O4 expands and defuzzifies the rule outputs of each node to obtain the network output value, i.e. the final preference alignment quality evaluation result $y = \big(\sum_{q=1}^{d} p_q w_q\big) / \big(\gamma^{*} + \sum_{q=1}^{d} p_q\big)$, where γ* denotes a smoothing coefficient, d denotes the number of nodes of the fuzzy layer O2, and p_q and w_q are respectively the fuzzy rule excitation density and weight of the q-th alignment quality evaluation index text.
(6) The output value of the fuzzy neural network is mapped to the corresponding preference alignment quality evaluation grade, and corresponding score intervals are divided for subsequent text alignment quality evaluation (a minimal sketch of steps (2)-(6) is given after step (16)).
(7) User feedback information W_back is collected. Each piece of feedback includes the model input to which it corresponds (i.e. the input question sequence), the model response (i.e. the corresponding response sequence), and feedback content such as error reports, improvement suggestions and other comments. The model input and response of each piece of feedback are evaluated with the preference alignment quality evaluation method established above, and the evaluation result is stored with that piece of data; if the result is excellent or good, the flow proceeds to step (8), otherwise it proceeds to step (9).
(8) Data whose evaluation result from the preference alignment quality evaluation method is excellent or good but which still carries feedback content are re-examined by a domain expert. If the model response portion of the data meets the human preference requirements, the feedback data is ignored. If it does not, the evaluation grade of the data is revised and the flow proceeds to step (9), while the data is collected for subsequent training of the preference alignment quality evaluation method.
(9) A convolutional neural network is used to extract depth features from the feedback information of different users, different question-and-answer scenarios and different evaluation results, forming a depth feature representation sequence {ca_t}, where K represents the number of categories of disease or symptom information, t indexes the time of the user feedback, and ca_t represents the disease or symptom information fed back at time t (an illustrative sketch of steps (9)-(10) is given after step (16)).
(10) The feedback information is encoded with sparse self-encoding (a sparse autoencoder) to obtain a weight matrix.
(11) The obtained feedback-information features are mapped to b feature nodes by means of width learning (broad learning), giving the first feedback-information feature sequence (a sketch of steps (11)-(13) is given after step (16)).
(12) Considering the complexity, asynchrony and variability of the feedback information, an activation function is further used to map the feedback-information features to d enhancement nodes, giving the second feedback-information feature sequence $H = \xi(Z W_h + \beta_h)$, where Z is the first feature sequence, W_h and β_h are a random weight matrix and bias, and ξ(·) is a nonlinear activation function.
(13) The feature nodes and the enhancement nodes are concatenated to obtain the feedback-information fusion feature matrix $A'W$, where A' is the input matrix formed by connecting the two feature sequences and W is the output connection weight matrix (the weight matrix obtained in step (10)). Passing the input matrix through W reveals the overall situation of the user feedback information; the fusion feature matrix thus provides comprehensive features of the user feedback and, combined with the preference alignment quality evaluation results of the feedback information, allows the average expected cumulative discounted reward of the reinforcement learning trajectory to be calculated.
(14) The model feedback optimization process is modeled as a Markov decision process M' = {S', A', R', P', γ'}, from which trajectories τ' = {s'_t, a'_t, r'_t, s'_{t+1}, a'_{t+1}, r'_{t+1}, …} are collected. The state space S' represents the model input sequence, the model response sequence and the user feedback information; the action space A' represents the corresponding preference returned by the model according to the user feedback, in forms such as adjusting model parameters and updating the weight matrix; the reward function R' represents the return obtained by performing preference alignment quality evaluation on the input text in combination with the feedback information, i.e. it is computed from the feedback-information fusion feature matrix, which contains both the user feedback and the evaluation result information; P' represents the state transition probability, i.e. the probability of transitioning to the next state after performing a given action in a given state, specifically the effect of the model response on the user feedback and the effect of the action on the model state; and the discount factor γ' controls the importance of future rewards.
(15) The model preference optimization problem based on user feedback is converted into the problem of maximizing the average expected cumulative discounted reward and finding the optimal preference alignment policy π*, i.e. $\pi^{*} = \arg\max_{\pi} J_{\pi}(\tau') = \arg\max_{\pi} \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma'^{\,t} r'_{t} \mid s'_{0} = s'\big]$, where J_π(τ') is the average expected cumulative discounted reward, E[·] denotes the expectation taken over all possible trajectories, t is the time step, γ'^t is the discount factor at the t-th time step and expresses the relative importance of current versus future rewards, r'_t is the return obtained at the t-th time step by performing preference alignment quality evaluation on the evaluation text in combination with the feedback information (i.e. the immediate reward obtained after executing action a'_t in state s'_t), s' is a state, and s'_0 is the initial state. The model is trained with a reinforcement learning algorithm to find the optimal policy π*, so that optimal preference alignment is achieved (a sketch of steps (14)-(15) is given after step (16)).
(16) The improved model is redeployed into practical application; personalized preference alignment of the model continues according to new feedback information and the model optimization strategy, and iteration continues until the model approaches the optimal state.
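As a non-limiting illustration of the index weighting in step (1), the following Python sketch derives weights for three hypothetical indices (text quality, relevance, medical accuracy) using the standard principal-eigenvector form of the analytic hierarchy process. The pairwise comparison values, the function name ahp_weights and the use of plain rather than "improved" AHP are assumptions made for illustration only.

import numpy as np

def ahp_weights(pairwise: np.ndarray) -> np.ndarray:
    """Index weights from a pairwise comparison matrix via the
    principal-eigenvector method of the analytic hierarchy process."""
    eigvals, eigvecs = np.linalg.eig(pairwise)
    principal = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
    return principal / principal.sum()

# Hypothetical pairwise comparisons of text quality, relevance and medical accuracy.
pairwise = np.array([
    [1.0, 1 / 2, 1 / 3],
    [2.0, 1.0, 1 / 2],
    [3.0, 2.0, 1.0],
])
w_index = ahp_weights(pairwise)   # normalized weights of the e = 3 indices, summing to 1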
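The evaluation flow of steps (2) to (6) can be sketched as follows. The Gaussian membership form follows from the centre and width parameters a_kq and σ_kq of step (3); the product T-norm used for the rule excitation density, the crisp grade values, the score intervals in to_grade and all numeric settings are assumptions, since the text leaves these details open.

import numpy as np

GRADES = ["poor", "general", "medium", "good", "excellent"]

def fnn_evaluate(scores, centres, widths, grade_values, gamma=1e-6):
    """Four-layer fuzzy evaluator: Gaussian fuzzification, product rule
    excitation and weighted defuzzification to one quality score y.

    scores       : (e,)    evaluation-index scores of one text
    centres      : (e, 5)  membership centres a_kq per index and grade
    widths       : (e, 5)  membership widths sigma_kq
    grade_values : (5,)    crisp value attached to each grade
    """
    mu = np.exp(-((scores[:, None] - centres) ** 2) / widths ** 2)  # fuzzy layer
    p = mu.prod(axis=0)                                             # excitation density per grade
    return float((p * grade_values).sum() / (gamma + p.sum()))      # output layer

def to_grade(y, bins=(0.2, 0.4, 0.6, 0.8)):
    """Map the crisp output y to one of the five grades via score intervals."""
    return GRADES[int(np.digitize(y, bins))]

e = 3
rng = np.random.default_rng(0)
y = fnn_evaluate(rng.uniform(0, 1, e),
                 np.tile(np.linspace(0.1, 0.9, 5), (e, 1)),
                 np.full((e, 5), 0.2),
                 np.linspace(0.1, 0.9, 5))
print(y, to_grade(y))

The small constant added to the denominator plays the role of the smoothing coefficient γ* of step (5).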
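For steps (9) and (10), the sketch below uses PyTorch: a small one-dimensional convolutional network turns an embedded feedback sequence into a fixed-length depth feature vector, and a single-hidden-layer autoencoder with an L1 sparsity penalty on its code yields a weight matrix. The layer sizes, the upstream embedding, the training schedule and the choice of the encoder weight as the weight matrix are illustrative assumptions rather than the patented configuration.

import torch
import torch.nn as nn

class FeedbackCNN(nn.Module):
    """1-D CNN mapping an embedded feedback sequence to a depth feature vector."""
    def __init__(self, emb_dim=128, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                       # pool over the token dimension
        )

    def forward(self, x):                                  # x: (batch, seq_len, emb_dim)
        return self.conv(x.transpose(1, 2)).squeeze(-1)    # (batch, feat_dim)

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer autoencoder; an L1 penalty on the code enforces sparsity."""
    def __init__(self, in_dim=64, code_dim=32):
        super().__init__()
        self.encoder = nn.Linear(in_dim, code_dim)
        self.decoder = nn.Linear(code_dim, in_dim)

    def forward(self, x):
        code = torch.sigmoid(self.encoder(x))
        return self.decoder(code), code

# Toy run: embeddings of 8 feedback texts, 50 tokens each (embedding assumed upstream).
feedback_emb = torch.randn(8, 50, 128)
depth_features = FeedbackCNN()(feedback_emb).detach()     # depth feature representation, (8, 64)

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):                                      # short illustrative training loop
    recon, code = sae(depth_features)
    loss = nn.functional.mse_loss(recon, depth_features) + 1e-3 * code.abs().mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
weight_matrix = sae.encoder.weight.detach()               # weight matrix taken for the fusion step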
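Steps (11) to (13) follow the broad learning (width learning) pattern of random feature nodes, nonlinear enhancement nodes and concatenation into a fused matrix A'. The node counts, the tanh activation standing in for ξ(·), and the randomly drawn stand-in for the output connection weight matrix W are assumptions for illustration; in the method itself W comes from the sparse self-encoding of step (10).

import numpy as np

def broad_features(x, n_feat=6, feat_width=8, n_enh=40, seed=0):
    """Map inputs to feature nodes Z and enhancement nodes H, then
    concatenate them into the fused input matrix A' = [Z | H]."""
    rng = np.random.default_rng(seed)
    d_in = x.shape[1]
    Z = np.concatenate(
        [x @ rng.normal(size=(d_in, feat_width)) + rng.normal(size=feat_width)
         for _ in range(n_feat)], axis=1)                  # first feature sequence
    H = np.tanh(Z @ rng.normal(size=(Z.shape[1], n_enh))
                + rng.normal(size=n_enh))                  # second feature sequence
    return Z, H, np.concatenate([Z, H], axis=1)            # fused matrix A'

x = np.random.randn(8, 64)                                 # e.g. the CNN depth features above
Z, H, A_prime = broad_features(x)
W_out = np.random.randn(A_prime.shape[1], 16)              # stand-in for the weight matrix W
fusion = A_prime @ W_out                                   # feedback-information fusion feature matrix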
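For steps (14) and (15), the sketch below records a trajectory τ' of the feedback-optimization Markov decision process and estimates the average expected cumulative discounted reward J_π(τ') by Monte-Carlo averaging over sampled trajectories. How the rewards r'_t are produced (the fuzzy evaluator applied to the fused feedback features) and which reinforcement learning algorithm maximizes the estimate (PPO is one common choice) are left abstract; the Step record and the synthetic rewards are assumptions.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Step:
    """One transition (s'_t, a'_t, r'_t) of the feedback-optimization MDP."""
    state: dict      # model input, model response and user feedback at step t
    action: dict     # preference adjustment returned by the model
    reward: float    # preference-alignment quality return r'_t

Trajectory = List[Step]

def discounted_return(traj: Trajectory, gamma: float = 0.99) -> float:
    """Cumulative discounted reward sum_t gamma**t * r'_t of one trajectory."""
    return float(sum(gamma ** t * step.reward for t, step in enumerate(traj)))

def average_expected_return(trajs: List[Trajectory], gamma: float = 0.99) -> float:
    """Monte-Carlo estimate of J_pi(tau') averaged over sampled trajectories."""
    return float(np.mean([discounted_return(t, gamma) for t in trajs]))

# Toy example with synthetic rewards.
trajs = [[Step({}, {}, float(r)) for r in np.random.uniform(0, 1, 10)] for _ in range(4)]
print(average_expected_return(trajs))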
In summary, in order to solve the preference alignment problem of the traditional Chinese medicine large model, the invention provides a new method for traditional Chinese medicine large model preference alignment based on doctor feedback and reinforcement learning. The method establishes an effective feedback mechanism by strengthening the interaction between medical experts and the model, so that the model generates answers more consistent with human preferences. Supervised learning is combined with preference ordering data to construct a reward model that accurately reflects human preferences, and reinforcement learning algorithms are then used to iteratively optimize the model strategy and neural network structure, achieving continuous improvement of traditional Chinese medicine large model preference alignment and of the feedback mechanism itself.
In one embodiment, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, there is also provided a computer device comprising a memory and a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method embodiments described above when the computer program is executed.
Further, the computer device also includes an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store the pending transactions. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the object information (including, but not limited to, object device information, object personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) related to the present application are both information and data authorized by the object or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations of these technical features are described, but as long as a combination of technical features contains no contradiction, it should be considered to be within the scope of this description.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the method of the present invention and its core ideas; likewise, modifications made by those of ordinary skill in the art in light of the present teachings fall within the scope of the present invention. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (10)

1. A method for aligning the preference of a large model of traditional Chinese medicine, which is characterized by comprising the following steps:
Constructing a standardized corpus, and training a first pre-training language model on the standardized corpus by adopting a self-supervision learning strategy and a supervised learning strategy to obtain a preliminarily aligned Chinese medicine large model;
constructing a data partial order pair, and training a second pre-training language model based on the data partial order pair by adopting a reinforcement learning technology to obtain a trained reward model;
according to the preliminarily aligned Chinese medicine big model and the trained reward model, carrying out preference alignment on the Chinese medicine big model based on reinforcement learning to obtain a Chinese medicine big model subjected to preference alignment;
according to the traditional Chinese medicine large model aligned by preference, performing model feedback optimization based on a neural network to obtain a final optimized traditional Chinese medicine large model; the final optimized Chinese medicine large model is used for generating an answer sequence which accords with the standards of Chinese medicine theory and clinical practice according to the question sequence input by the user.
2. The method for aligning the preferences of the large model of the traditional Chinese medicine according to claim 1, wherein a standardized corpus is constructed, and a first pre-training language model is trained on the standardized corpus by adopting a self-supervision learning strategy and a supervised learning strategy to obtain the large model of the traditional Chinese medicine which is primarily aligned, and the method specifically comprises the following steps:
Obtaining traditional Chinese medical knowledge data and constructing a standardized corpus; the standardized corpus comprises a plurality of sequences, and each sequence comprises a plurality of Chinese characters;
defining a first loss function; the first loss function represents the negative log likelihood loss of the next Chinese character through model prediction;
Minimizing the first loss function by adopting a gradient descent algorithm, and updating trainable parameters of the first pre-training language model to obtain a pre-trained traditional Chinese medicine large model;
extracting a question sequence and a corresponding answer sequence from the standardized corpus according to a question-answer scene, and merging the question sequence and the corresponding answer sequence into a combined sequence;
Defining a second loss function; the second loss function converts the probability problem of maximizing the combined sequence into a minimization problem;
and minimizing the second loss function by adopting a gradient descent algorithm, and updating the trainable parameters of the pre-trained traditional Chinese medicine large model to obtain a preliminarily aligned traditional Chinese medicine large model.
3. The method for aligning the preference of the large traditional Chinese medicine model according to claim 1, wherein the method comprises the steps of constructing a data partial order pair, training a second pre-training language model based on the data partial order pair by adopting a reinforcement learning technology, and obtaining a trained reward model, and specifically comprises the following steps:
extracting a question sequence and a corresponding answer sequence from the standardized corpus according to a question-answer scene;
for each question sequence, adopting a plurality of different medical large models to respectively generate different answer sequences, and combining these with the standard answer sequence and the correspondingly extracted answer sequence, ranked according to matching degree, to generate partial order pairs;
replacing the output-text embedding layer of the second pre-training language model with a projection layer that outputs a scalar, to obtain a replaced second pre-training language model;
assigning scores to all answer sequences of each question sequence by adopting the replaced second pre-training language model to obtain score sequences;
Defining a third loss function; the third loss function is used for enabling the score difference between the high-quality answer sequence and the low-quality answer sequence to be larger;
and, with the goal of minimizing the third loss function, back-propagating and updating the trainable parameters of the replaced second pre-training language model to obtain a trained reward model.
4. The method for aligning the preferences of the large model of the traditional Chinese medicine according to claim 1, wherein the method for aligning the preferences of the large model of the traditional Chinese medicine based on reinforcement learning according to the preliminarily aligned large model of the traditional Chinese medicine and the trained rewarding model is carried out to obtain the large model of the traditional Chinese medicine after the preference alignment, and specifically comprises the following steps:
Extracting a problem sequence from the standardized corpus according to a question-answer scene to form a problem data set;
Based on a supervised fine tuning strategy, performing token sampling on each problem sequence in the problem data set by adopting the preliminarily aligned traditional Chinese medicine large model to obtain a corresponding response sequence;
Splicing and combining each problem sequence in the problem data set with a corresponding response sequence to obtain a spliced sequence;
Calculating a reward score by adopting the trained reward model according to the splicing sequence based on a reward optimization preference strategy, calculating an advantage score by adopting generalized advantage estimation (GAE) in reinforcement learning, and combining the two, followed by normalization and clipping, to obtain an optimized reward score;
introducing a mean square error of KL divergence between the reward optimization preference strategy and the supervised fine tuning strategy as a penalty term to control the difference size of the reward optimization preference strategy and the supervised fine tuning strategy;
Defining a first Markov decision process to form a first reinforcement learning track; the state space of the first Markov decision process represents an input problem sequence, the action space represents a corresponding response sequence, and the reward function represents a scoring strategy after the reward model is optimized; the first reinforcement learning track comprises a question sequence, a corresponding response sequence and a reward score which are input at different time steps;
calculating the total return of the first reinforcement learning track according to the optimized reward score and the punishment item;
training the preliminarily aligned Chinese medicine large model by taking the maximum total return as a target to obtain the Chinese medicine large model after preference alignment.
5. The method for aligning the preferences of the large traditional Chinese medicine model according to claim 1, wherein the method for optimizing the model feedback based on the neural network according to the large traditional Chinese medicine model with the aligned preferences, to obtain the final optimized large traditional Chinese medicine model, specifically comprises the following steps:
Constructing a fuzzy neural network; the fuzzy neural network comprises an input layer, a fuzzy layer, a reasoning layer and an output layer; the input layer is used for inputting weights of the evaluation text and the evaluation index; the evaluation text comprises an input question sequence and a corresponding response sequence; the fuzzy layer is used for processing the evaluation text to obtain a membership function of the evaluation index; the reasoning layer is used for dividing the grades of the fuzzy rules and calculating the excitation density of the fuzzy rules according to the membership function of the evaluation index; the output layer is used for calculating a preference alignment quality assessment result according to the excitation density of the fuzzy rule and the weight of the assessment index;
acquiring feedback information of a user; the feedback information includes: the input problem sequence, the corresponding response sequence and the feedback content;
The fuzzy neural network is adopted to evaluate the input problem sequence and the corresponding response sequence in the feedback information, so as to obtain a preference alignment quality evaluation result and map the preference alignment quality evaluation result to a corresponding preference alignment quality evaluation grade; the preference alignment quality rating comprises: excellent, good, medium, general and poor;
If the preference alignment quality evaluation level is medium, general or poor, or the preference alignment quality evaluation level is excellent or good, but the input problem sequence and the corresponding response sequence do not accord with the standards of the traditional Chinese medicine theory and the clinical practice, extracting the depth characteristics of the feedback information by adopting a convolutional neural network to obtain a depth characteristic expression sequence;
The feedback information is coded by adopting sparse self-coding, so that a weight matrix is obtained;
mapping the depth characteristic representation sequence to a plurality of characteristic nodes by adopting width learning to obtain a first characteristic sequence of feedback information;
Mapping the first characteristic sequence of the feedback information to a plurality of enhancement nodes by adopting an activation function to obtain a second characteristic sequence of the feedback information;
According to the weight matrix, fusing the feedback information first characteristic sequence and the feedback information second characteristic sequence to obtain a feedback information fusion characteristic matrix;
Defining a second Markov decision process to form a second reinforcement learning track; the state space of the second Markov decision process represents feedback information of the user, the action space represents corresponding preference returned according to the feedback information of the user, and the reward function represents return obtained by making preference alignment quality evaluation on the evaluation text by combining the feedback information; the second reinforcement learning track comprises feedback information acquired at different time steps, corresponding preference and obtained return;
calculating an average expected cumulative discounted reward of the second reinforcement learning track according to the feedback information fusion feature matrix and the preference alignment quality assessment result;
and training the preference-aligned traditional Chinese medicine large model with the goal of maximizing the average expected cumulative discounted reward, to obtain a final optimized traditional Chinese medicine large model.
6. The method for aligning traditional Chinese medicine large model preferences according to claim 1, wherein the first pre-trained language model is the Qwen-14B model and the second pre-trained language model is the Qwen-7B model.
7. The method for aligning large model preferences of traditional Chinese medicine according to claim 4, wherein the calculation formula of the total return is:
$R(X_i, RP_i) = r(X_i, RP_i) - \eta\, D_{\mathrm{KL}}\big(\pi^{RO}(RP_i \mid X_i)\,\|\,\pi^{SFT}(RP_i \mid X_i)\big)$, wherein R(X_i, RP_i) is the total return, r(X_i, RP_i) is the optimized reward score, η is the KL divergence coefficient, the KL divergence term is the penalty term, π^{RO}(RP_i|X_i) is the reward optimization preference strategy, π_SFT(RP_i|X_i) is the supervised fine-tuning strategy, X_i is an input question sequence, and RP_i is the corresponding response sequence.
8. The method for aligning traditional Chinese medicine large model preferences according to claim 5, wherein the average expected cumulative discounted reward is calculated by the formula:
$J_{\pi}(\tau') = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma'^{\,t} r'_{t} \mid s'_{0} = s'\big]$, where J_π(τ') is the average expected cumulative discounted reward, E[·] is the expectation, t is the time step, γ'^t is the discount factor at the t-th time step, r'_t is the return obtained at the t-th time step by performing preference alignment quality evaluation on the evaluation text in combination with the feedback information, s' is a state, and s'_0 is the initial state.
9. A computer device, comprising: a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the traditional Chinese medicine large model preference alignment method according to any one of claims 1-8.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the traditional Chinese medicine large model preference alignment method according to any one of claims 1-8.
CN202410437148.5A 2024-04-11 2024-04-11 Method, equipment and medium for aligning traditional Chinese medicine large model preference Pending CN118155860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410437148.5A CN118155860A (en) 2024-04-11 2024-04-11 Method, equipment and medium for aligning traditional Chinese medicine large model preference

Publications (1)

Publication Number Publication Date
CN118155860A true CN118155860A (en) 2024-06-07

Family

ID=91290844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410437148.5A Pending CN118155860A (en) 2024-04-11 2024-04-11 Method, equipment and medium for aligning traditional Chinese medicine large model preference

Country Status (1)

Country Link
CN (1) CN118155860A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118485141A (en) * 2024-07-16 2024-08-13 北京大学 Training method and device for Chinese medical large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination