CN116757270A - Data processing method and server based on man-machine interaction model or large model


Info

Publication number
CN116757270A
Authority
CN
China
Prior art keywords: reply information, model, input sample, reply, pieces
Prior art date
Legal status: Pending
Application number
CN202310777685.XA
Other languages
Chinese (zh)
Inventor
郁博文
宋非凡
余海洋
李永彬
黄非
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202310777685.XA
Publication of CN116757270A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/0985 Hyperparameter optimisation; Meta-learning; Learning-to-learn

Abstract

The application provides a data processing method and a server based on a human-computer interaction model or a large model. The method generates a plurality of pieces of reply information for an input sample through a pre-trained human-computer interaction model and obtains the output probability of each piece of reply information; the plurality of pieces of reply information of the input sample are then ranked by reply quality, and the parameters of the pre-trained human-computer interaction model are optimized according to the ranking result of the plurality of pieces of reply information and the output probability of each piece of reply information. Because the model parameters are trained and optimized through supervised learning based on ranking results that reflect human preferences over the plurality of pieces of reply information for the same input sample, the method is simpler, more efficient and more stable than training by reinforcement learning: a human-computer interaction model aligned with human preferences can be obtained quickly and effectively, and the degree of alignment between the reply information generated by the human-computer interaction model and human preferences is improved, so that the quality of the reply information generated by the human-computer interaction model is improved and the human-computer interaction quality of the artificial intelligence system is improved.

Description

Data processing method and server based on man-machine interaction model or large model
Technical Field
The present application relates to computer technologies, and in particular, to a data processing method and a server based on a human-computer interaction model or a large model.
Background
Human preference alignment is a technology that has received increasing attention in recent years; its goal is to align the output of artificial intelligence (Artificial Intelligence, AI) systems based on large-scale pre-trained models with human values. If an artificial intelligence system runs counter to human values, it can easily produce erroneous output, causing harmful effects such as damage to human interests or even escape from control; for example, the artificial intelligence system may output information that conforms to language rules but is distorted, or even produce discriminatory language. With the rapid evolution of large model technology, the outline of general artificial intelligence has begun to take shape. The next and most important step is to 'align' large models with human preferences in the real world.
At present, reinforcement learning from human feedback (Reinforcement Learning from Human Feedback, RLHF for short) is generally adopted: a reward model (Reward Model, RM for short) is trained, the reward model is used to generate reward values for a plurality of replies produced by the pre-trained model for the same input, and the parameters of the pre-trained model are then optimized according to the reward values using a reinforcement learning method. However, training by reinforcement learning is complex and unstable, so human preference alignment of the human-computer interaction model is time-consuming and prone to failure, a human-computer interaction model aligned with human preferences is difficult to obtain quickly and effectively, and the quality of the reply information generated by the human-computer interaction model is poor, which makes the human-computer interaction quality of the artificial intelligence system poor.
Disclosure of Invention
The application provides a data processing method and a server based on a human-computer interaction model or a large model, so as to solve the problem that the human-computer interaction quality of an artificial intelligence system is poor because the quality of the reply information generated by the human-computer interaction model is poor.
In a first aspect, the present application provides a data processing method based on a human-computer interaction model, including:
acquiring a pre-trained human-computer interaction model and an input sample; generating a plurality of pieces of reply information of the input sample through the human-computer interaction model, and acquiring the output probability of each piece of reply information; ranking the plurality of pieces of reply information of the input sample according to reply quality to obtain a ranking result of the plurality of pieces of reply information of the input sample; and optimizing parameters of the human-computer interaction model according to the ranking result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information, wherein the human-computer interaction model is used for generating reply information according to the input information of a user.
In a second aspect, the present application provides a data processing method of a large model, applied to a server, including:
obtaining a pre-trained large model; acquiring an input sample of the currently applied vertical field; generating a plurality of pieces of reply information of the input sample through the pre-trained large model, and acquiring the output probability of each piece of reply information; ranking the plurality of pieces of reply information of the input sample according to reply quality to obtain a ranking result of the plurality of pieces of reply information of the input sample; and optimizing parameters of the pre-trained large model according to the ranking result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information to obtain a large model of the vertical field, wherein the large model of the vertical field is applied to a human-computer interaction system of the vertical field and is used for generating reply information according to input information.
In a third aspect, the present application provides a data processing method based on a human-computer interaction model, applied to a server, including:
receiving a training request for an initial large model sent by an end-side device; pre-training the initial large model to obtain a pre-trained large model; generating a plurality of pieces of reply information of an input sample through the pre-trained large model, and acquiring the output probability of each piece of reply information; ranking the plurality of pieces of reply information of the input sample according to reply quality to obtain a ranking result of the plurality of pieces of reply information of the input sample; optimizing parameters of the pre-trained large model according to the ranking result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information to obtain optimized model parameters; and sending the optimized model parameters to the end-side device.
In a fourth aspect, the present application provides a data processing method based on a man-machine interaction model, applied to an end-side device, including:
sending a training request for an initial large model to a server; receiving the optimized model parameters of the initial large model sent by the server, wherein the optimized model parameters are obtained by pre-training the initial large model to obtain a pre-trained large model, generating a plurality of pieces of reply information of an input sample through the pre-trained large model, acquiring the output probability of each piece of reply information, ranking the plurality of pieces of reply information of the input sample according to reply quality to obtain a ranking result of the plurality of pieces of reply information of the input sample, and optimizing parameters of the pre-trained large model according to the ranking result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information; updating the model parameters of the initial large model according to the optimized model parameters to obtain a trained large model; and in response to input information of a user, generating reply information of the input information through the trained large model, and outputting the reply information of the input information.
In a fifth aspect, the present application provides a server comprising: a processor, and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes computer-executable instructions stored by the memory to implement the method of the first, second or third aspects.
According to the data processing method and the server based on the human-computer interaction model or the large model, a plurality of pieces of reply information of an input sample are generated through the pre-trained human-computer interaction model, and the output probability of each piece of reply information is acquired; the plurality of pieces of reply information of the input sample are ranked according to reply quality to obtain a ranking result of the plurality of pieces of reply information of the input sample; and the parameters of the pre-trained human-computer interaction model are optimized according to the ranking result and the output probability of each piece of reply information. Because the model parameters are optimized by a supervised-learning training method based on ranking results that reflect human preferences over the plurality of pieces of reply information for the same input sample, a human-computer interaction model aligned with human preferences can be obtained quickly and effectively compared with training by reinforcement learning, and the degree of alignment between the reply information generated by the human-computer interaction model and human preferences is improved, so that the quality of the reply information generated by the human-computer interaction model is improved and the human-computer interaction quality of the artificial intelligence system is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the present application is applicable;
FIG. 2 is a flowchart of a data processing method based on a human-computer interaction model according to an exemplary embodiment of the present application;
FIG. 3 is an exemplary diagram of pre-training model optimization provided by an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a data processing method based on a human-computer interaction large model according to another exemplary embodiment of the present application;
FIG. 5 is a flowchart of a data processing method based on a human-computer interaction large model according to another exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of another example system architecture to which the present application applies;
FIG. 7 is a flowchart of a method for large model-based data processing according to an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus based on a human-computer interaction model according to an exemplary embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
First, the terms involved in the present application will be explained:
human preference alignment: is a technology that has been increasingly emphasized in recent years, and aims to align the output of artificial intelligence AI systems based on large-scale pre-training models with human value. If the artificial intelligence is against the human value, erroneous output is easily generated, which causes bad effects such as damage to human interests and even deviation from control, for example, the artificial intelligence system outputs information conforming to language rules but distorted, and even gives out discrimination language.
Pre-training language model: a language model obtained by pre-training a large-scale language model (Large Language Model, abbreviated as LLM).
Rank learning (Learning to Rank): the goal is to automatically learn a ranking function from training data.
BLEU (Bilingual Evaluation Understudy): a set of evaluation metrics for machine translation tasks.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a set of evaluation metrics used in natural language processing fields such as machine translation, automatic summarization, and question-answer generation. ROUGE obtains a score by comparing the summary or answer generated by the model with a reference answer (typically manually annotated).
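As a purely illustrative sketch of how such reference-based metrics are computed (not part of the claimed method), the following Python snippet scores a candidate reply against a reference using the NLTK implementation of sentence-level BLEU; the sentences and the choice of NLTK are assumptions for illustration only.

```python
# Illustrative only: scoring a generated reply against a reference with BLEU.
# Assumes the NLTK package is installed; the texts are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()          # manually written reference answer
candidate = "the cat is sitting on the mat".split()   # model-generated reply

# sentence_bleu expects a list of tokenized references and one tokenized candidate.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.5, 0.5),                                # up to 2-gram precision
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2 = {score:.3f}")
```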
Visual question-answering task: from the input image and the question, an answer to the question is determined from visual information of the input image.
Image description task: descriptive text of the input image is generated.
Visual entailment task: predicts the semantic relationship between the input image and the input text, namely entailment, neutral, or contradiction.
Referring expression comprehension task: locates, according to the input text, the image region in the input image that corresponds to the input text.
Image generation tasks: an image is generated based on the entered descriptive text.
Text-based emotion classification tasks: emotion classification information of the input text is predicted.
Text summarization task: summary information of the input text is generated.
Multimodal tasks: downstream tasks whose input/output data involves data of multiple modalities such as images and text, for example the visual question-answering task, the image description task, the visual entailment task, the referring expression comprehension task, the image generation task, and so on.
Multimodal pre-training model: a pre-trained model whose input and output data involve data of multiple modalities such as images and text; after fine-tuning, the pre-trained model can be applied to multimodal task processing.
Vertical field: an internet-industry term referring to a segmented area that provides specific services to a defined group, including entertainment, medical, environmental, educational, sports, and other areas. A vertical field is a smaller domain divided vertically under a larger domain.
Large model: a deep-learning model with large-scale model parameters, typically containing hundreds of millions or even hundreds of billions of parameters. A large model may also be called a foundation model (Foundation Model): it is trained with a large-scale unlabeled corpus to produce a pre-trained model with more than one hundred million parameters, which can adapt to a wide range of downstream tasks and has good generalization ability; examples include large-scale language models and multimodal pre-training models.
In practical applications of large models, the pre-trained model can be adapted to different tasks by fine-tuning with only a small number of samples. Large models can be widely applied in fields such as natural language processing (Natural Language Processing, NLP for short) and computer vision, and in particular to computer-vision tasks such as visual question answering (Visual Question Answering, VQA for short), image description (IC for short) and image generation, as well as natural-language-processing tasks such as text-based emotion classification, text summary generation and machine translation. The main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, electronic commerce, intelligent design, and the like.
In recent years, large models for human-computer interaction (e.g., LLMs, multimodal pre-training models, etc.) have shown an impressive ability to generate diverse text based on input information from human users. However, evaluation of the generated results is subjective and context dependent; for example, the user may want the model to generate a creative story, a piece of factual informative text, or an executable code segment, and such results are difficult to measure with existing rule-based text-generation metrics (e.g., BLEU and ROUGE). Moreover, beyond evaluation metrics, existing models are typically trained by predicting the next word with a simple loss function (such as cross entropy), without explicitly introducing human preferences and subjective opinions. The rapid evolution of large model technology has produced the outline of general artificial intelligence, and the next and most important step is to 'align' large models with human preferences in the real world. In the process of self-learning and self-iteration of a large model for human-computer interaction, humans need to participate so that the large model stays consistent with human values and ways of thinking; otherwise, the output of the large model may drift far from human preferences.
At present, reinforcement learning from human feedback (Reinforcement Learning from Human Feedback, RLHF for short) is generally adopted: a reward model (Reward Model, RM for short) is trained, the reward model is used to generate reward values for a plurality of replies produced by the pre-trained large model for the same input, and the parameters of the pre-trained large model are optimized based on the discrete reward values, thereby achieving human preference alignment of the pre-trained large model. However, training by reinforcement learning is complex and unstable, so human preference alignment of the large model is time-consuming and prone to failure, a large model aligned with human preferences is difficult to obtain quickly and effectively, the quality of the reply information generated by the large model is poor, and the human-computer interaction quality of the artificial intelligence system is therefore poor.
The application provides a data processing method based on a human-computer interaction model, which is used to perform human preference alignment on a pre-trained human-computer interaction model. Specifically, a plurality of pieces of reply information of an input sample are generated through the pre-trained human-computer interaction model, and the output probability of each piece of reply information is acquired; the plurality of pieces of reply information of the input sample are ranked according to reply quality to obtain a ranking result of the plurality of pieces of reply information of the input sample; and the parameters of the human-computer interaction model are optimized according to the ranking result and the output probability of each piece of reply information. Because the model parameters are optimized by a supervised-learning training method based on ranking results that reflect human preferences over the plurality of pieces of reply information for the same input sample, a human-computer interaction model aligned with human preferences can be obtained more simply, efficiently and stably, the degree of alignment of the human-computer interaction model with human preferences is improved, the quality of the reply information generated by the human-computer interaction model is improved, and the human-computer interaction quality of the artificial intelligence system is improved. In addition, by comparing different pieces of reply information for the same input sample, the training samples are used more efficiently.
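For orientation only, the overall procedure described above can be summarized in the following Python-style sketch; the function names and signatures are hypothetical placeholders that stand in for the concrete steps detailed in the embodiments below, and the callables are supplied by the caller.

```python
# High-level sketch of the alignment procedure described above. The callables
# generate_replies, score_replies and ranking_loss are supplied by the caller and
# correspond to the concrete steps detailed in the embodiments below.
def align_with_human_preferences(model, input_samples, optimizer,
                                 generate_replies, score_replies, ranking_loss,
                                 num_replies=4):
    for prompt in input_samples:
        # 1. Generate several replies for the same input sample, together with
        #    the output probability of each reply.
        replies, output_probs = generate_replies(model, prompt, num_replies)

        # 2. Rank the replies by reply quality (manual annotation, a ranking model,
        #    or an evaluation/reward model), best reply first.
        scores = score_replies(prompt, replies)
        order = sorted(range(len(replies)), key=lambda i: scores[i], reverse=True)

        # 3. Supervised ranking loss computed from the ordering and the output
        #    probabilities, followed by an ordinary parameter update.
        loss = ranking_loss([output_probs[i] for i in order],
                            [scores[i] for i in order])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```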
The method can be applied to human preference alignment of pre-trained models used for human-computer interaction (i.e., pre-trained human-computer interaction models), such as pre-trained language models (e.g., a pre-trained LM or a pre-trained LLM) and multimodal pre-training models. The pre-trained human-computer interaction model can in particular be applied to natural-language-processing tasks such as text-based emotion classification, text summary generation and machine translation, and the main application scenarios include digital assistants, intelligent robots, search, online education, office software, electronic commerce, intelligent design, and the like. In addition, when a large model is applied in a specific vertical field, the application can be used to further train the large model based on training data of the current vertical field so as to optimize its human preference alignment capability. For example, a medical large model, a traffic large model, or an enterprise-level large model that multiple enterprises/organizations can access and use may be trained.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the present application is applicable. As shown in fig. 1, the system architecture includes a server and an end-side device. The server and the end side equipment are provided with a communication link capable of communicating, so that communication connection between the server and the end side equipment can be realized.
The server may be a server cluster deployed in the cloud, or a device with computing capabilities locally. The server may obtain and store a pre-trained human-machine interaction model to be optimized for human preference alignment and pre-obtained input samples. The server is responsible for generating a plurality of pieces of reply information of an input sample through the pre-training human-computer interaction model, acquiring output probability of each piece of reply information, sequencing the plurality of pieces of reply information of the input sample according to reply quality to obtain sequencing results of the plurality of pieces of reply information of the input sample, and optimizing parameters of the pre-training human-computer interaction model according to the sequencing results of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information to realize human preference alignment of the pre-training human-computer interaction model. The server sends the model parameters after the human preference alignment optimization to the end-side device.
The terminal side device is an electronic device used by a user, and specifically may be a hardware device having a network communication function, an operation function and an information display function, including but not limited to a smart phone, a tablet computer, a desktop computer, a server, and the like. A user refers to a person or organization requesting pre-training and human preference alignment of a human-machine interaction model owned by the user or human preference alignment of a pre-trained human-machine interaction model owned by the user. The user provides an initial human-computer interaction model or a pre-training human-computer interaction model to the server through the terminal side equipment, and receives optimized model parameters from the server. The terminal side equipment updates model parameters of the initial human-computer interaction model based on the optimized model parameters to obtain a trained human-computer interaction model, and generates and outputs reply information of input information through the trained human-computer interaction model to realize various human-computer interaction functions.
In one example scenario, a server pre-trains a given initial human-machine interaction model and optimizes the pre-trained human-machine interaction model for human preference alignment to obtain a trained human-machine interaction model. The given initial man-machine interaction model can be a man-machine interaction model owned by a platform to which the server belongs, or can be a man-machine interaction model provided by the terminal side equipment. Illustratively, the user may upload the initial human-machine interaction model to the server through the end-side device, or provide a download address of the initial human-machine interaction model to the server, so that the server downloads and stores the initial human-machine interaction model based on the download address. The method comprises the steps that a server performs pre-training on an initial human-computer interaction model to obtain a pre-training human-computer interaction model, generates a plurality of pieces of reply information of an input sample through the pre-training human-computer interaction model, and obtains output probability of each piece of reply information; sequencing a plurality of pieces of reply information of the input sample according to the reply quality to obtain sequencing results of the plurality of pieces of reply information of the input sample; optimizing parameters of a pre-training human-computer interaction model according to the sequencing result of a plurality of pieces of reply information of an input sample and the output probability of each piece of reply information to obtain optimized model parameters; and sending the optimized model parameters to the terminal equipment. And the end-side equipment receives the optimized model parameters, and updates the model parameters of the initial human-computer interaction model according to the optimized model parameters to obtain a trained human-computer interaction model. Furthermore, the terminal side equipment can generate and output reply information of the input information through the trained man-machine interaction model, and the man-machine interaction function based on the man-machine interaction model is realized.
In a second example scenario, the server performs optimization of human preference alignment for a given pre-trained human-machine interaction model, resulting in a trained human-machine interaction model. The given pre-training human-computer interaction model can be obtained by pre-training the initial human-computer interaction model by a platform to which the server belongs, or can be a pre-training human-computer interaction model provided by the terminal side equipment. Illustratively, the user may upload the pre-trained human-machine interaction model to the server through the end-side device, or provide a download address of the pre-trained human-machine interaction model to the server, such that the server downloads and stores the pre-trained human-machine interaction model based on the download address. The server generates a plurality of pieces of reply information of an input sample through a pre-training man-machine interaction model, and obtains output probability of each piece of reply information; sequencing a plurality of pieces of reply information of the input sample according to the reply quality to obtain sequencing results of the plurality of pieces of reply information of the input sample; optimizing parameters of a pre-training human-computer interaction model according to the sequencing result of a plurality of pieces of reply information of an input sample and the output probability of each piece of reply information to obtain optimized model parameters; and sending the optimized model parameters to the terminal equipment. And the end-side equipment receives the optimized model parameters, and updates the model parameters of the pre-training human-computer interaction model according to the optimized model parameters to obtain a trained human-computer interaction model. Furthermore, the terminal side equipment can generate and output reply information of the input information through the trained man-machine interaction model, and the man-machine interaction function based on the man-machine interaction model is realized.
The process of pre-training the initial human-machine interaction model is shown in fig. 1 by the dashed arrow as optional, e.g. in the second example scenario described above, the server is only used for optimization of human preference alignment for a given pre-trained human-machine interaction model, and the server does not need to perform the pre-training process. The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a data processing method based on a human-computer interaction model according to an exemplary embodiment of the present application. The execution body of the embodiment is a server in the system architecture. As shown in fig. 2, the method specifically comprises the following steps:
step S201, a pre-training human-computer interaction model and an input sample are obtained.
In this step, optionally, the server may obtain a pre-trained human-machine interaction model provided by the end-side device.
Illustratively, a server acquires a pre-trained human-computer interaction model uploaded by an end-side device; or the server acquires the download address of the pre-training man-machine interaction model provided by the terminal side equipment, and downloads the pre-training man-machine interaction model according to the download address of the pre-training man-machine interaction model.
In this step, optionally, the server may further receive an initial human-computer interaction model provided by the terminal device, and pretrain the initial human-computer interaction model to obtain a pretrained human-computer interaction model.
The server obtains an initial man-machine interaction model uploaded by the end-side device, or obtains a download address of the initial man-machine interaction model provided by the end-side device, and downloads the initial man-machine interaction model according to the download address of the initial man-machine interaction model.
According to the scheme, the trained man-machine interaction model applied to specific tasks can be obtained, and the method can be particularly applied to natural language processing field tasks such as emotion classification based on texts, text abstract generation, machine translation and the like, and main application scenes comprise digital assistants, intelligent robots, searching, online education, office software, electronic commerce, intelligent design and the like. The input information of the human-computer interaction model may be different for different tasks, and the input information of the human-computer interaction model (i.e., input prompt) is generated based on a prompt (prompt) template under a specific task and a user input instruction.
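As a purely illustrative sketch (the template wording and field names are assumptions, not taken from the application), an input prompt for a specific task can be assembled from a task prompt template and a user input instruction roughly as follows:

```python
# Illustrative only: building an input prompt from a task-specific template
# and a user instruction. The template wording is a hypothetical example.
SUMMARIZATION_TEMPLATE = (
    "You are a helpful assistant. Summarize the following text in one paragraph.\n"
    "Text: {user_input}\n"
    "Summary:"
)

def build_input_prompt(template: str, user_input: str) -> str:
    """Fill the prompt template with the user's input instruction/content."""
    return template.format(user_input=user_input)

prompt = build_input_prompt(SUMMARIZATION_TEMPLATE, "Large models are ...")
```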
In the step, input samples in the public data set of the man-machine interaction model can be collected based on a general prompt template of the man-machine interaction model, and input samples which are manually marked or manually written based on the general prompt template can be obtained.
Optionally, the pre-trained human-computer interaction model obtained in the step may also be a human-computer interaction model after fine tuning using training data of a specific task. In the step, based on a prompt template of a specific task applied by the human-computer interaction model, input samples of the human-computer interaction model under the specific task can be collected, wherein the input samples can be input information generated in historical application or input information which is manually marked or written.
In this embodiment, a specific obtaining manner of an input sample used for performing human preference alignment on a pre-training human-computer interaction model is consistent with a manner of obtaining an input sample used in a pre-training stage or a fine-tuning stage of the pre-training human-computer interaction model, which is not described herein.
Step S202, generating a plurality of pieces of reply information of an input sample through a pre-trained man-machine interaction model, and acquiring output probability of each piece of reply information.
In this step, an input sample is input into the pre-trained human-computer interaction model, and a plurality of pieces of reply information of the same input sample are generated and output by the pre-trained human-computer interaction model. For example, the same input sample may be input into the pre-trained human-computer interaction model multiple times, and the reply information of the input sample is generated multiple times through the pre-trained human-computer interaction model, so as to obtain a plurality of different pieces of reply information for the input sample. In addition, while outputting the reply information, the pre-trained human-computer interaction model can output the output probability information of the terms contained in the reply information. In this step, the output probability of a piece of reply information is determined according to the output probability information of the terms contained in that reply information as output by the pre-trained human-computer interaction model. Optionally, in this step, for any piece of reply information, the server may calculate the logarithm of the product of the output probabilities of the terms contained in the reply information to obtain the output probability of the reply information.
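As one possible concrete realization (an assumption, not mandated by the application), a Hugging Face transformers-style causal language model can produce several sampled replies for the same input together with the per-token scores needed later; exact argument names may differ between library versions.

```python
# Sketch: sampling several different replies for the same input sample and keeping
# the per-token scores needed to compute each reply's output probability.
# Assumes a Hugging Face transformers causal LM; API details may differ by version.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # placeholder model name
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "How can I encourage my child to enjoy studying?"  # hypothetical input sample
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,              # sampling yields a different reply on each draw
        num_return_sequences=4,      # several replies for the same input sample
        max_new_tokens=64,
        return_dict_in_generate=True,
        output_scores=True,
    )

# Log-probability of each generated token, one row per sampled reply.
token_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
replies = tokenizer.batch_decode(out.sequences[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
```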
Illustratively, let sent_i denote any piece of reply information output by the pre-trained human-computer interaction model, s_i denote the output probability of the reply information sent_i, and len(sent_i) denote the length of the reply information sent_i, i.e., the number of terms it contains. The output probability s_i of the reply information sent_i can be calculated as:

s_i = (1 / len(sent_i)) · log ∏_{k=1}^{len(sent_i)} p(x_k | x_1, x_2, …, x_{k-1})

where x_k denotes the k-th term contained in the reply information sent_i, and p(x_k | x_1, x_2, …, x_{k-1}) denotes the output probability of the k-th term x_k contained in the reply information sent_i.
Alternatively, in this step, for any one of the reply information, the server may calculate the sum of the logarithms of the output probabilities of terms contained in the reply information, resulting in the output probability of the reply information.
Illustratively, let sent_i denote any piece of reply information output by the pre-trained human-computer interaction model, s_i denote the output probability of the reply information sent_i, and len(sent_i) denote the length of the reply information sent_i, i.e., the number of terms it contains. The output probability s_i of the reply information sent_i can be calculated as:

s_i = (1 / len(sent_i)) · Σ_{k=1}^{len(sent_i)} log p(x_k | x_1, x_2, …, x_{k-1})

where x_k denotes the k-th term contained in the reply information sent_i, and p(x_k | x_1, x_2, …, x_{k-1}) denotes the output probability of the k-th term x_k contained in the reply information sent_i.
In addition, in this step, for any reply information, the server may calculate a product of output probabilities of terms included in the reply information, to obtain an output probability of the reply information.
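A minimal sketch of the three scoring variants just described (length-normalized log of the product, length-normalized sum of logs, and plain product), assuming the per-term output probabilities have already been extracted from the model and assuming the length-normalized form reconstructed above:

```python
import math

def reply_output_probability(token_probs, mode="log_product"):
    """Compute the output probability s_i of one reply from the output
    probabilities p(x_k | x_1, ..., x_{k-1}) of its terms.

    token_probs: list of per-term probabilities, e.g. [0.31, 0.12, 0.56, ...]
    mode:
      "log_product" - length-normalized log of the product of term probabilities
      "sum_of_logs" - length-normalized sum of the logs of term probabilities
      "product"     - plain product of the term probabilities
    """
    n = len(token_probs)
    if mode == "log_product":
        return math.log(math.prod(token_probs)) / n
    if mode == "sum_of_logs":
        return sum(math.log(p) for p in token_probs) / n
    if mode == "product":
        return math.prod(token_probs)
    raise ValueError(f"unknown mode: {mode}")

# The first two variants are mathematically equivalent; the second is numerically safer.
s_i = reply_output_probability([0.31, 0.12, 0.56, 0.44], mode="sum_of_logs")
```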
Step S203, sorting the plurality of reply information of the input sample according to the reply quality, to obtain a sorting result of the plurality of reply information of the input sample.
After the plurality of pieces of reply information of the input sample are acquired, the plurality of pieces of reply information of the same input sample can be ranked in descending order of reply quality to obtain a ranking result of the plurality of pieces of reply information of the same input sample.
In an alternative embodiment, in step S203, the multiple reply information of the same input sample may be manually sorted by the labeling personnel, and the sorting result of the multiple reply information of the same input sample may be obtained. The method can be realized in the following way:
outputting a plurality of pieces of reply information of the input samples through the interactive interface; a result of ordering of the plurality of reply information of the input sample specified within the interactive interface is received.
Optionally, in another optional implementation manner, an initial sorting result of the plurality of reply information of the input sample may also be output through the interactive interface. Based on the interactive interface, the annotator can adjust the initial ranking results. And responding to the adjustment operation of the initial sequencing result in the interactive interface, and acquiring the adjusted sequencing result.
The server provides an interactive interface through which a plurality of reply messages of the same input sample are output. The annotator can sort multiple pieces of reply information for the same input sample displayed on the interactive interface. The server receives a result of ordering of a plurality of reply information for the same input sample specified within the interactive interface. For example, a plurality of different reply messages of the same input sample on the interactive interface may respectively correspond to different display areas, and the annotator may change the sorting result of the respective reply messages by dragging the positions of the display areas of the respective reply messages. In addition, the annotator can also change the arrangement order of the reply information by inputting the arrangement order of the reply information, and the position of the display area of the reply information in the interactive interface can be automatically adjusted along with the change of the arrangement order of the reply information.
In another alternative embodiment, this step may be implemented in particular as follows: multiple pieces of reply information of the same input sample can be input into the sorting model, and the sorting model sorts the multiple pieces of reply information of the same input sample to obtain a sorting result.
The ranking model may be a pre-trained text ranking model whose training data includes an input sample, a plurality of pieces of reply information of the input sample, and a manually annotated ranking result of the plurality of pieces of reply information. The manually annotated ranking result reflects human preferences over the plurality of pieces of reply information for the same input sample. A ranking model trained on such training data can rank a plurality of pieces of reply information of the same input sample, and the resulting ranking is aligned with human preferences.
In another optional embodiment, in this step, the pre-trained evaluation model may be used to evaluate the reply quality of the plurality of reply information of the same input sample, to obtain an evaluation value of each reply information, and the reply information is ranked based on the evaluation value, so as to obtain a ranking result of the plurality of reply information of the same input sample.
Specifically, a plurality of pieces of answer information of the same input sample are input into a pre-trained evaluation model, and evaluation values of the answer information are output through the evaluation model; and sequencing the plurality of pieces of reply information of the input sample according to the evaluation value to obtain sequencing results of the plurality of pieces of reply information of the input sample. For example, the plurality of pieces of answer information of the input sample are sorted from high to low according to the evaluation value, and the sorting result of the plurality of pieces of answer information of the input sample is obtained.
Alternatively, the evaluation model is a pre-trained text quality evaluation model, and the training data used includes an input sample, answer information of the input sample, and an evaluation value of the answer information that is manually labeled. Wherein, the evaluation value of the reply information marked by the personnel reflects the preference degree of the human on the reply information. A higher evaluation value indicates that the human prefers the reply information, indicating that the quality of the reply information is higher. A lower evaluation value indicates that the human beings do not prefer the reply information, indicating that the quality of the reply information is lower. Based on the evaluation model obtained by training the training data, the answer quality of the answer information of the input sample can be evaluated accurately. And based on the evaluation model, sorting the answer information based on the evaluation values of the answer information of the same input sample, wherein the obtained sorting result can be aligned with human preference.
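The following sketch shows one way to rank several replies to the same input sample by the scores of a pre-trained evaluation/reward model; the scoring function is abstracted as a callable because the application does not fix a particular model architecture, and the example scorer is a hypothetical stand-in.

```python
from typing import Callable, List, Tuple

def rank_replies_by_quality(
    prompt: str,
    replies: List[str],
    evaluate: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each reply with a pre-trained evaluation/reward model and return
    the replies sorted from highest to lowest evaluation value."""
    scored = [(reply, evaluate(prompt, reply)) for reply in replies]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored

# Usage (the lambda is a stand-in for a real evaluation/reward model):
ranked = rank_replies_by_quality(
    "How can I encourage my child to enjoy studying?",
    ["Force them to study.", "Make learning playful and reward curiosity."],
    evaluate=lambda prompt, reply: float(len(reply)),  # hypothetical scorer
)
```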
Alternatively, the pre-trained evaluation model may be a reward model. In this step, a plurality of pieces of reply information of the same input sample are input into the reward model, and the reward values corresponding to the pieces of reply information are output by the reward model; the reward value corresponding to a piece of reply information reflects the degree of human preference for that reply information. A higher reward value indicates that humans prefer the reply information, indicating that the quality of the reply information is higher; a lower reward value indicates that humans do not prefer the reply information, indicating that the quality of the reply information is lower.
The pre-trained reward model in this embodiment may specifically be obtained by training the reward model (Reward Model, RM) used in reinforcement learning from human feedback (RLHF). Illustratively, the reward model may be based on a fine-tuned human-computer interaction model (e.g., a fine-tuned language model LM), or trained on the basis of a human-computer interaction model (e.g., a language model LM) trained with human preference annotation data. The training data used may be generated by sampling a predefined public data set, or may be data samples generated using an application/tool for human-computer interaction. When constructing the training data, one human-computer interaction model, or several human-computer interaction models of different fine-tuned versions, may be used to generate a plurality of reply texts for the same input. The plurality of reply texts for the same input are ranked manually, and a relative evaluation value (such as an Elo rating) of each reply text is calculated from the ranking result and used as the reward value of that reply text. Training data containing input - reply text - reward value triples can thus be constructed, and a reward model is trained based on this training data. The trained reward model can output the reward value of each of a plurality of pieces of reply information for the same input sample.
Step S204, optimizing parameters of a pre-trained man-machine interaction model according to the sorting result of a plurality of pieces of reply information of an input sample and the output probability of each piece of reply information, wherein the man-machine interaction model is used for generating reply information according to user input information.
After obtaining the sorting results of the plurality of pieces of reply information of the same input sample, in this step, the server calculates the overall loss according to the sorting results of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information; and optimizing parameters of the human-computer interaction model according to the total loss.
Specifically, the server calculates the total loss according to the sorting result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information, and the method can be specifically implemented as follows:
according to the sorting results of the plurality of pieces of reply information of the input sample, respectively calculating sub-losses corresponding to the first n-1 pieces of reply information in the sorting results, wherein the sub-loss corresponding to any piece of reply information aims at improving the output probability of the current reply information, and inhibiting the output probability of the reply information arranged behind the current reply information, and n is the number of pieces of reply information of the input sample; and calculating the sum of sub-losses corresponding to the first n-1 pieces of reply information to obtain the total loss.
For example, when calculating sub-losses corresponding to the first n-1 pieces of reply information in the sorting result according to the sorting result of the plurality of pieces of reply information of the input sample, the server may determine, for any piece of reply information ranked in the first n-1 bits in the sorting result, an improvement item of the sub-losses according to the output probability of the current reply information according to the sorting result of the plurality of pieces of reply information of the input sample; determining a sub-loss suppression item according to the current reply information and the output probability of the reply information arranged behind the current reply information; and calculating the sub-loss corresponding to the current reply information according to the ratio of the lifting item and the suppressing item.
Specifically, for reply information of rank order i (i < n), a promotion item (also referred to as bonus information or bonus item) of a sub-loss of the current reply information may be determined according to the output probability of reply information of rank order i, a suppression item (also referred to as penalty information or penalty item) of the loss is determined according to the output probability of reply information of rank order greater than or equal to i, and the sub-loss of the current reply information is generated, so that the output probability of reply information ranked after the current reply information is minimized by rank comparison while the output probability of the current reply information is maximized, thereby realizing that the output probability of each reply information should be greater than the output probability of each reply information ranked after it. Wherein i is any integer greater than or equal to 1 and less than n.
In an alternative embodiment, according to the sorting result of the plurality of reply information of the input sample, calculating the sub-loss corresponding to the reply information with the sorting order of i in the sorting result may be implemented in the following manner:
according to the arrangement order of a plurality of pieces of reply information of an input sample in an ordering result, for reply information with the arrangement order of i, taking an exponential function value of the output probability of the current reply information (i.e. the reply information with the arrangement order of i) as a lifting term, taking the sum of the exponential function values of the output probabilities of the reply information with the arrangement order of greater than or equal to i as a suppression term, calculating the opposite number of the logarithm of the ratio of the lifting term to the suppression term, and obtaining the sub-loss corresponding to the current reply information.
Specifically, the sub-loss corresponding to the reply information with rank order i can be calculated by the following formula (1):

l_i = -log( exp(s_i) / Σ_{j≥i} exp(s_j) )    (1)

where l_i denotes the sub-loss corresponding to the reply information with rank order i, s_i denotes the output probability of the reply information with rank order i, s_j denotes the output probability of the reply information with rank order j (j ≥ i), and exp() is the exponential function with the natural constant e as its base.
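A minimal PyTorch-style sketch of the sub-loss in formula (1) and of the total loss over the first n-1 replies (an implementation choice, not specified by the application), assuming `scores` holds the output probabilities s_1, …, s_n already arranged in ranking order with the best reply first:

```python
import torch

def ranking_sub_loss(scores: torch.Tensor, i: int) -> torch.Tensor:
    """Formula (1): l_i = -log( exp(s_i) / sum_{j >= i} exp(s_j) ).

    scores: tensor of output probabilities s_1..s_n, already in ranking order
            (index 0 = best reply). i is a 0-based rank index here.
    """
    suppression = torch.logsumexp(scores[i:], dim=0)   # log of sum_{j>=i} exp(s_j)
    return -(scores[i] - suppression)                  # -log(lifting / suppression)

def total_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """Sum of the sub-losses for the first n-1 replies in the ranking."""
    n = scores.shape[0]
    return torch.stack([ranking_sub_loss(scores, i) for i in range(n - 1)]).sum()

# Example: 4 replies, s_i already computed from the model (hypothetical values).
scores = torch.tensor([-0.8, -1.1, -1.7, -2.3], requires_grad=True)
loss = total_ranking_loss(scores)   # differentiable w.r.t. the model outputs
```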
Alternatively, the sub-loss corresponding to the reply information in the order of i may also be determined by: the output probability of the reply information with the arrangement order of i is used as a promotion item, the sum of the output probabilities of the reply information with the arrangement order of i or more is used as a suppression item, and the ratio of the promotion item to the suppression item is used as a sub-loss corresponding to the reply information with the arrangement order of i.
In an alternative embodiment, when determining the suppression item of the sub-loss according to the current reply information and the output probability of the reply information arranged after the current reply information, the server may determine the penalty coefficient corresponding to the reply information arranged after the current reply information according to the difference between the evaluation value of the current reply information and the evaluation value of the reply information arranged after the current reply information; and determining the suppression item of the sub-loss according to the current reply information, the output probability of the reply information arranged after the current reply information and the penalty coefficient. Illustratively, according to the sorting result of the plurality of pieces of reply information of the input sample, calculating the sub-loss corresponding to the reply information with the sorting order of i in the sorting result can be specifically implemented in the following manner:
according to the rank order of the plurality of pieces of reply information of the input sample in the ranking result, for the reply information with rank order i, the exponential function value of the output probability of the current reply information (i.e., the reply information with rank order i) is taken as the lifting term; for each piece of reply information with rank order greater than or equal to i, the inverse of the sum of a preset smoothing term and the difference between the evaluation value of the current reply information and the evaluation value of that reply information is taken as the penalty coefficient of that reply information; the ratio of the output probability of each piece of reply information with rank order greater than or equal to i to its penalty coefficient is calculated as the penalty information corresponding to that reply information; the sum of the exponential function values of the penalty information corresponding to each piece of reply information with rank order greater than or equal to i is taken as the suppression term; and the negative logarithm of the ratio of the lifting term to the suppression term is calculated to obtain the sub-loss corresponding to the current reply information.
Specifically, the sub-loss corresponding to the current reply information can be calculated by the following formula (2):

l_i = -log( exp(s_i) / Σ_{j≥i} exp(s_j / τ_{i,j}) ),  with τ_{i,j} = 1 / (r_i - r_j + a)    (2)

where r_i denotes the evaluation value of the reply information with rank order i, r_j denotes the evaluation value of the reply information with rank order j, and j ≥ i indicates that rank order j is at or after rank order i. τ_{i,j} = 1 / (r_i - r_j + a) is the inverse of the (smoothed) difference between the evaluation value of the reply information with rank order i and the evaluation value of the reply information with rank order j, i.e., the penalty coefficient of the reply information with rank order j. s_j / τ_{i,j} is the ratio of the output probability of the reply information with rank order j to its penalty coefficient, i.e., the penalty information of the reply information with rank order j. a is a preset smoothing term that prevents the denominator of τ_{i,j} from being zero and typically takes a small value, e.g., 0.001 or 0.002.
In this embodiment, the inverse of the difference between the evaluation value of the reply information with rank order i and that of the reply information with rank order j is used as the penalty coefficient τ_{i,j} of the reply information with rank order j, and the output probability s_j in the denominator is divided by the penalty coefficient τ_{i,j}, i.e., becomes s_j × (r_i - r_j + a). In this way, the output probability s_j of the reply information with rank order j is adjusted based on the difference between its evaluation value and that of the reply information with rank order i: the smaller the evaluation value, the greater the degree to which that reply information is penalized.
Further, the sum of the sub-losses corresponding to the first n-1 pieces of reply information is calculated to obtain the total loss. Specifically, the total loss is L = Σ_{i=1}^{n-1} l_i, where n denotes the number of pieces of reply information of the input sample.
Optionally, when calculating the sum of the sub-losses corresponding to the first n-1 pieces of reply information to obtain the total loss, the evaluation values of the n pieces of reply information of the input information may be normalized to obtain normalized evaluation values of the n pieces of reply information; the normalized evaluation value of each piece of reply information is then used as the weight coefficient of the corresponding sub-loss, and the sub-losses corresponding to the first n-1 pieces of reply information are weighted and summed to obtain the total loss. Introducing the evaluation value of the reply information as a weight in the loss function measures the uncertainty of the evaluation model/reward model about the ranking: if the evaluation value of a piece of reply information is low, the model is more inclined neither to learn this reply information nor to learn the comparison between this reply information and the reply information ranked after it.
Specifically, based on formula (2), the sub-loss corresponding to the reply information with arrangement order i is multiplied by the corresponding weight coefficient through the following formula (3) to obtain the updated sub-loss of the reply information with arrangement order i:

l'_i = sigmoid(r_i) × l_i = -sigmoid(r_i) × log( exp(s_i) / Σ_{j≥i} exp( s_j × (r_i - r_j + a) ) )    (3)

where l'_i represents the sub-loss of the reply information with arrangement order i multiplied by its weight coefficient, sigmoid() is the normalization function, and sigmoid(r_i) represents the normalized evaluation value of the reply information with arrangement order i. The meaning of the other symbols in formula (3) is the same as that of the same symbols in the preceding formula and is not repeated here. Normalizing the evaluation values of the respective pieces of reply information reduces the influence of the evaluation values on the sub-loss, and thereby on the model parameters. Further, the sub-losses of the first n-1 pieces of reply information, each multiplied by its weight coefficient, are summed to obtain the total loss. Illustratively, the total loss L = Σ_{i=1}^{n-1} l'_i, where n represents the number of pieces of reply information of the input sample.
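As a further illustrative sketch (again not part of the patent text), the weighted total loss of formula (3) can be written as follows under the same assumptions (0-based rank indices, lists s and r ordered by arrangement order); sigmoid is the normalization function described above.

import math

def sigmoid(x):
    # normalization function applied to the evaluation value
    return 1.0 / (1.0 + math.exp(-x))

def weighted_total_loss(s, r, a=0.001):
    # Total loss of formula (3): sum of sigmoid(r_i) * l_i over the first n-1 replies.
    n = len(s)
    total = 0.0
    for i in range(n - 1):
        lift = math.exp(s[i])
        suppress = sum(math.exp(s[j] * (r[i] - r[j] + a)) for j in range(i, n))
        total += sigmoid(r[i]) * -math.log(lift / suppress)
    return total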
Illustratively, in connection with FIG. 3, taking the human-computer interaction process shown in FIG. 3 as an example, for the current round of input information "how old but not want children to learn", the pre-trained language model gives the 4 different pieces of reply information shown in FIG. 3. The 4 pieces of reply information are evaluated and ranked by evaluation value, and the ranking result is shown in FIG. 3. In order of arrangement order from small to large, the evaluation values of the reply information with arrangement orders 1-4 are denoted r_1, r_2, r_3, r_4, with r_1 > r_2 > r_3 > r_4, and their output probabilities are denoted s_1, s_2, s_3, s_4. Let the preset smoothing term a = 0.001. Based on formula (3), the weighted sub-loss of the reply information with arrangement order 1 is l'_1 = -sigmoid(r_1) × log( exp(s_1) / [ exp(s_1 × (r_1 - r_1 + a)) + exp(s_2 × (r_1 - r_2 + a)) + exp(s_3 × (r_1 - r_3 + a)) + exp(s_4 × (r_1 - r_4 + a)) ] ), and the weighted sub-losses l'_2 and l'_3 of the reply information with arrangement orders 2 and 3 are calculated in the same way. The total loss L = l'_1 + l'_2 + l'_3. Further, the pre-trained language model is updated based on the total loss L.
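As a purely illustrative usage of the weighted_total_loss sketch above for the FIG. 3 scenario, with hypothetical output probabilities and evaluation values (the patent gives no concrete numbers):

s = [0.42, 0.31, 0.18, 0.09]   # hypothetical output probabilities s_1..s_4
r = [2.1, 1.4, 0.3, -0.8]      # hypothetical evaluation values, r_1 > r_2 > r_3 > r_4

L = weighted_total_loss(s, r, a=0.001)   # sums l'_1 + l'_2 + l'_3
print(L)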
After the total loss is calculated, the parameters of the pre-training human-computer interaction model are updated by gradient descent according to the total loss, thereby realizing the parameter update of the pre-training human-computer interaction model.
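A schematic sketch of such a gradient-descent update in PyTorch is given below; the model object policy, its score method and the data iterator ranked_batches are hypothetical placeholders, and the optimizer choice and learning rate are illustrative only.

import torch

def ranking_loss(s, r, a=0.001):
    # formula (3) on tensors: s, r are 1-D tensors ordered by arrangement order (best first)
    n = s.shape[0]
    total = s.new_zeros(())
    for i in range(n - 1):
        lift = torch.exp(s[i])
        suppress = torch.exp(s[i:] * (r[i] - r[i:] + a)).sum()
        total = total + torch.sigmoid(r[i]) * -torch.log(lift / suppress)
    return total

optimizer = torch.optim.SGD(policy.parameters(), lr=1e-5)      # `policy`: the pre-trained model (hypothetical name)

for sample, replies, scores in ranked_batches:                 # `ranked_batches`: hypothetical data iterator
    s = policy.score(sample, replies)                          # hypothetical method returning differentiable output probabilities
    loss = ranking_loss(s, torch.as_tensor(scores, dtype=s.dtype))
    optimizer.zero_grad()
    loss.backward()                                            # backpropagate the total loss
    optimizer.step()                                           # gradient-descent parameter update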
In this embodiment, the optimized human-computer interaction model may be applied to a human-computer interaction system. Based on the optimized human-computer interaction model, corresponding reply information is generated according to the user's input information; because the generated reply information conforms to human preferences and values, its quality is higher, and the quality and performance of human-computer interaction are better.
According to the scheme of this embodiment, a plurality of pieces of reply information of an input sample are generated through the pre-training human-computer interaction model, and the output probability of each piece of reply information is obtained; the plurality of pieces of reply information of the input sample are sorted according to reply quality to obtain a sorting result; and the parameters of the pre-training human-computer interaction model are optimized according to the sorting result and the output probability of each piece of reply information. Because the model parameters are optimized through supervised learning based on a sorting result that reflects human preferences over multiple pieces of reply information for the same input sample, the training is simpler, more efficient and more stable than reinforcement-learning-based training, and a human-computer interaction model aligned with human preferences can be obtained quickly and effectively. This improves the degree of alignment between the reply information generated by the human-computer interaction model and human preferences, thereby improving the quality of the generated reply information and the human-computer interaction quality of the artificial intelligence system. In addition, by comparing different pieces of reply information of the same input sample, the training samples are used more efficiently.
Fig. 4 is a flowchart of a data processing method based on a human-computer interaction large model according to another exemplary embodiment of the present application. In this embodiment, the end-side device provides the server with an initial large model for implementing human-computer interaction, such as a large-scale language model or a large-scale multimodal model. The server performs pre-training of the initial large model and optimization for human preference alignment to obtain a trained large model. As shown in fig. 4, the method specifically comprises the following steps:
In step S401, the end device sends a training request for the initial large model to the server.
In this embodiment, the end-side device designates an initial large model, and sends a training request for the initial large model to the server. Optionally, the end-side device may send a training request for the initial large model to the server after providing the initial large model to the server; or, the end-side device may send a training request for the initial large model to the server, and then upload the initial large model to the server; alternatively, the end-side device may send a training request for the initial large model to the server, where the training request carries the download address of the initial large model. In addition, the end device may upload the initial large model to the server, and after receiving the initial large model, the server automatically triggers step S403 and the subsequent processing flows.
Illustratively, the server sends interactive interface data to the end-side device, and the end-side device outputs an interactive interface according to the interactive interface data. The interactive interface may provide an input area for the initial large model, through which the initial large model can be provided to the server, e.g. by uploading the model or entering the model download address.
Optionally, in response to the upload operation of the large model, the end-side device uploads the initial large model to the server. In step S402, the server acquires an initial large model uploaded by the end-side device, and stores the initial large model.
Optionally, the end-side device provides the server with the download address of the initial large model. For example, the download address may be included in a training request for the initial large model. In step S402, the server acquires the download address of the initial large model provided by the end-side device, and downloads and stores the initial large model according to the download address of the initial large model.
Step S402, a server receives a training request for an initial large model sent by a terminal side device.
Step S403, the server performs pre-training on the initial large model to obtain a pre-trained large model.
After the initial large model to be trained is obtained, the server pretrains the initial large model through a large amount of corpus to obtain a pretrained large model.
Specifically, a large number of input samples and reply information of the input samples are collected, and the reply information of the input samples is marked, so that more accurate and standard reply information is obtained. And training the initial large model based on the marked input sample and the reply information of the input sample to obtain the pre-trained large model. In addition, the initial large model may also be trained using the public dataset to obtain a pre-trained large model.
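For illustration only, one possible supervised pre-training/fine-tuning step on the labelled (input sample, reply information) pairs might look like the following sketch; it assumes a causal language model in the Hugging Face Transformers style that accepts labels and returns a loss, and the batch fields are assumptions rather than details given in the patent.

import torch

def pretrain_step(model, optimizer, batch):
    # batch["input_ids"] / batch["attention_mask"]: tokenized "input sample + labelled reply" (assumed fields)
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["input_ids"])      # causal-LM objective over the labelled text
    loss = outputs.loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()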
In addition, the method for pre-training the large model to obtain the pre-trained large model can be implemented by any existing method, and details are omitted here.
Optionally, after pre-training the initial large model, fine-tuning of the pre-trained large model may also be performed. And in the subsequent step, continuing to align the human preference of the fine-tuned pre-trained large model to obtain a trained large model. The fine tuning of the pre-training large model can be specifically realized by using any existing method for fine tuning of the pre-training large model based on the use scene of the large model, and will not be repeated here.
Step S404, the server generates a plurality of pieces of reply information of the input sample through the pre-training large model, and obtains the output probability of each piece of reply information.
After the pre-trained large model is obtained, the input sample is input into the pre-trained large model, and a plurality of reply information of the same input sample is generated and output through the pre-trained large model.
The obtaining of the input samples is consistent with the implementation manner of obtaining the input samples in the step S201, which is not described herein. The specific implementation manner of the step S404 is identical to that of the step S202, and the details of the foregoing embodiment are referred to in the related content, which is not repeated here.
Step S405, the server sorts the multiple pieces of reply information of the input sample according to the reply quality, so as to obtain a sorting result of the multiple pieces of reply information of the input sample.
The specific implementation manner of this step is identical to that of the foregoing step S203, and the related content of the foregoing embodiment is specifically referred to, which is not described herein again.
Step S406, the server optimizes parameters of the pre-training large model according to the sequencing result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information, and obtains optimized model parameters.
This step is consistent with the specific implementation manner of step S204, and the details of the foregoing embodiment are referred to in detail, which is not described herein.
Step S407, the server sends the optimized model parameters to the terminal side equipment.
After human preference alignment is performed on the pre-trained large model to obtain optimized model parameters through steps S404-S407, the server sends the optimized model parameters to the end-side device.
And step S408, the terminal side equipment receives the model parameters after the initial large model optimization sent by the server.
And step S409, updating the model parameters of the initial large model by the end side equipment according to the optimized model parameters to obtain a trained large model.
In this step, the end-side device updates the model parameters of the initial large model according to the optimized model parameters returned by the server to obtain a large model with optimized model parameters, i.e. the trained large model.
Illustratively, the large model aligned with human preferences in the present embodiment may be a large model for human-computer interaction. The end side equipment can deploy the trained large model locally or to another server, and human-computer interaction is realized based on the trained large model, so that the human-computer interaction output information is more in line with human preference and value, and the quality and performance of human-computer interaction are improved.
In step S410, in response to the input information of the user, the terminal device generates reply information of the input information through the trained large model.
In the step, the terminal side equipment receives input information of a user and generates reply information of the input information through a trained large model.
Optionally, when the trained large model runs on the end-side device, the end-side device inputs the user's input information into the trained large model and obtains the reply information output by the large model. Alternatively, the end-side device may deploy the trained large model to a designated server that provides an application program interface (API) for the trained large model. Based on the user's input information, the end-side device calls the trained large model through the API of the trained large model to generate reply information for the input information.
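A hypothetical sketch of the end-side device calling the deployed large model through such an API is shown below; the endpoint URL and the request/response fields are assumptions for illustration and are not specified by the patent.

import json
import urllib.request

def generate_reply(user_input, endpoint="http://model-server.example.com/v1/generate"):
    # send the user's input information to the deployed model and return its reply
    payload = json.dumps({"input": user_input}).encode("utf-8")
    request = urllib.request.Request(endpoint, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["reply"]   # assumed response field

print(generate_reply("Please recommend some study tips for children."))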
In step S411, the terminal device outputs the reply information of the input information.
The embodiment provides an application in a man-machine interaction scene, and an initial large model provided by end-side equipment to a server. The server achieves pre-training of the initial large model and optimization of human preference alignment, obtains a trained large model, and provides optimized model parameters for the terminal side equipment. The end-side equipment acquires a trained large model based on the optimized model parameters, and the output of the large model can be well aligned with human preference and value, so that the quality of human-computer interaction can be improved.
Fig. 5 is a flowchart of a data processing method based on a human-computer interaction large model according to another exemplary embodiment of the present application. In this embodiment, the end-side device provides a server with a pre-trained large model. Such as a large-scale pre-trained language model, a large-scale multi-modal pre-trained model, and the like. The server optimizes human preference alignment of the pre-trained large model to obtain the trained large model. As shown in fig. 5, the method specifically comprises the following steps:
in step S501, the end device sends an optimization request for the pre-trained large model to the server.
In this embodiment, the end-side device designates a pre-trained large model, and sends an optimization request for the pre-trained large model to the server. Optionally, the end-side device may send an optimization request for the pre-trained large model to the server after providing the pre-trained large model to the server; or, the end-side device may send an optimization request for the pre-trained large model to the server, and then upload the pre-trained large model to the server; alternatively, the end-side device may send an optimization request for the pre-trained large model to the server, where the optimization request carries the download address of the pre-trained large model. In addition, the terminal device may upload the pre-training large model to the server, and the server automatically triggers step S503 and subsequent processing flows after receiving the pre-training large model.
Illustratively, the server sends the interactive interface data to the end-side device. The terminal side device outputs an interactive interface according to the interactive interface data, and the interactive interface can provide an input area of the pre-training large model. The pre-trained large model can be provided to the server through the input area of the pre-trained large model by means of uploading the model, inputting the model download address and the like.
Optionally, in response to an upload operation of the pre-trained large model, the end-side device uploads the pre-trained large model to the server. The server acquires a pre-trained large model uploaded by the end-side device in step S502. Optionally, the end-side device provides the server with the download address of the pre-trained large model. For example, the download address of the pre-trained large model may be included in the optimization request for the pre-trained large model. In step S502, the server obtains a download address of the pre-trained large model provided by the end-side device, and downloads the pre-trained large model according to the download address of the pre-trained large model.
Step S502, the server receives the optimization request for the pre-training large model sent by the end-side device.
In step S503, the server generates a plurality of reply information of the input sample through the pre-training large model, and obtains output probabilities of the reply information.
After the pre-trained large model is obtained, the input sample is input into the pre-trained large model, and a plurality of reply information of the same input sample is generated and output through the pre-trained large model.
The obtaining of the input samples is consistent with the implementation manner of obtaining the input samples in the step S201, which is not described herein. The specific implementation manner of the step S503 is identical to the specific implementation manner of the step S202, and the related content of the foregoing embodiment is specifically referred to, which is not described herein.
Step S504, the server sorts the plurality of pieces of reply information of the input sample according to the reply quality, and a sorting result of the plurality of pieces of reply information of the input sample is obtained.
The specific implementation manner of this step is identical to that of the foregoing step S203, and the related content of the foregoing embodiment is specifically referred to, which is not described herein again.
Step S505, the server optimizes parameters of the pre-training large model according to the sequencing result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information, and obtains optimized model parameters.
This step is consistent with the specific implementation manner of step S204, and the details of the foregoing embodiment are referred to in detail, which is not described herein.
And step S506, the server sends the optimized model parameters to the terminal side equipment.
After human preference alignment is performed on the pre-trained large model through steps S503-S506 to obtain optimized model parameters, the server sends the optimized model parameters to the end-side device.
And step S507, the terminal side equipment receives model parameters after the pre-training large model optimization sent by the server.
And step S508, updating the model parameters of the pre-trained large model by the end side equipment according to the optimized model parameters to obtain a trained large model.
In this step, the end-side device updates the model parameters of the pre-trained large model according to the optimized model parameters returned by the server to obtain a large model with optimized model parameters, i.e. the trained large model.
Illustratively, the large model aligned with human preferences in the present embodiment may be a large model for human-computer interaction. The end side equipment can deploy the trained large model locally or to another server, and human-computer interaction is realized based on the trained large model, so that the human-computer interaction output information is more in line with human preference and value, and the quality and performance of human-computer interaction are improved.
Step S509, responding to the input information of the user, and generating reply information of the input information by the end side device through the trained large model.
In the step, the terminal side equipment receives input information of a user and generates reply information of the input information through a trained large model. Optionally, for the case that the trained large model is operated on the end-side device, the end-side device inputs input information of the user into the trained large model, and obtains reply information output by the large model. Alternatively, the end-side device may deploy the trained large model to a designated server that provides an Application Program Interface (API) for the trained large model. The terminal side equipment calls the trained large model to generate reply information of the input information of the user through an Application Program Interface (API) of the trained large model based on the input information of the user.
Step S510, the terminal device outputs the reply information of the input information.
The embodiment provides another application in the human-computer interaction scene, and the end-side equipment provides a pre-training large model for the server. The server optimizes human preference alignment of the pre-trained large model, obtains the trained large model, and provides optimized model parameters for the end-side equipment. The end-side equipment acquires a trained large model based on the optimized model parameters, and the output of the large model can be well aligned with human preference and value, so that the quality of human-computer interaction can be improved.
FIG. 6 is a schematic diagram of another example system architecture to which the present application applies. As shown in fig. 6, the system architecture includes a server at the enterprise end and an electronic device for providing a pre-trained large model. A communication link is arranged between the server at the enterprise end and the electronic device providing the pre-trained large model, so that a communication connection can be established between them. The electronic device providing the pre-trained large model may be a device through which an organization/platform that owns the pre-trained large model provides the model externally, and may specifically be a server cluster deployed in the cloud or a device with local computing capability. Fig. 6 illustrates the example in which the electronic device providing the pre-trained large model is deployed in the cloud.
FIG. 7 is a flowchart of a data processing method based on a large model according to an exemplary embodiment of the present application. The method of the embodiment is applied to the server of the enterprise side. As shown in fig. 7, the specific steps of the method are as follows:
step S701, obtaining a pre-training large model.
In this embodiment, the server of the enterprise may be a server device that provides man-machine interaction services for the outside based on a large model for each enterprise/service system/platform. The server at the enterprise side obtains a pre-trained large model from the electronic device.
Alternatively, the server at the enterprise end may request the pre-trained model from the electronic device providing the pre-trained large model and receive the pre-trained large model issued by the electronic device. Optionally, the server at the enterprise end may further obtain a download address of the pre-trained large model, and download the pre-trained large model from the electronic device that provides the pre-trained large model according to the download address of the pre-trained large model.
Step S702, an input sample of the currently applied vertical field is obtained.
In this embodiment, the vertical domain to which the large model is applied may be different for different enterprise terminals (e.g., enterprise, service system, service platform). Each enterprise terminal obtains sample data of the vertical field currently applied according to application requirements of the enterprise terminal, optimizes human preference alignment of the pre-trained large model, and obtains a large model which is applicable to the current vertical field and aligned with human preference.
For example, based on the application requirements of an enterprise end, for an intelligent office scenario, various office documents, notices, interaction information and the like of the enterprise end (e.g. an enterprise, service system or service platform) can be collected and used as training data to train the pre-trained large model, so as to obtain a large model applied to intelligent office.
The difference between this step and the implementation manner of obtaining the input sample in the foregoing step S201 is that in this embodiment, the server at the enterprise end obtains the input sample in the specific vertical domain, and the implementation manner of obtaining the input sample can refer to the relevant content in the foregoing embodiment, which is not described herein again.
Step S703, generating a plurality of pieces of reply information of the input sample through the pre-training large model, and obtaining output probabilities of the respective pieces of reply information.
This step is consistent with the implementation of step S202, and details of the foregoing embodiment are not described herein.
Step S704, sorting the plurality of pieces of reply information of the input sample according to the reply quality, to obtain a sorting result of the plurality of pieces of reply information of the input sample.
This step is consistent with the implementation of step S203, and details of the foregoing embodiment are not described herein.
Step S705, optimizing parameters of the pre-trained large model according to the sequencing result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information to obtain a large model in the vertical field, wherein the large model in the vertical field is applied to a man-machine interaction system in the vertical field and is used for generating reply information according to the input information.
This step is consistent with the implementation of step S204, and detailed descriptions thereof are omitted herein for reference.
Further, a server at the enterprise end uses the trained large model to realize the man-machine interaction system based on the vertical field. Specifically, the server at the enterprise end receives user input information, generates reply information of the input information by using the trained large model, and outputs the reply information to realize high-quality man-machine interaction.
In addition, aiming at the medical field, the traffic field, the entertainment field and the like, the large medical model, the large traffic model, the large entertainment knowledge model and the like which are aligned with human preferences can be trained by the method of the application so as to be applied to the human-computer interaction system corresponding to the vertical field.
According to the method, based on the input samples of the applied vertical field, human preference alignment optimization can be performed on the pre-trained large model, so that the large model applicable to the corresponding vertical field is obtained, human preference alignment capability of the large model when the large model is applied to the vertical field can be improved, and human-computer interaction quality is improved.
Fig. 8 is a schematic structural diagram of a data processing device based on a man-machine interaction model according to an exemplary embodiment of the present application. The data processing device based on the man-machine interaction model provided in the application embodiment can execute the processing flow of the server in the data processing method embodiment based on the man-machine interaction model. As shown in fig. 8, the data processing apparatus 80 based on the human-computer interaction model includes: a pre-training model acquisition module 81, a model processing module 82, a ranking module 83, and a model optimization module 84.
The pre-training model obtaining module 81 is configured to obtain a pre-trained human-computer interaction model and an input sample.
The model processing module 82 is configured to generate a plurality of reply information of an input sample through a human-computer interaction model, and obtain output probabilities of the reply information.
The sorting module 83 is configured to sort the plurality of reply information of the input sample according to the reply quality, so as to obtain a sorting result of the plurality of reply information of the input sample.
The model optimization module 84 is configured to optimize parameters of a human-computer interaction model according to the sorting result of the plurality of reply information of the input sample and the output probability of each reply information, and the human-computer interaction model is configured to generate reply information according to the user input information.
In an alternative embodiment, in implementing optimization of parameters of the human-computer interaction model according to the sorting result of the plurality of reply information of the input sample and the output probability of each reply information, the model optimization module 84 is further configured to: calculating total loss according to the sorting result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information; and optimizing parameters of the human-computer interaction model according to the total loss.
In an alternative embodiment, in implementing calculation of the total loss according to the sorting result of the plurality of reply information of the input sample and the output probability of each reply information, the model optimization module 84 is further configured to: according to the sorting results of the plurality of pieces of reply information of the input sample, respectively calculating sub-losses corresponding to the first n-1 pieces of reply information in the sorting results, wherein the sub-loss corresponding to any piece of reply information aims at improving the output probability of the current reply information, and inhibiting the output probability of the reply information arranged behind the current reply information, and n is the number of pieces of reply information of the input sample; and calculating the sum of sub-losses corresponding to the first n-1 pieces of reply information to obtain the total loss.
In an alternative embodiment, when implementing sorting results according to the plurality of reply information of the input sample, and calculating sub-losses corresponding to the first n-1 reply information in the sorting results respectively, the model optimization module 84 is further configured to: determining a sub-loss promotion item according to the output probability of the current reply information for any reply information ranked in the top n-1 bits in the sorting result according to the sorting result of the plurality of pieces of reply information of the input sample; determining a sub-loss suppression item according to the current reply information and the output probability of the reply information arranged behind the current reply information; and calculating the sub-loss corresponding to the current reply information according to the lifting item and the suppressing item.
In an alternative embodiment, in implementing determining the suppression terms for sub-loss based on the current reply message and the output probabilities of the reply messages ranked after the current reply message, model optimization module 84 is further configured to: determining a penalty coefficient corresponding to the reply information arranged after the current reply information according to the difference between the evaluation value of the current reply information and the evaluation value of the reply information arranged after the current reply information; and determining the suppression item of the sub-loss according to the current reply information, the output probability of the reply information arranged after the current reply information and the penalty coefficient.
In an alternative embodiment, when implementing the calculation of the sum of the sub-losses corresponding to the first n-1 pieces of reply information to obtain the total loss, the model optimization module 84 is further configured to: normalizing the evaluation values of the n pieces of reply information of the input sample to obtain normalized evaluation values of the n pieces of reply information; taking the normalized evaluation value of each piece of reply information as the weight coefficient corresponding to that piece of reply information, and carrying out weighted summation on the sub-losses corresponding to the first n-1 pieces of reply information to obtain the total loss.
In an alternative embodiment, when implementing the obtaining of the output probability of each piece of reply information, the model processing module 82 is further configured to: obtaining the output probability of the terms contained in each piece of reply information output by the man-machine interaction model; and calculating the output probability of the reply information based on the output probabilities of the terms contained in the reply information.
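As an illustration of how term-level output probabilities might be aggregated into a reply-level output probability (the patent does not fix the aggregation, so the length-normalized geometric mean used here is an assumption):

import math

def reply_output_probability(term_probs):
    # term_probs: probabilities the model assigned to each term of one reply
    log_prob = sum(math.log(p) for p in term_probs)
    return math.exp(log_prob / len(term_probs))   # geometric mean of the term probabilities

print(reply_output_probability([0.9, 0.7, 0.85, 0.6]))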
In an alternative embodiment, when implementing ranking the plurality of reply messages of the input sample according to the reply quality, the ranking module 83 is further configured to: outputting a plurality of pieces of reply information of the input samples through the interactive interface; receiving a sequencing result of a plurality of pieces of reply information of an input sample appointed in the interactive interface; or outputting the initial sorting results of the plurality of pieces of reply information of the input sample through the interactive interface, and responding to the adjustment operation of the initial sorting results in the interactive interface to obtain the adjusted sorting results.
In an alternative embodiment, when implementing ranking the plurality of reply messages of the input sample according to the reply quality, the ranking module 83 is further configured to: inputting a plurality of pieces of answer information of an input sample into a pre-trained evaluation model, and outputting an evaluation value of each piece of answer information through the evaluation model; and sequencing the plurality of pieces of reply information of the input sample according to the evaluation value to obtain sequencing results of the plurality of pieces of reply information of the input sample.
In an alternative embodiment, when implementing the acquisition of the pre-trained human-computer interaction model, the pre-training model acquisition module 81 is further configured to: receiving a pre-trained human-computer interaction model provided by the terminal side equipment; or, receiving the initial model provided by the terminal side equipment, and pre-training the initial model to obtain a pre-trained man-machine interaction model.
The device provided in the embodiment of the present application may be specifically used to execute the processing flow executed by the server in any of the above method embodiments, and the specific functions and the technical effects that can be implemented are not described herein.
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 9, the server includes: a memory 901 and a processor 902. Memory 901 for storing computer-executable instructions and may be configured to store various other data to support operations on a server. The processor 902 is communicatively connected to the memory 901, and is configured to execute computer-executable instructions stored in the memory 901, so as to implement the technical solution provided by the server in any of the above method embodiments, and the specific functions and the technical effects that can be implemented are similar, and are not repeated herein.
Optionally, as shown in fig. 9, the server further includes: a firewall 903, a load balancer 904, a communication component 905, a power supply component 906, and other components. Only some of the components are schematically shown in fig. 9, which does not mean that the server only comprises the components shown in fig. 9. In addition, fig. 9 only takes a cloud server deployed in the cloud as an example; the server may also be a local server.
The embodiment of the application also provides end-side equipment, which comprises: memory and a processor. The memory is used to store computer-executable instructions and may be configured to store various other data to support operations on the end-side device. The processor is in communication connection with the memory, and is configured to execute computer-executed instructions stored in the memory, so as to implement the technical scheme executed by the end-side device in any of the above method embodiments, and specific functions and technical effects that can be implemented are similar, and are not repeated herein.
The embodiment of the application also provides a computer readable storage medium, in which computer executable instructions are stored, and the computer executable instructions are used for implementing the technical scheme provided by the server in any of the method embodiments when being executed by the processor, and specific functions and technical effects that can be implemented are not repeated here.
The embodiment of the application also provides a computer readable storage medium, in which computer executable instructions are stored, and when the computer executable instructions are executed by a processor, the computer executable instructions are used for implementing the technical scheme provided by the end side device in any of the method embodiments, and specific functions and technical effects that can be implemented are not repeated herein.
The embodiment of the application also provides a computer program product, which comprises: the computer program is stored in a readable storage medium, and the computer program can be read from the readable storage medium by at least one processor of the server, where execution of the computer program by at least one processor causes the server to execute the technical solution provided by the server in any of the method embodiments, and specific functions and technical effects that can be achieved are not described herein.
The embodiment of the application also provides a computer program product, which comprises: the computer program is stored in the readable storage medium, and the at least one processor of the end-side device may read the computer program from the readable storage medium, where execution of the computer program by the at least one processor causes the end-side device to execute the technical solution provided by the end-side device in any of the above method embodiments, and specific functions and technical effects that can be achieved are not repeated herein.
The embodiment of the application provides a chip, which comprises: the processing module and the communication interface, the processing module can execute the technical scheme of the server in the foregoing method embodiment. Optionally, the chip further includes a storage module (e.g. a memory), where the storage module is configured to store the instructions, and the processing module is configured to execute the instructions stored in the storage module, and execution of the instructions stored in the storage module causes the processing module to execute the technical solution provided by the server in any one of the foregoing method embodiments.
The embodiment of the application provides a chip, which comprises: the processing module and the communication interface, the processing module can execute the technical scheme of the terminal equipment in the foregoing method embodiment. Optionally, the chip further includes a storage module (e.g. a memory), where the storage module is configured to store the instructions, and the processing module is configured to execute the instructions stored in the storage module, and execution of the instructions stored in the storage module causes the processing module to execute the technical solution provided by the end-side device in any one of the foregoing method embodiments.
The memory may be an object store (Object Storage Service, OSS). The memory may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device where the communication component is located may access a wireless network based on a communication standard, such as a mobile hotspot (WiFi), a mobile communication network of a second generation mobile communication system (2G), a third generation mobile communication system (3G), a fourth generation mobile communication system (4G)/Long Term Evolution (LTE), a fifth generation mobile communication system (5G), or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies. The power supply component provides power for various components of equipment where the power supply component is located. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, compact disk read-only memory (CD-ROM), optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should be noted that, the user information (including but not limited to user equipment information, user attribute information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a particular order are included, but it should be clearly understood that the operations may be performed out of order or performed in parallel in the order in which they appear herein, merely for distinguishing between the various operations, and the sequence number itself does not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types. The meaning of "a plurality of" is two or more, unless specifically defined otherwise.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. The data processing method based on the man-machine interaction model is characterized by comprising the following steps of:
acquiring a pre-trained man-machine interaction model and an input sample;
generating a plurality of pieces of reply information of the input sample through the man-machine interaction model, and acquiring output probability of each piece of reply information;
sequencing the plurality of reply information of the input sample according to the reply quality to obtain sequencing results of the plurality of reply information of the input sample;
and optimizing parameters of the man-machine interaction model according to the sorting result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information, wherein the man-machine interaction model is used for generating reply information according to the input information of a user.
2. The method according to claim 1, wherein optimizing parameters of the human-computer interaction model based on the ranking result of the plurality of reply information of the input sample and the output probability of each of the reply information comprises:
Calculating overall loss according to the sorting result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information;
and optimizing parameters of the man-machine interaction model according to the total loss.
3. The method of claim 2, wherein the calculating the overall loss based on the ranking result of the plurality of reply information of the input sample and the output probability of each of the reply information comprises:
according to the sorting results of the plurality of pieces of reply information of the input sample, respectively calculating sub-losses corresponding to the first n-1 pieces of reply information in the sorting results, wherein the sub-loss corresponding to any piece of reply information aims at improving the output probability of the current reply information, and suppresses the output probability of the reply information arranged behind the current reply information, and n is the number of the pieces of reply information of the input sample;
and calculating the sum of sub-losses corresponding to the first n-1 pieces of reply information to obtain the total loss.
4. A method according to claim 3, wherein the calculating sub-losses corresponding to the first n-1 pieces of reply information in the sorting result according to the sorting result of the plurality of pieces of reply information of the input sample includes:
According to the sorting results of the plurality of pieces of reply information of the input sample, for any piece of reply information which is ranked in the top n-1 bits in the sorting results, determining a sub-lost promotion item according to the output probability of the current reply information;
determining a sub-loss suppression item according to the current reply information and the output probability of the reply information arranged behind the current reply information;
and calculating the sub-loss corresponding to the current reply information according to the ratio of the lifting item to the suppression item.
5. The method of claim 4, wherein determining the suppression item for sub-loss based on the current reply message and the output probability of the reply message ranked after the current reply message, comprises:
determining a penalty coefficient corresponding to the reply information arranged behind the current reply information according to the difference between the evaluation value of the current reply information and the evaluation value of the reply information arranged behind the current reply information;
and determining the suppression item of the sub-loss according to the current reply information, the output probability of the reply information arranged after the current reply information and the penalty coefficient.
6. The method of claim 5, wherein calculating the sum of sub-losses corresponding to the first n-1 reply messages to obtain the total loss comprises:
Normalizing the evaluation values of the n pieces of reply information of the input information to obtain normalized evaluation values of the n pieces of reply information of the input information;
and taking the normalized evaluation value of each reply information of the input information as a weight coefficient corresponding to each reply information, and carrying out weighted summation on the sub-losses corresponding to the first n-1 reply information to obtain the total loss.
7. The method of claim 1, wherein said obtaining the output probability of each of said reply messages comprises:
obtaining the output probability of terms contained in each reply information output by the man-machine interaction model;
and calculating the output probability of the reply information according to the output probability of the terms contained in the reply information.
8. The method of claim 1, wherein the sorting the plurality of reply information of the input sample by reply quality results in a sorted result of the plurality of reply information of the input sample, comprising:
outputting a plurality of pieces of reply information of the input sample through an interactive interface, and receiving a sorting result of the plurality of pieces of reply information of the input sample appointed in the interactive interface;
or,
And outputting initial sequencing results of a plurality of pieces of reply information of the input sample through an interactive interface, and responding to the adjustment operation of the initial sequencing results in the interactive interface to obtain adjusted sequencing results.
9. The method of claim 1, wherein the sorting the plurality of reply information of the input sample by reply quality results in a sorted result of the plurality of reply information of the input sample, comprising:
inputting a plurality of pieces of reply information of the input sample into a pre-trained evaluation model, and outputting an evaluation value of each piece of reply information through the evaluation model;
and sequencing the plurality of pieces of reply information of the input sample according to the evaluation value to obtain a sequencing result of the plurality of pieces of reply information of the input sample.
10. The method according to any one of claims 1-9, wherein the obtaining a pre-trained human-machine interaction model comprises:
receiving a pre-trained human-computer interaction model provided by the terminal side equipment;
or,
and receiving an initial model provided by the end-side equipment, and pre-training the initial model to obtain a pre-trained human-computer interaction model.
11. A data processing method of a large model, which is applied to a server, comprising:
Obtaining a pre-training large model;
acquiring an input sample of a currently applied vertical field;
generating a plurality of pieces of reply information of the input sample through the pre-training large model, and acquiring output probability of each piece of reply information;
sequencing the plurality of reply information of the input sample according to the reply quality to obtain sequencing results of the plurality of reply information of the input sample;
and optimizing parameters of the pre-training large model according to the sequencing result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information to obtain a large model in the vertical field, wherein the large model in the vertical field is applied to a man-machine interaction system in the vertical field and is used for generating reply information according to the input information.
12. A data processing method based on a large model, applied to a server, comprising:
receiving a training request for an initial large model sent by an end-side device;
pre-training the initial large model to obtain a pre-trained large model;
generating a plurality of pieces of reply information of an input sample through the pre-trained large model, and obtaining the output probability of each piece of reply information;
sorting the plurality of pieces of reply information of the input sample by reply quality to obtain a sorting result of the plurality of pieces of reply information of the input sample;
optimizing parameters of the pre-trained large model according to the sorting result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information to obtain optimized model parameters;
and sending the optimized model parameters to the end-side device.
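A minimal sketch of the server-side handshake in claim 12; pretrain_fn and finetune_fn are hypothetical placeholders, and the ranking-based optimization inside finetune_fn follows the same steps sketched under claim 11:

def serve_training_request(initial_model, samples, pretrain_fn, finetune_fn):
    # Pre-train the initial large model submitted with the training request.
    pretrained = pretrain_fn(initial_model)
    # Generate replies, rank them by quality, and optimize the parameters
    # (the ranking-based loop sketched under claim 11).
    optimized = finetune_fn(pretrained, samples)
    # Return the optimized model parameters so they can be sent back to the end-side
    # device; with a PyTorch model this could be optimized.state_dict() (assumption).
    return optimized.state_dict() if hasattr(optimized, "state_dict") else optimized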
13. A data processing method based on a human-computer interaction model, applied to an end-side device, comprising:
sending a training request for an initial large model to a server;
receiving the optimized model parameters of the initial large model sent by the server, wherein the optimized model parameters are obtained by pre-training the initial large model to obtain a pre-trained large model, generating a plurality of pieces of reply information of an input sample through the pre-trained large model, obtaining the output probability of each piece of reply information, sorting the plurality of pieces of reply information of the input sample by reply quality to obtain a sorting result of the plurality of pieces of reply information of the input sample, and optimizing parameters of the pre-trained large model according to the sorting result of the plurality of pieces of reply information of the input sample and the output probability of each piece of reply information;
updating the model parameters of the initial large model according to the optimized model parameters to obtain a trained large model;
and, in response to input information from a user, generating reply information for the input information through the trained large model, and outputting the reply information of the input information.
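A minimal sketch of the end-side flow in claim 13; server.request_training, apply_parameters, and generate_reply are hypothetical placeholders for the request, parameter-update, and inference steps:

def run_end_side_device(initial_model, server, user_inputs):
    # Send a training request for the initial large model and receive the
    # optimized model parameters computed by the server.
    optimized_params = server.request_training(initial_model)
    # Update the initial model with the optimized parameters to obtain the trained model.
    trained_model = apply_parameters(initial_model, optimized_params)
    # Generate and output reply information for each piece of user input.
    for text in user_inputs:
        print(trained_model.generate_reply(text))

def apply_parameters(model, params):
    # Hypothetical update step; with a PyTorch model this would be model.load_state_dict(params).
    if hasattr(model, "load_state_dict"):
        model.load_state_dict(params)
    return model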
14. A server, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-12.
CN202310777685.XA 2023-06-28 2023-06-28 Data processing method and server based on man-machine interaction model or large model Pending CN116757270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310777685.XA CN116757270A (en) 2023-06-28 2023-06-28 Data processing method and server based on man-machine interaction model or large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310777685.XA CN116757270A (en) 2023-06-28 2023-06-28 Data processing method and server based on man-machine interaction model or large model

Publications (1)

Publication Number Publication Date
CN116757270A true CN116757270A (en) 2023-09-15

Family

ID=87947632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310777685.XA Pending CN116757270A (en) 2023-06-28 2023-06-28 Data processing method and server based on man-machine interaction model or large model

Country Status (1)

Country Link
CN (1) CN116757270A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117390497A (en) * 2023-12-08 2024-01-12 浙江口碑网络技术有限公司 Category prediction method, device and equipment based on large language model
CN117390497B (en) * 2023-12-08 2024-03-22 浙江口碑网络技术有限公司 Category prediction method, device and equipment based on large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination