WO2020228636A1 - Training method and device, dialogue processing method and system, and medium - Google Patents

Training method and device, dialogue processing method and system, and medium

Info

Publication number
WO2020228636A1
WO2020228636A1 (PCT/CN2020/089394, CN2020089394W)
Authority
WO
WIPO (PCT)
Prior art keywords
learning model
reinforcement learning
training
data
information
Prior art date
Application number
PCT/CN2020/089394
Other languages
English (en)
French (fr)
Inventor
朱红文
周莉
代亚菲
陈雪
邹声鹏
宋伊萍
张铭
张子涵
琚玮
Original Assignee
京东方科技集团股份有限公司
北京大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司, 北京大学
Priority to US17/413,373 (published as US20220092441A1)
Publication of WO2020228636A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/29 - Graphical models, e.g. Bayesian networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 - Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/02 - Knowledge representation; Symbolic representation
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H80/00 - ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring

Definitions

  • the present disclosure relates to the field of machine learning, and more specifically to a reinforcement learning model training method and device, a dialog processing method, a dialog system, and a computer-readable storage medium.
  • Reinforcement Learning is also known as evaluation learning.
  • Reinforcement learning is an important machine learning method with many applications in fields such as intelligent robot control, analysis, and prediction. In reinforcement learning, the agent learns in a "trial and error" manner and guides its behavior through the reward points obtained by interacting with the environment; the goal is for the agent's chosen behavior to obtain the maximum cumulative reward from the environment.
  • the Dialog System (or Conversation Agent) is a computer system designed to communicate with people coherently. It can include a computer-based agent with a human-machine interface for accessing, processing, managing, and transferring information.
  • the dialogue system can be implemented based on a reinforcement learning model. However, constructing a dialogue system based on a reinforcement learning model often requires a large amount of annotation data to improve the accuracy of the dialogue system. Such annotation data is usually expensive and difficult to obtain, which hinders the training and construction of the reinforcement learning model and also limits the further application of dialogue systems in various fields.
  • a dialog processing method including: acquiring dialog information; generating reply information based on a reinforcement learning model; and responding to the dialog information based on the reply information; wherein the reinforcement learning model is obtained by training with the following method: obtaining unlabeled data and labeled data for training the reinforcement learning model; based on the unlabeled data, referring to the labeled data to generate an experience pool for training the reinforcement learning model; and training the reinforcement learning model using the experience pool.
  • a training device for a reinforcement learning model including: an acquiring unit configured to acquire unlabeled data and labeled data for training the reinforcement learning model; a generating unit configured to generate, based on the unlabeled data and with reference to the labeled data, an experience pool for training the reinforcement learning model; and a training unit configured to train the reinforcement learning model using the experience pool.
  • a training device for a reinforcement learning model including: a processor; a memory; and computer program instructions stored in the memory, where the computer program instructions, when executed by the processor, cause the processor to perform the following steps: obtaining unlabeled data and labeled data used to train the reinforcement learning model; based on the unlabeled data, referring to the labeled data to generate an experience pool for training the reinforcement learning model; and training the reinforcement learning model using the experience pool.
  • a dialogue system including: a processor; a memory; and computer program instructions stored in the memory, where the computer program instructions, when executed by the processor, cause the processor to perform the following steps: acquiring dialogue information; generating reply information based on a reinforcement learning model; and responding to the dialogue information based on the reply information; wherein the reinforcement learning model is trained by the following method: obtaining unlabeled data and labeled data for training the reinforcement learning model; based on the unlabeled data, referring to the labeled data to generate an experience pool for training the reinforcement learning model; and training the reinforcement learning model using the experience pool.
  • a computer-readable storage medium on which computer-readable instructions are stored; when the instructions are executed by a computer, the dialog processing method described in any one of the foregoing is executed.
  • Fig. 1 shows an exemplary flowchart of a training method of a reinforcement learning model according to an embodiment of the present disclosure
  • Fig. 2 shows an exemplary flowchart of a dialog processing method according to an embodiment of the present disclosure
  • Fig. 3 shows an exemplary flowchart of a training method of a reinforcement learning model used in a dialog processing method according to an embodiment of the present disclosure
  • Fig. 4 shows an exemplary flow chart of a training method for a reinforcement learning model of a medical dialogue system according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of target information in a training method of a reinforcement learning model for a medical dialogue system according to an embodiment of the present disclosure
  • FIG. 6 shows the data collected according to the first example of the present disclosure and a schematic diagram of the training process for DQN;
  • FIG. 7 shows an exemplary flowchart of a dialog processing method used in the legal consulting field according to an embodiment of the present disclosure
  • Fig. 8 shows a block diagram of a training device for a reinforcement learning model according to an embodiment of the present disclosure
  • Fig. 9 shows a block diagram of a training device for a reinforcement learning model according to an embodiment of the present disclosure
  • FIG. 10 shows a block diagram of a dialogue system according to an embodiment of the present disclosure
  • FIG. 11 shows a block diagram of a dialogue system according to an embodiment of the present disclosure.
  • FIG. 12 shows a schematic diagram of a user interface 1200 of a medical dialogue system according to an embodiment of the present disclosure.
  • for the modules in the device and system according to the embodiments of the present disclosure, any number of different modules may be used and run on the client and/or server.
  • the modules are merely illustrative, and different modules may be used for different aspects of the apparatus, system, and method.
  • the reinforcement learning model generally includes an agent and an environment.
  • the agent continuously learns and optimizes its strategy through interaction and feedback with the environment. Specifically, the agent observes and obtains the state (state, s) of the environment, and according to a certain strategy, determines the action or action (action, a) to be taken for the state (s) of the current environment. Such action (a) acts on the environment, changes the state of the environment (for example, from s to s'), and generates reward points as feedback (reward, r) to be sent to the agent. The agent judges whether the previous action is correct and whether the strategy needs to be adjusted according to the reward score (r) obtained, and then updates its strategy. By repeatedly observing the state, determining actions, and receiving feedback, the agent can continuously update its strategy.
  • the ultimate goal of reinforcement learning model training is to be able to learn a strategy that maximizes the accumulation of reward points.
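  • The interaction loop described above can be summarized, purely as an illustrative sketch (the Environment and Agent interfaces below are assumptions, not the implementation of this disclosure), in the following Python pseudocode:

```python
# Illustrative sketch of the observe-act-reward loop; not the disclosure's code.
class Environment:
    def reset(self):
        """Return the initial state s."""
        raise NotImplementedError

    def step(self, action):
        """Apply action a; return (next_state s', reward r, done)."""
        raise NotImplementedError


class Agent:
    def act(self, state):
        """Choose an action a for the current state s according to the policy."""
        raise NotImplementedError

    def update(self, state, action, reward, next_state):
        """Adjust the policy based on the reward feedback."""
        raise NotImplementedError


def run_episode(env: Environment, agent: Agent, max_turns: int = 20) -> float:
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_turns):
        action = agent.act(state)                    # agent decides action a for state s
        next_state, reward, done = env.step(action)  # environment moves to s' and returns r
        agent.update(state, action, reward, next_state)
        total_reward += reward                       # training aims to maximize this sum
        state = next_state
        if done:
            break
    return total_reward
```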
  • the agent can adopt a neural network, such as a neural network based on deep reinforcement learning (DRL), for example Deep Q-Learning (DQN), Double-DQN, Dueling-DQN, Deep Deterministic Policy Gradient (DDPG), Asynchronous Advantage Actor-Critic (A3C), Continuous Deep Q-Learning with NAF, etc., which incorporate deep learning algorithms.
  • DQN Deep Q-Learning
  • DDPG Deep Deterministic Policy Gradient
  • A3C Asynchronous Advantage Actor-Critic
  • Continuous Deep Q-Learning with NAF etc.
  • the reinforcement learning model described in the embodiments of the present disclosure may be a neural network based on deep reinforcement learning DRL.
  • FIG. 1 shows an exemplary flowchart of a training method 100 of a reinforcement learning model according to an embodiment of the present disclosure.
  • the reinforcement learning model involved in Figure 1 can be applied to many fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation and so on.
  • step S101 unlabeled data and labeled data for training the reinforcement learning model are obtained.
  • the acquired data used for training the reinforcement learning model includes labeled data.
  • the annotation data may be data obtained from a database related to the field of the reinforcement learning model to be trained.
  • the training information related to the reinforcement learning model can be extracted from the annotation data, and the extracted training information can be stored as, for example, the user's goal information (also called user goal).
  • the extracted target information from the annotation data can be used for direct training of the reinforcement learning model to provide feedback to the agent and guide the training process.
  • the target information extracted in the annotation data may include information corresponding to results, classification tags, etc., respectively.
  • the acquired data used for training the reinforcement learning model may also include unlabeled data.
  • the unlabeled data may be obtained through various channels, and these channels may include unlabeled web pages, forums, chat records, databases, etc. related to the field of the reinforcement learning model to be trained.
  • the unlabeled data may be dialog data.
  • the annotation data may be medical case data obtained from, for example, electronic medical records, and the extracted target information may include disease, symptom classification, symptom attributes, and other information.
  • the unlabeled data may be, for example, medical conversation data obtained from the Internet, and the extracted training information of the unlabeled data may include various information such as conversation time, conversation object, conversation content, and diagnosis result.
  • the training method can also be applied to various other fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation, etc., and is not limited here.
  • step S102 based on the unlabeled data, an experience pool for training the reinforcement learning model is generated with reference to the labeled data.
  • the experience pool may be generated based on the effective training information extracted from the unlabeled data, and the target information extracted from the labeled data is used as the training target.
  • the experience pool may include one or more sequences consisting of a first state (s), an action (a), a reward score (r), and a second state (s'), and may be expressed as a four-tuple <s, a, r, s'>.
  • the action and the current first state can be obtained based on unlabeled data
  • the second state and the reward score can be obtained through interaction with the environment.
  • when the unlabeled data is dialogue data, the action may be any dialogue action acquired from the dialogue data; the first state may include all the historical information in the dialogue data before the acquired dialogue action, and the historical information can be composed of all the information and actions before the dialogue action.
  • the second state may be the state to which the environment has migrated after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the annotation data serving as the target information, after the action is applied while the environment is in the first state.
  • the reward scores in the quadruples for constructing the experience pool may further include the credibility (c or c') of the action. That is to say, when the field of the reinforcement learning model to be trained is known, the corresponding occurrence probability and specificity in the key information set of the field can be calculated based on the action (a), thereby obtaining the credibility c of the action and, after smoothing and normalization, the credibility c'.
  • the experience pool can then be constructed based on the quadruple <s, a, c'·r, s'>.
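  • As an illustrative sketch only, assembling such an experience pool from unlabeled dialogue data might look as follows; the helpers extract_turns, simulate, and credibility are hypothetical stand-ins for the dialogue-action extraction, the user-simulator interaction guided by the labeled target information, and the smoothed AF-IDF weighting described later:

```python
# Hypothetical sketch of building the experience pool of <s, a, c'*r, s'>
# quadruples from unlabeled dialogue data; the three callables are assumptions.
def build_experience_pool(dialogues, user_goal, extract_turns, simulate, credibility):
    experience_pool = []
    for dialogue in dialogues:
        history = []                                    # all information/actions so far
        for action in extract_turns(dialogue):          # each dialogue action a from the data
            s = tuple(history)                          # first state: history before the action
            s_next, r = simulate(s, action, user_goal)  # environment returns s' and reward r,
                                                        # guided by the labeled target information
            c_prime = credibility(action)               # smoothed, normalized credibility c'
            experience_pool.append((s, action, c_prime * r, s_next))
            history.append(action)
    return experience_pool
```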
  • the action and the current first state may be acquired based on medical dialogue data, and the second state and the reward score may be acquired through interaction with the environment.
  • the action may be any dialogue action obtained based on the medical dialogue data.
  • the action includes but is not limited to: starting a dialogue, ending a dialogue, requesting symptom information, diagnosing a disease, etc.; the first state may include all the historical information before the acquired dialogue action in the medical dialogue data, and the historical information may be composed of all the information and actions before the dialogue action.
  • the second state may be the state to which the environment has migrated after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the medical case data serving as the target information, after the action is applied while the environment is in the first state.
  • the credibility of the action included in the reward score at this time can be calculated by the following formulas (1)-(3):
  • D_i may represent the i-th disease (i is 0 or a positive integer); a disease D_i may include several pieces of medical dialogue data, and d_{i,j} may represent the data of the j-th dialogue for disease D_i (j is 0 or a positive integer). AF may represent the frequency of action a in a piece of medical dialogue data, and IDF indicates the specificity of action a with respect to a specific disease. Therefore, the credibility AF-IDF can be obtained as the product of AF and IDF, that is, c = AF × IDF, to reflect the credibility c of a certain action a.
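  • By analogy with TF-IDF, a minimal sketch of this credibility computation is given below; the exact formulas (1)-(3) of the disclosure are not reproduced, so the AF and IDF definitions here (relative action frequency per disease, and a smoothed inverse disease frequency) are assumptions that only follow the stated idea c = AF × IDF:

```python
import math
from collections import Counter

# Assumed AF/IDF definitions by analogy with TF-IDF; `dialogues_by_disease`
# maps each disease D_i to a list of dialogues, each dialogue being a list of actions.
def action_frequency(action, dialogues_of_disease):
    """AF: relative frequency of `action` in the dialogue data of one disease."""
    counts = Counter(a for dialogue in dialogues_of_disease for a in dialogue)
    total = sum(counts.values())
    return counts[action] / total if total else 0.0


def inverse_disease_frequency(action, dialogues_by_disease):
    """IDF: specificity of `action` across diseases, with simple +1 smoothing."""
    num_diseases = len(dialogues_by_disease)
    containing = sum(
        1 for dialogues in dialogues_by_disease.values()
        if any(action in dialogue for dialogue in dialogues)
    )
    return math.log((num_diseases + 1) / (containing + 1)) + 1.0


def credibility(action, disease, dialogues_by_disease):
    """c = AF * IDF for `action` under `disease` (before normalization to c')."""
    af = action_frequency(action, dialogues_by_disease[disease])
    idf = inverse_disease_frequency(action, dialogues_by_disease)
    return af * idf
```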
  • step S103 the reinforcement learning model is trained using the experience pool.
  • the experience pool may be used to assist in training the reinforcement learning model.
  • the agent (such as a DQN neural network) and the environment (such as a user simulator) can interact, and the four-tuples <s, a, r, s'> or <s, a, c'·r, s'> contained in the experience pool are used to assist training, with the target information extracted from the labeled data as the training target, so that the parameters of the DQN are updated through continuous simulation and iteration to obtain the final training result.
  • the experience pool can be continuously updated by using the quadruples obtained during the training process, that is, a new quadruple obtained during the training process can be added to the experience pool. Therefore, using the experience pool to train the reinforcement learning model may further include: in the process of training the reinforcement learning model, updating the experience pool according to the training result; and using the updated experience pool to further train the reinforcement learning model.
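  • A minimal sketch of such a training loop with a continuously updated experience pool is shown below; the agent's act/learn methods and the simulator interface are assumptions, and the minibatch replay sampling is standard DQN practice rather than a detail taken from this disclosure:

```python
import random

# Illustrative training loop: the agent interacts with the user simulator,
# new quadruples extend the experience pool, and minibatches sampled from the
# pool update the network. `agent` and `simulator` are assumed interfaces.
def train(agent, simulator, experience_pool, episodes=1000, batch_size=32):
    for _ in range(episodes):
        state = simulator.reset()
        done = False
        while not done:
            action = agent.act(state)                         # action initiated by the DQN
            next_state, reward, done = simulator.step(action)
            experience_pool.append((state, action, reward, next_state))  # update the pool
            state = next_state
            if len(experience_pool) >= batch_size:
                batch = random.sample(experience_pool, batch_size)
                agent.learn(batch)                            # Q-network update from replay
    return agent
```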
  • the action (a) in the formed quadruple can be initiated by DQN and act on the environment instead of being taken from unlabeled data.
  • the actions at this time may also include but are not limited to: starting a conversation, ending a conversation, requesting symptom information, confirming a disease, and so on.
  • the external knowledge here may be a database related to the reinforcement learning model, such as a knowledge graph.
  • the knowledge graph includes nodes of M diseases and N symptoms and the corresponding relationships between the various diseases and the various symptoms, where M and N are integers greater than or equal to 1, as well as the recommended drugs, prevention methods, treatment plans, and etiology for each disease.
  • the knowledge graph may also include the probability from each disease to each symptom and the probability from each symptom to each disease.
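  • Such a knowledge graph could be represented, for illustration only, as nested mappings; the field names and probability values below are assumed examples, with only the disease and symptom names taken from the example discussed later in this disclosure:

```python
# Assumed representation of the disease-symptom knowledge graph; field names
# and probability values are examples, not data from the disclosure.
knowledge_graph = {
    "diseases": {
        "ischaemic heart disease": {
            "symptoms": {"chest tightness": 0.7, "palpitations": 0.6},  # P(symptom | disease)
            "recommended_drugs": ["..."],
            "prevention": ["..."],
            "treatment": ["..."],
            "etiology": ["..."],
        },
    },
    "symptoms": {
        "chest tightness": {"ischaemic heart disease": 0.4},            # P(disease | symptom)
        "palpitations": {"ischaemic heart disease": 0.3},
    },
}
```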
  • the training method of the reinforcement learning model can be used to train the reinforcement learning model for the dialogue system.
  • the dialogue system can be divided into a task-oriented system (Task-oriented Dialogue System) and a non-task-oriented system (Non-Task-Oriented Dialogue System).
  • the task-oriented dialogue system refers to a type of dialogue system that aims to help users complete tasks in a specific field based on communication with users.
  • the training method of the reinforcement learning model can be used to train a reinforcement learning model for a task-oriented dialogue system, for example, it can be used to train a reinforcement learning model for a medical dialogue system.
  • the training method can also be applied to dialogue systems related to education, legal consultation, shopping and dining inquiries, flight inquiries, navigation, etc., and is not limited here.
  • a reinforcement learning model can be jointly trained based on unlabeled data and labeled data, thereby effectively reducing the need for labeled data when training the reinforcement learning model, improving the feasibility and stability of the training, and improving the training results of the reinforcement learning model.
  • FIG. 2 shows an exemplary flowchart of a dialog processing method 200 according to an embodiment of the present disclosure.
  • the dialogue processing method in Figure 2 can be applied to dialogue systems, also known as chat information systems, spoken dialogue systems, conversation agents, chatter robots, chatterbots, chatbots, chat agents, digital personal assistants, automated online assistants, and so on.
  • the dialogue system can use natural language to interact with people to simulate intelligent conversation and provide users with personalized assistance.
  • the dialogue system can be implemented based on a reinforcement learning model.
  • the reinforcement learning model based on the method shown in FIG. 2 can be applied to many fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation and so on.
  • step S201 dialog information is acquired.
  • the acquired dialogue information may be, for example, natural language text.
  • the effective dialog information to be processed can be extracted from it through various operations such as word segmentation, semantic analysis, etc., for use in subsequent dialog processing procedures.
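  • As a toy illustration of this step (not the disclosure's actual natural-language understanding), a symptom lexicon lookup can stand in for word segmentation and semantic analysis; the lexicon entries are assumed:

```python
# Toy stand-in for word segmentation and semantic analysis; the lexicon is assumed.
SYMPTOM_LEXICON = {"胸闷": "chest tightness", "心悸": "palpitations", "出汗": "sweating"}

def extract_dialog_info(utterance: str) -> dict:
    symptoms = [eng for term, eng in SYMPTOM_LEXICON.items() if term in utterance]
    return {"raw_text": utterance, "symptoms": symptoms}

# extract_dialog_info("最近总是胸闷，还出汗")
#   -> {"raw_text": "最近总是胸闷，还出汗", "symptoms": ["chest tightness", "sweating"]}
```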
  • step S202 reply information is generated based on the reinforcement learning model.
  • the reinforcement learning model can be obtained by training according to the aforementioned reinforcement learning model training method.
  • Fig. 3 shows an exemplary flowchart of a training method of a reinforcement learning model used in a dialog processing method according to an embodiment of the present disclosure.
  • the reinforcement learning model involved in Figure 3 can be applied to many fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation and so on.
  • step S2021 unlabeled data and labeled data for training the reinforcement learning model are acquired.
  • the acquired data used for training the reinforcement learning model includes labeled data.
  • the annotation data may be data obtained from a database related to the field of the reinforcement learning model to be trained.
  • the training information related to the reinforcement learning model can be extracted from the annotation data, and the extracted training information can be stored as, for example, the user's goal information (also called user goal).
  • the extracted target information from the annotation data can be used for direct training of the reinforcement learning model to provide feedback to the agent and guide the training process.
  • the target information extracted in the annotation data may include information corresponding to results, classification tags, etc., respectively.
  • the acquired data used for training the reinforcement learning model may also include unlabeled data.
  • the unlabeled data may be obtained through various channels, and these channels may include unlabeled web pages, forums, chat records, databases, etc. related to the field of the reinforcement learning model to be trained.
  • the unlabeled data may be dialog data.
  • the annotation data may be medical case data obtained from, for example, electronic medical records, and the extracted target information may include disease, symptom classification, symptom attributes, and other information.
  • the unlabeled data may be, for example, medical conversation data obtained from the Internet, and the extracted training information of the unlabeled data may include various information such as conversation time, conversation object, conversation content, and diagnosis result.
  • the training method can also be applied to various other fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation, etc., and is not limited here.
  • step S2022 based on the unlabeled data, refer to the labeled data to generate an experience pool for training the reinforcement learning model.
  • the target information extracted from the labeled data is used as the training target to generate the experience pool.
  • the experience pool may include one or more sequences consisting of a first state (s), an action (a), a reward score (r), and a second state (s'), and may be expressed as a four-tuple <s, a, r, s'>.
  • the action and the current first state can be obtained based on unlabeled data
  • the second state and the reward score can be obtained through interaction with the environment.
  • when the unlabeled data is dialogue data, the action may be any dialogue action acquired from the dialogue data; the first state may include all the historical information in the dialogue data before the acquired dialogue action, and the historical information can be composed of all the information and actions before the dialogue action.
  • the second state may be the state to which the environment has migrated after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the annotation data serving as the target information, after the action is applied while the environment is in the first state.
  • the reward scores in the quadruples for constructing the experience pool may further include the credibility (c or c') of the action. That is to say, when the field of the reinforcement learning model to be trained is known, the corresponding occurrence probability and specificity in the key information set of the field can be calculated based on the action (a), thereby obtaining the credibility c of the action and, after smoothing and normalization, the credibility c'.
  • the experience pool can then be constructed based on the quadruple <s, a, c'·r, s'>.
  • the action and the current first state may be acquired based on medical dialogue data, and the second state and the reward score may be acquired through interaction with the environment.
  • the action may be any dialogue action obtained based on the medical dialogue data.
  • the action includes but is not limited to: starting a dialogue, ending a dialogue, requesting symptom information, diagnosing a disease, etc.; the first state may include all the historical information before the acquired dialogue action in the medical dialogue data, and the historical information may be composed of all the information and actions before the dialogue action.
  • the second state may be the state to which the environment has migrated after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the medical case data serving as the target information, after the action is applied while the environment is in the first state.
  • the credibility of the action included in the reward score at this time can be calculated by the aforementioned formulas (1)-(3).
  • step S2023 the reinforcement learning model is trained using the experience pool.
  • the experience pool may be used to assist in training the reinforcement learning model.
  • the agent (such as a DQN neural network) and the environment (such as a user simulator) can interact, and the four-tuples <s, a, r, s'> or <s, a, c'·r, s'> contained in the experience pool are used to assist training, with the target information extracted from the labeled data as the training target, so that the parameters of the DQN are updated through continuous simulation and iteration to obtain the final training result.
  • the experience pool can be continuously updated by using the quadruples obtained during the training process, that is, a new quadruple obtained during the training process can be added to the experience pool. Therefore, using the experience pool to train the reinforcement learning model may further include: in the process of training the reinforcement learning model, updating the experience pool according to the training result; and using the updated experience pool to further train the reinforcement learning model.
  • the action (a) in the formed quadruple can be initiated by DQN and act on the environment. In the medical field, for example, the actions at this time may also include but are not limited to: starting a conversation, ending a conversation, requesting symptom information, confirming a disease, and so on.
  • the external indication here may be a database related to the reinforcement learning model, such as a knowledge graph.
  • the reply information generated by the trained DQN can be converted into natural language and output to respond to the dialogue information.
  • a reinforcement learning model can be jointly trained based on unlabeled data and labeled data, thereby effectively reducing the need for labeled data when training the reinforcement learning model, improving the feasibility and stability of training the reinforcement learning model, and improving its training results.
  • FIG. 4 shows an exemplary flowchart of a training method 400 for a reinforcement learning model of a medical dialogue system according to an embodiment of the present disclosure.
  • step S401 medical case data and medical dialogue data for training the reinforcement learning model are acquired.
  • the acquired data used for training the reinforcement learning model may include medical case data acquired from electronic medical records.
  • target information as a user goal can be extracted from the medical case data, for example, it can include various information such as disease, symptom classification, and symptom attributes.
  • the target information can be extracted into the format shown in FIG. 5.
  • the disease is represented as "disease_tag”: Ischaemic heart disease.
  • Symptoms actively reported by the patient can be recorded in “explicit_symptoms”.
  • the symptoms include "palpitations", whose frequency is "incidental", and "sweating", whose condition is "after exercise".
  • the symptoms obtained from the follow-up dialogue can be recorded in "implicit_symptoms”.
  • the symptoms include "chest tightness", which appears as "intensified", and "vomiting", which occurred "weeks ago", but no "fever".
  • the remaining tags of the target information may be unknown, denoted as "UNK”.
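  • Written out as an illustrative structure, the target information described above might look like the following; the top-level tags follow FIG. 5 as quoted, while the nested field names (frequency, condition, degree, onset) are assumptions made for the sketch:

```python
# Target information (user goal) from the FIG. 5 example, written as a dictionary.
# Top-level tags follow the quoted format; nested field names are assumptions.
user_goal = {
    "disease_tag": "Ischaemic heart disease",
    "explicit_symptoms": {                    # symptoms the patient reports actively
        "palpitations": {"frequency": "incidental"},
        "sweating": {"condition": "after exercise"},
    },
    "implicit_symptoms": {                    # symptoms obtained from the follow-up dialogue
        "chest tightness": {"degree": "intensified"},
        "vomiting": {"onset": "weeks ago"},
        "fever": False,                       # the patient did not have a fever
    },
    "other_tags": "UNK",                      # remaining, unknown tags
}
```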
  • the unlabeled data may be, for example, medical conversation data obtained from the Internet, and the training information of the unlabeled data extracted therefrom may include various information such as conversation time, conversation object, conversation content, and diagnosis result, and may be saved as, for example, a JSON file.
  • step S402 based on the medical dialogue data, with the medical case data as a target, an experience pool for training the reinforcement learning model is generated.
  • Fig. 6 shows the data collected according to the first example of the present disclosure and a schematic diagram of the DQN training process.
  • the left side is the unlabeled data, that is, the medical dialogue data, expressed in the format of an online conversation; the right side is the labeled data, that is, the medical case data, which acts as the target information (user goal) in the subsequent DQN training process.
  • effective training information can be first extracted based on medical dialogue data, and the target information extracted by the annotation data can be used as the training target, and the experience pool can be generated through interaction with the environment (user simulator).
  • any one of the dialogue actions obtained based on the medical dialogue data can be taken as the action (a); for example, the action (a) includes but is not limited to: starting a dialogue, ending a dialogue, requesting symptom information, diagnosing a disease, etc.; and all the information and actions before the action (a) in the medical dialogue data are combined into historical information to form the first state (s).
  • the second state may be the state (s') that the environment migrates to after the action (a) is applied when the user simulator is in the first state (s);
  • the reward score may include feedback (r) made under the guidance of the medical case data as the target information after the action (a) is applied when the user simulator is in the first state (s).
  • the experience pool can be constructed according to the four-tuple <s, a, r, s'>.
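  • A minimal sketch of the user simulator's feedback used when forming these quadruples is given below; the action encoding and the reward values are illustrative assumptions rather than the settings of this disclosure:

```python
# Illustrative user simulator feedback; reward values and action encoding are assumed.
def simulator_step(state, action, user_goal):
    next_state = list(state) + [action]                 # the environment migrates to s'
    if action.get("type") == "request_symptom":
        symptom = action.get("symptom")
        known = user_goal.get("explicit_symptoms", {}).get(symptom) \
            or user_goal.get("implicit_symptoms", {}).get(symptom)
        return next_state, (1.0 if known else -0.1)     # small reward guided by the user goal
    if action.get("type") == "diagnose":
        correct = action.get("disease") == user_goal.get("disease_tag")
        return next_state, (20.0 if correct else -10.0) # large reward for a correct diagnosis
    return next_state, 0.0                              # start/end dialogue and other actions
```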
  • multiple quadruples may be formed from the medical dialogue data, for example <s_1, a_1, r_1, s_1'> to <s_n, a_n, r_n, s_n'>, for the subsequent construction of the experience pool.
  • the credibility of the action may be further evaluated and utilized.
  • step S403 the reinforcement learning model is trained using the experience pool.
  • the experience pool may be used to assist in training the reinforcement learning model.
  • the DQN in Figure 6 and the user simulator can interact.
  • the quadruples contained in the experience pool are used for auxiliary training, and the target information (user goal) is used as the training target, so that the parameters in the DQN are updated through continuous simulation and iteration to obtain the final training results.
  • the experience pool can be continuously updated using the quadruples obtained during the training process, that is, new quadruples obtained during the training process can be added to the experience pool, and Use the updated experience pool to further train the reinforcement learning model.
  • the action (a) in the formed quadruple can be initiated by the DQN and act on the environment.
  • the actions at this time may also include but are not limited to: starting a conversation, ending a conversation, requesting symptom information, confirming a disease, and so on.
  • additional external knowledge such as a knowledge graph may be introduced to assist decision-making.
  • the training results of the reinforcement learning model and the content of the knowledge graph can be combined to make the final decision, so as to further improve the training effect of the reinforcement learning model.
  • an exemplary flowchart of a dialog processing method 700 used in the legal consulting field is provided, as shown in FIG. 7.
  • step S701 dialogue information related to legal consultation is acquired.
  • the acquired dialogue information related to legal consultation may be, for example, a natural language text related to legal consultation.
  • the natural language text can be understood, and the effective dialogue information to be processed can be extracted from it.
  • step S702 reply information is generated based on the reinforcement learning model.
  • the response information that needs feedback can be generated based on the DQN reinforcement learning model.
  • the reinforcement learning model can be obtained by training according to the following reinforcement learning model training method:
  • legal clause data (as labeled data) and legal consultation dialogue data (as unlabeled data) used to train the reinforcement learning model are acquired.
  • the acquired data used to train the reinforcement learning model can include legal clause data obtained from electronic legal clauses
  • the target information as a user goal can be further extracted from the legal clause data, for example, it can include legal clauses.
  • the legal consultation dialogue data may be, for example, legal consultation dialogue data obtained from the Internet, and the training information extracted therefrom may include various information such as dialogue time, dialogue object, dialogue content, and legal application results, and can be saved in the form of, for example, a JSON file.
  • the legal clause data can be used as a target to generate an experience pool for training the reinforcement learning model.
  • the experience pool can include one or more quadruples <s, a, r, s'>, or <s, a, c'·r, s'> including the credibility c', which will not be repeated here.
  • the experience pool can be used to train the reinforcement learning model.
  • the DQN and the user simulator can interact; the quadruples contained in the experience pool are used to assist training, and the target information (user goal) is used as the training target, so that the parameters in the DQN are updated through continuous simulation and iteration to obtain the final training result.
  • step S703 a response is made to the dialogue information based on the reply information.
  • the generated reply information can be converted into natural language and output to respond to the dialogue information.
  • FIG. 8 shows a block diagram of a training device 800 for a reinforcement learning model according to an embodiment of the present disclosure.
  • the training device 800 of the reinforcement learning model includes an acquiring unit 810, a generating unit 820, and a training unit 830.
  • the training device 800 of the reinforcement learning model may also include other components. However, since these components are not related to the content of the embodiments of the present disclosure, the illustration and description thereof are omitted here.
  • since the specific details of the following operations performed by the training apparatus 800 of the reinforcement learning model according to the embodiment of the present disclosure are the same as those described above with reference to FIG. 1, the repeated description of the same details is omitted here to avoid repetition.
  • the acquisition unit 810 of the training device 800 of the reinforcement learning model in FIG. 8 acquires unlabeled data and labeled data used for training the reinforcement learning model.
  • the data used for training the reinforcement learning model acquired by the acquiring unit 810 includes labeled data.
  • the annotation data may be data obtained from a database related to the field of the reinforcement learning model to be trained.
  • the training information related to the reinforcement learning model can be extracted from the annotation data, and the extracted training information can be stored as, for example, the user's goal information (also called user goal).
  • the extracted target information from the annotation data can be used for direct training of the reinforcement learning model to provide feedback to the agent and guide the training process.
  • the target information extracted in the annotation data may include information corresponding to results, classification tags, etc., respectively.
  • the data used for training the reinforcement learning model acquired by the acquiring unit 810 may also include unlabeled data.
  • the unlabeled data may be obtained through various channels, and these channels may include unlabeled web pages, forums, chat records, databases, etc. related to the field of the reinforcement learning model to be trained.
  • the unlabeled data may be dialog data.
  • the annotation data may be medical case data obtained from, for example, electronic medical records, and the extracted target information may include disease, symptom classification, symptom attributes, and other information.
  • the unlabeled data may be, for example, medical conversation data obtained from the Internet, and the extracted training information of the unlabeled data may include various information such as conversation time, conversation object, conversation content, and diagnosis result.
  • the training device can also be applied to various other fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation, etc., which is not limited here.
  • the generating unit 820 generates an experience pool for training the reinforcement learning model based on the unlabeled data and referring to the labeled data.
  • the generating unit 820 may generate the experience pool based on the effective training information extracted from the unlabeled data, using the target information extracted from the labeled data as the training target.
  • the experience pool may include one or more sequences consisting of a first state (s), an action (a), a reward score (r), and a second state (s'), and may be expressed as a four-tuple <s, a, r, s'>.
  • the action and the current first state can be obtained based on unlabeled data
  • the second state and the reward score can be obtained through interaction with the environment.
  • when the unlabeled data is dialogue data, the action may be any dialogue action acquired from the dialogue data; the first state may include all the historical information in the dialogue data before the acquired dialogue action, and the historical information can be composed of all the information and actions before the dialogue action.
  • the second state may be the state to which the environment has migrated after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the annotation data serving as the target information, after the action is applied while the environment is in the first state.
  • the reward scores in the quadruples for constructing the experience pool may further include the credibility (c or c') of the action. That is to say, when the field of the reinforcement learning model to be trained is known, the corresponding occurrence probability and specificity in the key information set of the field can be calculated based on the action (a), thereby obtaining the credibility c of the action and, after smoothing and normalization, the credibility c'.
  • the experience pool can then be constructed based on the quadruple <s, a, c'·r, s'>.
  • the generating unit 820 may obtain the action and the current first state based on the medical conversation data, and obtain the second state and the reward score through interaction with the environment.
  • the action may be any dialogue action obtained based on the medical dialogue data.
  • the action includes but is not limited to: starting a dialogue, ending a dialogue, requesting symptom information, diagnosing a disease, etc.; the first state may include all the historical information before the acquired dialogue action in the medical dialogue data, and the historical information may be composed of all the information and actions before the dialogue action.
  • the second state may be the state to which the environment has migrated after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the medical case data serving as the target information, after the action is applied while the environment is in the first state.
  • the credibility of the action included in the reward score at this time can be calculated by the aforementioned formulas (1)-(3).
  • D_i may represent the i-th disease (i is 0 or a positive integer); a disease D_i may include several pieces of medical dialogue data, and d_{i,j} may represent the data of the j-th dialogue for disease D_i (j is 0 or a positive integer). AF may represent the frequency of action a in a piece of medical dialogue data, and IDF indicates the specificity of action a with respect to a specific disease. Therefore, the credibility AF-IDF can be obtained as the product of AF and IDF, that is, c = AF × IDF, to reflect the credibility c of a certain action a.
  • the training unit 830 uses the experience pool to train the reinforcement learning model.
  • the training unit 830 may use the experience pool to assist in training the reinforcement learning model.
  • the agent (such as a DQN neural network) and the environment (such as a user simulator) can interact, and the four-tuples <s, a, r, s'> or <s, a, c'·r, s'> contained in the experience pool are used to assist training, with the labeled data or the target information extracted from it as the training target, so that the parameters of the DQN are updated through continuous simulation and iteration to obtain the final training result.
  • the training unit 830 may additionally introduce external knowledge to assist decision-making when training the reinforcement learning model.
  • the training results of the reinforcement learning model and the content of the external knowledge can be combined in order to make the final decision and further improve the training effect of the reinforcement learning model.
  • the external knowledge here may be a database related to the reinforcement learning model, such as a knowledge graph.
  • the training device for the reinforcement learning model may be used to train the reinforcement learning model for the dialogue system.
  • the dialogue system can be divided into a task-oriented system (Task-oriented Dialogue System) and a non-task-oriented system (Non-Task-Oriented Dialogue System).
  • the task-oriented dialogue system refers to a type of dialogue system that aims to help users complete tasks in a specific field based on communication with users.
  • the training device for the reinforcement learning model can be used to train a reinforcement learning model for a task-oriented dialogue system, for example, it can be used for training a reinforcement learning model for a medical dialogue system.
  • the training device can also be applied to dialogue systems related to various other fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation, etc., which is not limited here.
  • FIG. 9 shows a block diagram of a training device 900 for a reinforcement learning model according to an embodiment of the present disclosure.
  • the apparatus 900 may be a computer or a server.
  • the training device 900 for a reinforcement learning model includes one or more processors 910 and a memory 920.
  • the training device 900 for a reinforcement learning model may also include an input device and an output device (not shown). These components can be interconnected through a bus system and/or other forms of connection mechanisms. It should be noted that the components and structure of the training device 900 for the reinforcement learning model shown in FIG. 9 are only exemplary and not restrictive. The training device 900 for the reinforcement learning model may also have other components and structures as required.
  • the processor 910 may be a central processing unit (CPU), a field-programmable gate array (FPGA), a single-chip microcomputer (MCU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or another logic computing device with data processing capabilities and/or program execution capabilities.
  • CPU central processing unit
  • FPGA field-programmable gate array
  • MCU single-chip microcomputer
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • the processor 910 may use the computer program instructions stored in the memory 920 to perform desired functions, which may include: obtaining unlabeled data and labeled data for training the reinforcement learning model; based on the unlabeled data, referring to the labeled data to generate an experience pool for training the reinforcement learning model; and using the experience pool to train the reinforcement learning model.
  • the computer program instructions include one or more processor operations defined by an instruction set architecture corresponding to the processor, and these computer instructions may be logically contained and represented by one or more computer programs.
  • the memory 920 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or nonvolatile memory, for example static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
  • SRAM static random access memory
  • EEPROM electrically erasable programmable read-only memory
  • EPROM erasable programmable read-only memory
  • PROM programmable read-only memory
  • ROM read-only memory
  • One or more computer program instructions can be stored on the computer-readable storage medium, and the processor 910 can run the program instructions to implement the functions of the training device for the reinforcement learning model of the embodiment of the present disclosure described above and/or other desired functions, and/or to execute the training method of the reinforcement learning model according to the embodiment of the present disclosure.
  • Various application programs and various data can also be stored in the computer-readable storage medium.
  • a computer-readable storage medium according to an embodiment of the present disclosure is described, on which computer program instructions are stored, wherein the computer program instructions are executed by a processor to implement the following steps: obtaining unlabeled data and labeled data used to train the reinforcement learning model; based on the unlabeled data, referring to the labeled data to generate an experience pool for training the reinforcement learning model; and using the experience pool to train the reinforcement learning model.
  • FIG. 10 shows a block diagram of a dialogue system 1000 according to an embodiment of the present disclosure.
  • the dialogue system 1000 includes an acquisition unit 1010, a generation unit 1020, and a response unit 1030.
  • the dialogue system 1000 may also include other components. However, since these components have nothing to do with the content of the embodiments of the present disclosure, their illustration and description are omitted here.
  • since the specific details of the following operations performed by the dialogue system 1000 according to the embodiment of the present disclosure are the same as those described above with reference to FIGS. 2 to 3, repeated descriptions of the same details are omitted here to avoid repetition.
  • the dialogue system 1000 described in FIG. 10 can also be called a chat information system, a spoken dialogue system, a conversation agent, a chatter robot, a chatterbot, a chatbot, a chat agent, a digital personal assistant, an automated online assistant, and so on.
  • the dialogue system 1000 can use natural language to interact with people to simulate intelligent conversation and provide personalized assistance to the user.
  • the dialogue system can be implemented based on a reinforcement learning model.
  • the reinforcement learning model based on the system shown in FIG. 10 can be applied to many fields such as education, legal consulting, shopping and dining inquiries, flight inquiries, navigation and so on.
  • the acquisition unit 1010 of the dialogue system 1000 in FIG. 10 acquires dialogue information.
  • the dialog information acquired by the acquiring unit 1010 may be, for example, a natural language text.
  • for a natural language text, it can be understood, and the effective dialog information to be processed can be extracted from it through various operations such as word segmentation and semantic analysis, for use in subsequent dialog processing procedures.
  • the generating unit 1020 generates reply information based on the reinforcement learning model.
  • the generating unit 1020 may generate response information that needs to be fed back according to the acquired dialog information, based on a reinforcement learning model such as DQN.
  • the reinforcement learning model can be obtained by training according to the aforementioned reinforcement learning model training method or training device.
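  • As a hedged sketch of what this generation step could look like (the network architecture, state encoding, and action set below are illustrative assumptions, not the DQN configuration of the disclosure), a small Q-network can score a fixed set of dialogue actions for the current state, and the highest-scoring action can be passed on as the reply to be verbalized:

```python
# Illustrative only: selecting a dialogue action with a small Q-network.
import torch
import torch.nn as nn

ACTIONS = ["request_symptom", "confirm_disease", "end_dialog"]  # assumed action set

class QNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def generate_reply_action(q_net: QNet, state_vec: torch.Tensor) -> str:
    with torch.no_grad():
        q_values = q_net(state_vec)
    return ACTIONS[int(q_values.argmax())]

q_net = QNet(state_dim=16, n_actions=len(ACTIONS))
print(generate_reply_action(q_net, torch.randn(16)))
```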
  • FIG. 3 shows an exemplary flowchart of a training method of a reinforcement learning model used in a dialogue system according to an embodiment of the present disclosure.
  • the reinforcement learning model involved in Figure 3 can be applied to many fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation and so on.
  • In step S2021, unlabeled data and labeled data for training the reinforcement learning model are obtained.
  • the acquired data used for training the reinforcement learning model includes labeled data.
  • the annotation data may be data obtained from a database related to the field of the reinforcement learning model to be trained.
  • the training information related to the reinforcement learning model can be extracted from the annotation data, and the extracted training information can be stored as, for example, the user's goal information (also called user goal).
  • the extracted target information from the annotation data can be used for direct training of the reinforcement learning model to provide feedback to the agent and guide the training process.
  • the target information extracted in the annotation data may include information corresponding to results, classification tags, etc., respectively.
  • the acquired data used for training the reinforcement learning model may also include unlabeled data.
  • the unlabeled data may be obtained through various channels, and these channels may include unlabeled web pages, forums, chat records, databases, etc. related to the field of the reinforcement learning model to be trained.
  • the unlabeled data may be dialog data.
  • When the method of the embodiments of the present disclosure is applied to the medical field, the annotation data may be medical case data obtained from, for example, electronic medical records, and the extracted target information may include various information such as disease, symptom classification, and symptom attributes.
  • the unlabeled data may be, for example, medical conversation data obtained from the Internet, and the extracted training information of the unlabeled data may include various information such as conversation time, conversation object, conversation content, and diagnosis result.
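  • Purely as an illustration of what such extracted information might look like (the field names below are assumptions loosely modeled on the user-goal format described elsewhere in this disclosure, e.g. FIG. 5, and are not a prescribed schema), the target information from labeled case data and the training information from unlabeled dialogue data could be stored as simple structures:

```python
# Illustrative structures only; field names are assumptions.
user_goal_example = {            # extracted from labeled medical case data
    "disease_tag": "Ischaemic heart disease",
    "explicit_symptoms": {"palpitations": "occasional", "sweating": "after exercise"},
    "implicit_symptoms": {"chest tightness": "aggravated", "fever": False},
}

dialogue_record_example = {      # extracted from unlabeled medical dialogue data
    "time": "2019-01-01 10:23",
    "speaker": "patient",
    "content": "I sometimes feel chest tightness after exercise.",
    "diagnosis": "UNK",          # may be unknown in unlabeled data
}
```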
  • the training method can also be applied to various other fields such as education, legal consultation, shopping and dining inquiries, flight inquiries, navigation, etc., and is not limited here.
  • In step S2022, based on the unlabeled data, an experience pool for training the reinforcement learning model is generated with reference to the labeled data.
  • Optionally, the experience pool can be generated based on the effective training information extracted from the unlabeled data, with the target information extracted from the labeled data serving as the training target.
  • the experience pool may include one or more sequences consisting of a first state (s), an action (a), a reward score (r), and a second state (s'), and may be expressed as a four-tuple ⁇ s,a,r,s'>.
  • In one example, the action and the current first state can be obtained based on the unlabeled data, and the second state and the reward score can be obtained through interaction with the environment.
  • When the unlabeled data is dialogue data, the action may be any dialogue action acquired from the dialogue data; the first state may include all the historical information in the dialogue data before the acquired dialogue action, and this historical information may consist of all the information and actions preceding that dialogue action.
  • Correspondingly, the second state may be the state to which the environment transitions after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the annotation data serving as target information, after the action is applied while the environment is in the first state.
  • Optionally, the reward score in the quadruple used to construct the experience pool may further include the credibility (c or c') of the action. That is, when the field of the reinforcement learning model to be trained is known, the occurrence probability and specificity of the action (a) within the key information set of that field can be calculated, thereby obtaining the credibility c of the action; the credibility c' is then obtained after smoothing and normalization.
  • The experience pool can then be constructed from the quadruple <s,a,c'r,s'>.
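  • A minimal sketch of such an experience pool is given below (the buffer capacity and uniform sampling are assumptions made for the example; the disclosure only specifies that quadruples <s,a,c'r,s'> are stored and reused during training):

```python
# Minimal experience pool (replay buffer) holding quadruples <s, a, c'*r, s'>.
import random
from collections import deque

class ExperiencePool:
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, credibility: float = 1.0):
        # Scale the reward score by the smoothed, normalized credibility c'.
        self.buffer.append((state, action, credibility * reward, next_state))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```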
  • When the method of the embodiments of the present disclosure is applied to the medical field, the action and the current first state may be acquired based on the medical dialogue data, and the second state and the reward score may be acquired through interaction with the environment.
  • The action may be any dialogue action obtained from the medical dialogue data, including but not limited to: starting a dialogue, ending a dialogue, requesting symptom information, confirming a disease, and so on; the first state may include all the historical information in the medical dialogue data before the acquired dialogue action, and this historical information may consist of all the information and actions preceding that dialogue action.
  • Correspondingly, the second state may be the state to which the environment transitions after the action is applied while the environment is in the first state; the reward score may include the feedback made, under the guidance of the medical case data serving as target information, after the action is applied while the environment is in the first state.
  • the credibility of the action included in the reward score at this time can be calculated by the aforementioned formulas (1)-(3).
  • Here, D={D_i} may be the set of diseases (for example, diseases coded using ICD-10), where D_i represents the i-th disease (i is 0 or a positive integer); each disease D_i may include several medical dialogue data, and the j-th dialogue data for disease D_i (j is 0 or a positive integer) may be denoted accordingly. AF may then represent the probability that action a appears in that medical dialogue data, while IDF represents the specificity of action a to a particular disease. Therefore, the credibility AF-IDF can be obtained as the product of AF and IDF to reflect the credibility c of a certain action a.
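  • Since the exact formulas (1) and (2) are not reproduced in this text, the sketch below only illustrates the general idea with a TF-IDF-like analogy; the precise definitions of AF and IDF used in this code are assumptions, not the disclosure's equations:

```python
# Illustrative AF-IDF style credibility score (assumed TF-IDF-like definitions).
import math
from collections import Counter

def action_credibility(action, dialogues_by_disease):
    """dialogues_by_disease: disease id -> list of dialogues, each a list of action names."""
    af_values, diseases_with_action = [], 0
    for dialogues in dialogues_by_disease.values():
        counts = Counter(a for dialogue in dialogues for a in dialogue)
        total = sum(counts.values()) or 1
        af_values.append(counts[action] / total)   # frequency of the action (AF-like)
        if counts[action] > 0:
            diseases_with_action += 1
    af = sum(af_values) / max(len(af_values), 1)
    # Specificity of the action across diseases (IDF-like).
    idf = math.log(len(dialogues_by_disease) / (1 + diseases_with_action))
    return af * idf   # credibility c, before smoothing and normalization
```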
  • In step S2023, the reinforcement learning model is trained using the experience pool.
  • In the embodiments of the present disclosure, after the experience pool is formed from the unlabeled data and the labeled data, the experience pool may be used to assist in training the reinforcement learning model.
  • Optionally, the agent (such as a DQN neural network) and the environment (such as a user simulator) can interact with each other; during the interaction, the quadruples contained in the experience pool (<s,a,r,s'> or <s,a,c'r,s'>) assist the training, and the labeled data or the target information extracted from it serves as the training target, so that the parameters of the DQN are updated through continuous simulation and iteration to obtain the final training result.
  • Optionally, during training, the experience pool can be continuously updated with the quadruples obtained in the training process; that is, new quadruples obtained during training can be added to the experience pool. Therefore, using the experience pool to train the reinforcement learning model may further include: during the training of the reinforcement learning model, updating the experience pool according to the training results; and training the reinforcement learning model using the updated experience pool.
  • During the training of the reinforcement learning model, the action (a) in the newly formed quadruples can be initiated by the DQN and applied to the environment. In the medical field, for example, such actions may also include but are not limited to: starting a dialogue, ending a dialogue, requesting symptom information, confirming a disease, and so on.
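  • The following is a condensed sketch of such a training loop (the stub simulator, reward scheme, and hyper-parameters are placeholders chosen for the example; the disclosure only specifies that the agent interacts with a user simulator, that new quadruples are added to the experience pool, and that the DQN parameters are updated through repeated simulation and iteration):

```python
# Illustrative DQN training loop with an experience pool that is updated online.
import random
import torch
import torch.nn as nn
import torch.optim as optim

STATE_DIM, N_ACTIONS, GAMMA, EPSILON, BATCH = 16, 4, 0.9, 0.1, 32
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
pool = []   # quadruples (s, a, r, s', done); an ExperiencePool could be used instead

def simulator_step(state, action):
    """Stub user simulator; a real one would be driven by the user-goal information."""
    return torch.randn(STATE_DIM), random.choice([-1.0, 0.0, 1.0]), random.random() < 0.1

for episode in range(100):
    state, done = torch.randn(STATE_DIM), False
    while not done:
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)          # explore
        else:
            action = int(q_net(state).argmax())           # exploit
        next_state, reward, done = simulator_step(state, action)
        pool.append((state, action, reward, next_state, done))   # update the pool
        state = next_state

        if len(pool) >= BATCH:                            # one DQN update step
            batch = random.sample(pool, BATCH)
            s = torch.stack([b[0] for b in batch])
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch])
            s2 = torch.stack([b[3] for b in batch])
            d = torch.tensor([float(b[4]) for b in batch])
            target = r + GAMMA * (1 - d) * q_net(s2).max(dim=1).values.detach()
            pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(pred, target)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
```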
  • Optionally, in the embodiments of the present disclosure, external knowledge may additionally be introduced when training the reinforcement learning model to assist decision making; in this case, both the training result of the reinforcement learning model and the content of the external knowledge can be considered in making the final decision, so as to further improve the training effect of the reinforcement learning model. In one example, the external knowledge here may be a database related to the reinforcement learning model, such as a knowledge graph.
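  • As a loosely hedged sketch of how such external knowledge might be combined with the model's output (the graph contents and the simple weighted combination below are assumptions made for the example, not the mechanism specified by the disclosure):

```python
# Illustrative only: re-ranking candidate diseases with a small knowledge graph.
KNOWLEDGE_GRAPH = {  # disease -> {symptom: P(symptom | disease)}, made-up numbers
    "ischaemic heart disease": {"chest tightness": 0.7, "palpitations": 0.6},
    "gastritis": {"nausea": 0.8, "chest tightness": 0.1},
}

def choose_disease(model_scores: dict, observed_symptoms: list, alpha: float = 0.7) -> str:
    """model_scores: disease -> score from the reinforcement learning model."""
    combined = {}
    for disease, score in model_scores.items():
        prior = sum(KNOWLEDGE_GRAPH.get(disease, {}).get(s, 0.0) for s in observed_symptoms)
        combined[disease] = alpha * score + (1 - alpha) * prior
    return max(combined, key=combined.get)

print(choose_disease({"ischaemic heart disease": 0.4, "gastritis": 0.5},
                     ["chest tightness", "palpitations"]))
```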
  • the response unit 1030 responds to the dialogue information based on the reply information.
  • the response unit 1030 can convert the generated reply information into natural language and output it to respond to the dialogue information.
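  • One possible form of this conversion, shown only as a sketch (the templates are made-up examples; the disclosure does not prescribe a natural language generation method), is a small template-based step:

```python
# Illustrative template-based "response unit" turning a structured reply
# (dialogue action plus slots) into natural language.
TEMPLATES = {
    "request_symptom": "Do you also have {symptom}?",
    "confirm_disease": "Based on your answers, you may have {disease}.",
    "end_dialog": "Thank you. Please consult a doctor for a final diagnosis.",
}

def respond(reply: dict) -> str:
    template = TEMPLATES.get(reply["action"], "{text}")
    slots = dict(reply.get("slots", {}))
    slots.setdefault("text", reply.get("text", ""))
    return template.format(**slots)

print(respond({"action": "request_symptom", "slots": {"symptom": "chest tightness"}}))
# -> "Do you also have chest tightness?"
```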
  • According to the dialogue system of the embodiments of the present disclosure, the reinforcement learning model can be jointly trained on unlabeled data and labeled data, which effectively reduces the need for labeled data when training the reinforcement learning model, improves the feasibility and stability of training the reinforcement learning model, and improves its training results.
  • FIG. 11 shows a block diagram of a dialogue system 1100 according to an embodiment of the present disclosure.
  • the apparatus 1100 may be a computer or a server.
  • the dialogue system 1100 includes one or more processors 1110 and a memory 1120.
  • In addition, the dialogue system 1100 may also include an input device, an output device (not shown), and the like, and these components can be interconnected through a bus system and/or other forms of connection mechanisms.
  • the components and structure of the dialogue system 1100 shown in FIG. 11 are only exemplary and not restrictive.
  • the dialogue system 1100 may also have other components and structures as required.
  • The processor 1110 may be a central processing unit (CPU), a field programmable logic array (FPGA), a single-chip microcomputer (MCU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), or another logic operation device with data processing capability and/or program execution capability, and can use the computer program instructions stored in the memory 1120 to perform desired functions, which can include: obtaining dialogue information; generating reply information based on the reinforcement learning model; responding to the dialogue information based on the reply information; wherein the reinforcement learning model is obtained by training with the following method: obtaining unlabeled data and labeled data used to train the reinforcement learning model; based on the unlabeled data, generating an experience pool for training the reinforcement learning model with reference to the labeled data; and training the reinforcement learning model using the experience pool.
  • The memory 1120 may include one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory, for example static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disc.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1110 may run the program instructions to implement the functions of the training device for the reinforcement learning model of the embodiments of the present disclosure described above and/or other desired functions, and/or to execute the dialogue processing method according to the embodiments of the present disclosure. Various application programs and various data can also be stored in the computer-readable storage medium.
  • In the following, a computer-readable storage medium according to an embodiment of the present disclosure is described, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the following steps: obtaining dialogue information; generating reply information based on the reinforcement learning model; responding to the dialogue information based on the reply information; wherein the reinforcement learning model is obtained by training with the following method: obtaining unlabeled data and labeled data for training the reinforcement learning model; based on the unlabeled data, generating an experience pool for training the reinforcement learning model with reference to the labeled data; and training the reinforcement learning model using the experience pool.
  • the dialogue system of the embodiment of the present disclosure is particularly suitable for the medical field.
  • Compared with other fields, the medical field has very little labeled data, because the requirements for labeling the data are relatively high; that is, doctors with high expertise and rich experience are needed to perform the labeling in order to ensure professionalism and accuracy.
  • By adopting the training method of the reinforcement learning model of the present application, the reinforcement learning model can be jointly trained on unlabeled data and labeled data, which reduces the dependence on and requirements for the doctors' expertise and experience, and effectively reduces the need for labeled data when training the reinforcement learning model.
  • FIG. 12 shows a schematic diagram of a user interface 1200 of a medical dialogue system according to an embodiment of the present disclosure.
  • the model involved in the medical dialogue system can be obtained by training using the training method of the reinforcement learning model described above.
  • the model can be stored in the form of a computer program instruction set.
  • the medical dialogue system may include: a user interface; a processor; and a memory on which computer program instructions are stored, and when the computer program instructions are executed by the processor, the processor is caused to perform the following steps.
  • the medical dialogue system first receives natural language input information from the user, and displays the natural language input information on the user interface 1200 (for example, to the right).
  • Natural language input information can be input by voice or text.
  • For example, as shown in box 1201, the user enters, as text, the natural language input information "I am a little dizzy and nauseous when eating".
  • Then, the medical dialogue system displays, on the user interface 1200 (for example, to the left), one or more questions associated with the symptoms mentioned in the natural language input information, so as to realize multiple rounds of question and answer; for each question, it receives an answer to the question from the user and displays the answer on the right side of the user interface.
  • the medical dialogue system displays a question asking when the dizziness occurs on the user interface (on the left side below the box 1201) (box 1202).
  • The question can be presented together with multiple answer options from which the user can choose.
  • The user gives an answer to the question, and the answer is displayed on the right side of the user interface (below box 1202); for example, if the user selects the option "in the last few days" when answering the question (1202), the text "in the last few days" is displayed in box 1203.
  • the medical dialogue system displays on the user interface (on the left side below the box 1203) questions asking about the frequency of dizziness (box 1204).
  • Similarly, the question can be presented together with multiple answer options from which the user can choose.
  • The user gives an answer to the question, and the answer is displayed on the user interface (below box 1204, to the right); for example, when the user answers the question (1204) and selects the option "three episodes or more per week", the text "three episodes or more per week" is displayed in box 1205. Multiple rounds of question and answer are completed in this way. Although only two rounds of question and answer are shown in the figure, depending on the way the reinforcement learning model of the medical dialogue system is trained, there may be more rounds of questions, which is not limited in the present disclosure.
  • Finally, after the multiple rounds of question and answer end, the medical dialogue system generates and displays the diagnosis result for the symptoms on the user interface, for example as shown in box 1206.
  • Optionally, the diagnosis result includes at least one of the following: the possible disease type, the symptoms of the possible disease type, recommended drugs suitable for the possible disease type, the symptoms targeted by the recommended drugs, links to more information about the recommended drugs, and so on.
  • the diagnosis result may also include the probability of various disease types that the symptom may correspond to.
  • Optionally, the diagnosis result is output in the form of natural language and displayed on the user interface, for example as shown in box 1206.
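  • To make the flow concrete, the sketch below strings the pieces together for a two-round consultation; the questions, options, and diagnosis payload are fabricated placeholders for illustration, not output of the disclosed system:

```python
# Illustrative two-round question-and-answer flow ending in a diagnosis payload.
QUESTIONS = [
    {"text": "When did the dizziness start?",
     "options": ["in the last few days", "more than a week ago"]},
    {"text": "How often does the dizziness occur?",
     "options": ["three episodes or more per week", "fewer than three per week"]},
]

def run_consultation(ask):
    """ask(question) returns the option chosen by the user (e.g. via the UI)."""
    answers = [ask(q) for q in QUESTIONS]
    return {                                   # placeholder diagnosis payload
        "possible_disease": "to be determined by the trained model",
        "disease_probabilities": {},           # e.g. {"disease A": 0.42, ...}
        "recommended_drugs": [],
        "more_info_links": [],
        "answers": answers,
    }

# Example: automatically pick the first option for each question.
print(run_consultation(lambda q: q["options"][0])["answers"])
```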
  • In the above embodiments, for ease of understanding and description, functional units corresponding to the functions to be performed are used in the description. It is easy to understand that these functional units are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities can be implemented by a general-purpose processor running, in the form of executing computer instructions, software corresponding to the functions; or they can be implemented programmably in one or more hardware modules or integrated circuits; or they can be implemented by integrated circuits designed to specifically perform the corresponding functions.
  • the general-purpose processor may be a central processing unit (CPU), a single-chip microcomputer (MCU), a digital signal processor (DSP), etc.
  • the programmable integrated circuit may be a field programmable logic circuit (FPGA).
  • the specialized integrated circuit may be an application-specific integrated circuit (ASIC), such as Tensor Processing Unit (TPU).
  • the program part in the technology can be regarded as a "product” or “article” in the form of executable code and/or related data, which participates in or is realized by a computer-readable medium.
  • the tangible and permanent storage medium may include any memory or storage used by computers, processors, or similar devices or related modules. For example, various semiconductor memories, tape drives, disk drives, or similar devices that can provide storage functions for software.
  • All software or part of it may sometimes communicate via a network, such as the Internet or other communication networks.
  • This type of communication can load software from one computer device or processor to another.
  • For example, software may be loaded from a server or host computer of an image retrieval device onto the hardware platform of a computer environment, or onto another computer environment implementing the system, or onto a system with similar functions related to providing the information required for image retrieval. Therefore, other media capable of transmitting software elements, such as light waves, electric waves, and electromagnetic waves propagated through cables, optical cables, or air, can also be used as physical connections between local devices.
  • Physical media used for such carrier waves, such as cables, wireless connections, or optical cables, can also be considered as media carrying the software.
  • Unless the usage here is limited to a tangible "storage" medium, other terms referring to a computer- or machine-"readable medium" all refer to media that participate in the process of a processor executing any instructions.
  • This application uses specific terms to describe its embodiments. Terms such as "first/second embodiment", "one embodiment", and/or "some embodiments" refer to a certain feature, structure, or characteristic related to at least one embodiment of the present application. Therefore, it should be emphasized and noted that "one embodiment", "an embodiment", or "an alternative embodiment" mentioned two or more times in different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics in one or more embodiments of the present application can be combined as appropriate.

Abstract

A reinforcement learning model training method and apparatus, a dialogue processing method and dialogue system, and a computer-readable storage medium. The training method of the reinforcement learning model includes: obtaining unlabeled data and labeled data for training the reinforcement learning model; based on the unlabeled data, generating an experience pool for training the reinforcement learning model with reference to the labeled data; and training the reinforcement learning model using the experience pool.

Description

训练方法和装置、对话处理方法和系统及介质
相关申请的交叉引用
本申请要求于2019年5月10日提交的中国专利申请第201910390546.5的优先权,该中国专利申请的全文通过引用的方式结合于此以作为本申请的一部分。
技术领域
本公开涉及机器学习领域,更具体地涉及强化学习模型训练方法和装置、对话处理方法和对话系统及计算机可读存储介质。
背景技术
强化学习(Reinforcement Learning),又称再励学习、评价学习,是一种重要的机器学习方法,在智能控制机器人及分析预测等领域有许多应用。强化学习是指智能体(Agent)以“试错”的方式进行学习,通过与环境(Environment)进行交互获得的奖励分数来指导行为,其目标是使得智能体选择的行为能够获得环境最大的奖励分数。
对话系统(Dialog System,或Conversation Agent)是一种旨在与人进行连贯交流的计算机系统,可以包括具有用于访问、处理、管理和传递信息的人机接口的基于计算机的代理。对话系统可以基于强化学习模型而实现。然而,在基于强化学习模型的对话系统的构建过程中,往往需要获取大量的标注数据,以提高对话系统的精度,这些所需的标注数据通常较为昂贵并且难以获取,从而影响了强化学习模型的训练和构建,也限制了对话系统在各领域的进一步应用。
发明内容
根据本公开的一个方面,提供了一种强化学习模型的训练方法,包括:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
根据本公开的另一方面,提供了一种对话处理方法,包括:获取对话信息;基于强化学习模型生成回复信息;基于所述回复信息对所述对话信息进行响应;其中,所述强化学习模型是通过如下方法训练得到的:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
根据本公开的另一方面,提供了一种强化学习模型的训练装置,包括:获取单元,配置为获取用于训练所述强化学习模型的未标注数据和标注数据;生成单元,配置为基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;训练单元,配置为利用所述经验池训练所述强化学习模型。
根据本公开的另一方面,提供了一种强化学习模型的训练装置,包括:处理器;存储器;和存储在所述存储器中的计算机程序指令,在所述计算机程序指令被所述处理器运行时,使得所述处理器执行以下步骤:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
根据本公开的另一方面,提供了一种计算机可读存储介质,其上存储有计算机可读的指令,当利用计算机执行所述指令时,执行前述任一项所述的强化学习模型训练方法。
根据本公开的另一方面,提供了一种对话系统,包括:获取单元,配置为获取对话信息;生成单元,配置为基于强化学习模型生成回复信息;响应单元,配置为基于所述回复信息对所述对话信息进行响应;其中,所述强化学习模型是通过如下方法训练得到的:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
根据本公开的另一方面,提供了一种对话系统,包括:处理器;存储器;和存储在所述存储器中的计算机程序指令,在所述计算机程序指令被所述处理器运行时,使得所述处理器执行以下步骤:获取对话信息;基于强化学习模型生成回复信息;基于所述回复信息对所述对话信息进行响应;其中,所述强化学习模型是通过如下方法训练得到的:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
根据本公开的另一方面,提供了一种计算机可读存储介质,其上存储有计算机可读的指令,当利用计算机执行所述指令时,执行前述任一项所述的对话处理方法。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本公开的一些实施例,对于本领域普通技术人员而言,在没有做出创造性劳动的前提下,还可以根据这些附图获得其他的附图。以下附图并未刻意按实际尺寸等比例缩放绘制,重点在于示出本公开的主旨。
图1示出了根据本公开实施例的强化学习模型的训练方法的示例性流程图;
图2示出了根据本公开实施例的对话处理方法的示例性流程图;
图3示出了根据本公开实施例的对话处理方法中所使用的强化学习模型的训练方法的示例性流程图;
图4示出了根据本公开实施例的用于医疗对话系统的强化学习模型的训练方法的示例性流程图;
图5示出了根据本公开实施例的用于医疗对话系统的强化学习模型的训练方法中目标信息的示意图;
图6示出了根据本公开的第一示例所采集的数据以及对DQN的训练流程示意图;
图7示出了根据本公开实施例的用于法律咨询领域的对话处理方法的示例性流程图;
图8示出了根据本公开实施例的强化学习模型的训练装置的框图;
图9示出了根据本公开实施例的强化学习模型的训练装置的框图;
图10示出了根据本公开实施例的对话系统的框图;
图11示出了根据本公开实施例的对话系统的框图;以及
图12示出了根据本公开实施例的医疗对话系统的用户界面1200的示意图。
具体实施方式
下面将结合附图对本公开实施例中的技术方案进行清楚、完整地描述,显而易见地,所描述的实施例仅仅是本公开的部分实施例,而不是全部的实施例。基于本公开实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,也属于本公开保护的范围。
如本公开和权利要求书中所示,除非上下文明确提示例外情形,“一”、“一个”、“一种”和/或“该”等词并非特指单数,也可包括复数。一般说来,术语“包括”与“包含”仅提示包括已明确标识的步骤和元素,而这些步骤和元素不构成一个排它性的罗列,方法、装置或者系统也可能包含其他的步骤或元素。
虽然本公开对根据本公开的实施例的装置、系统中的某些模块做出了各种引用,然而,任何数量的不同模块可以被使用并运行在用户端和/或服务器上。所述模块仅是说明性的,并且所述装置、系统和方法的不同方面可以使用不同模块。
本公开中使用了流程图用来说明根据本公开的实施例的装置、系统所执行的操作。应当理解的是,前面或下面操作不一定按照顺序来精确地执行。相反,根据需要,可以按照倒序或同时处理各种步骤。同时,也可以将其他操作添加到这些过程中,或从这些过程移除某一步或数步操作。
强化学习模型一般包括智能体和环境,智能体通过与环境的交互和反馈,不断进行学习,优化其策略。具体而言,智能体观察并获得环境的状态(state,s),根据一定策略,针对当前环境的状态(s)确定要采取的行为或动作(action,a)。这样的动作(a)作用于环境,会改变环境的状态(例如,从s到s’),同时产生奖励分数作为反馈(reward,r)发送给智能体。智能体根据获得的奖励分数(r)来判断之前的动作是否正确,策略是否需要调整,进而更新其策略。通过反复不断地观察状态、确定动作、收到反馈,智能体可以不断更新策略。强化学习模型训练的最终目标是能够学习到一个策略,使得获得的奖励分数累积最大化。在学习和调整策略的强化学习过程中,智能体可以采取包括神经网络,例如基于深度强化学习DRL的神经网络(例如Deep Q-Learing(DQN)、Double-DQN、Dualing-DQN、Deep Deterministic Policy Gradient(DDPG)、Asynchronous Advantage Actor-Critic(A3C)、Continuous Deep Q-Learning with NAF等)在内的一些深度学习的算 法。本公开实施例中所描述的强化学习模型可以是基于深度强化学习DRL的神经网络。
可见,在强化学习模型的训练过程中,一般需要采用大量的标注数据,以作为训练的目标来引导训练的过程,但是,这些标注数据的获取往往需要耗费大量的时间和系统资源,并且数量较少,较难获取。
在此基础上,本公开实施例提供了一种强化学习模型的训练方法,如图1所示。图1示出了根据本公开实施例的强化学习模型的训练方法100的示例性流程图。可选地,图1所涉及的强化学习模型可以适用于教育、法律咨询、购物餐饮查询、航班查询、导航等诸多领域。
在步骤S101中,获取用于训练所述强化学习模型的未标注数据和标注数据。
本步骤中,所获取的用于训练强化学习模型的数据包括标注数据。可选地,标注数据可以是从与所需训练的强化学习模型所在的领域相关的数据库中所获取的数据。在一个示例中,可以从标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息作为例如用户的目标信息(也称user goal)进行保存。所提取的来自标注数据的目标信息可以用于对强化学习模型的直接训练,以向智能体提供反馈,并引导训练过程。可选地,在标注数据中所提取的目标信息可以包括分别与结果(result)、分类标签(tag)等对应的信息。
进一步地,所获取的用于训练强化学习模型的数据还可以包括未标注数据。可选地,未标注数据可以通过各种途径进行获取,这些途径可以包括与所需训练的强化学习模型所在的领域相关的未经标注的网页、论坛、聊天记录、数据库等。可选地,未标注数据可以为对话数据。在一个示例中,也可以从未标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息用于后续生成用于训练强化学习模型的经验池。
可选地,在本公开实施例的方法应用于医疗领域时,所述标注数据可以为从例如电子病历等所获取的医疗病例数据,而提取出的目标信息可以包括疾病、症状分类、症状属性等各种信息。相应地,所述未标注数据可以为例如从互联网获取的医疗对话数据,提取出的未标注数据的训练信息则可以包括对话时间、对话对象、对话内容、诊断结果等各种信息。当然,上述内容仅为示例,在本公开实施例中,所述训练方法还可以应用于教育、法律咨询、购物餐饮查询、航班查询、导航等各个其他领域,在此不做限制。
在步骤S102中,基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池。
在本步骤中,可选地,可以基于未标注数据所提取的有效的训练信息,以标注数据所提取的目标信息作为训练目标,来生成所述经验池(experience pool)。可选地,经验池可以包括由第一状态(s)、动作(a)、奖励分数(r)和第二状态(s’)构成的一个或多个序列,并且可以表示为四元组<s,a,r,s’>。在一个示例中,可以基于未标注数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,当未标注数据为对话数据时,所述动作可以是基于所述对话数据获取的其中任一对话动作;所述第一 状态可以包括所述对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述标注数据作为目标信息的引导下所做出的反馈。
可选地,构建所述经验池的四元组中的奖励分数还可以进一步包括所述动作的可信度(c或c’)。也就是说,在已知所需训练的强化学习模型所在的领域时,可以基于所述动作(a)计算其在该领域的关键信息集合中相应的出现概率和特异性,从而得到所述动作的可信度c,并经平滑和归一化处理得到处理后的可信度c’。随后,可以根据四元组<s,a,c’r,s’>来构建所述经验池。
可选地,在本公开实施例的方法应用于医疗领域时,可以基于医疗对话数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,所述动作可以是基于所述医疗对话数据获取的其中任一对话动作,例如,所述动作包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等;所述第一状态可以包括所述医疗对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述医疗病例数据作为目标信息的引导下所做出的反馈。可选地,此时奖励分数中包括的所述动作的可信度可以通过如下公式(1)-(3)进行计算:
Figure PCTCN2020089394-appb-000001
Figure PCTCN2020089394-appb-000002
AF-IDF=AF·IDF     (3)
其中,D={D i}可以为疾病的集合,疾病的集合可以包括例如使用ICD-10编码的若干个疾病,例如,D i可以表示第i个疾病(i为0或正整数);每个疾病D i可以包括若干个医疗对话数据,例如,
Figure PCTCN2020089394-appb-000003
可以表示针对疾病D i的第j个对话数据(j为0或正整数),则AF可以表示动作a在医疗对话数据
Figure PCTCN2020089394-appb-000004
中所出现的概率,IDF则表示动作a出现在特定疾病下的特异性。从而,可信度AF-IDF可以通过AF和IDF二者的乘积而获取,以反映某个动作a的可信度c。在计算出可信度c之后,可以将其进行平滑和归一化处理,得到处理后的可信度c’,以避免因某些未采集到的疾病影响训练结果。最后,可以根据计算得到的c’形成四元组<s,a,c’r,s’>,以构建所述经验池。
在步骤S103中,利用所述经验池训练所述强化学习模型。
在本公开实施例中,在根据未标注数据和标注数据形成所述经验池之后,可以利用所 述经验池辅助训练所述强化学习模型。可选地,智能体(例如DQN神经网络)和环境(例如可以为用户模拟器(user simulator))可以进行交互,在交互过程中,以经验池中所包含的四元组(<s,a,r,s’>或<s,a,c’r,s’>)辅助训练,并以标注数据或其提取的目标信息作为训练的目标,以通过不断的模拟和迭代来更新DQN中的参数,从而得到最终的训练结果。可选地,在训练过程中,可以利用训练过程中得到的四元组不断更新所述经验池,也就是说,可以将训练过程中得到的新的四元组加入所述经验池。从而,利用所述经验池训练所述强化学习模型还可以包括:在训练所述强化学习模型的过程中,根据训练结果更新所述经验池;利用更新的所述经验池对所述强化学习模型进行训练。在训练所述强化学习模型的过程中,所形成的四元组中的动作(a)可以由DQN发起并作用于环境,而并不取自未标注数据。在例如医疗领域中,此时的动作也可以包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等。
可选地,在本公开实施例中,还可以在训练所述强化学习模型时,额外引入外部知识以辅助决策,在这种情况下,可以同时考虑强化学习模型的训练结果和外部知识的内容,以做出最终决策,实现进一步改善强化学习模型的训练效果的目的。在一个示例中,这里的外部指示可以为与强化学习模型相关的数据库,例如知识图谱等。例如,在对医疗对话系统的强化学习模型进行训练时,所述知识图谱包括M种疾病和N种症状的节点以及各种疾病与各种症状之间的对应关系,其中M和N为大于等于1的整数,以及针对每种疾病的推荐药物、预防手段、治疗方案、和病因等等。可选地,知识图谱还可以包括每种疾病到每种症状的概率以及每种症状到每种疾病的概率。在本公开实施例中,可选地,所述强化学习模型的训练方法可以用于训练用于对话系统的强化学习模型。其中,根据对话系统的任务类型不同,对话系统可以分为任务导向性系统(Task-oriented Dialogue System)和非任务导向性系统(Non-Task-Oriented Dialogue System)。其中,任务导向型对话系统是指以根据与用户的交流,帮助用户完成特定领域内的任务为目标的一类对话系统。在一个示例中,所述强化学习模型的训练方法可以用于训练用于任务导向性对话系统的强化学习模型,例如,可以用于训练用于医疗对话系统的强化学习模型。当然,上述内容仅为示例,在本公开实施例中,所述训练方法还可以应用于教育、法律咨询、购物餐饮查询、航班查询、导航等各个其他领域相关的对话系统,在此不做限制。
根据本公开实施例的强化学习模型训练方法,能够基于未标注数据和标注数据共同训练强化学习模型,从而有效减少了在训练所述强化学习模型时对标注数据的需求,提高了强化学习模型训练的可行性和稳定性,改善了强化学习模型的训练结果。
本公开实施例提供了一种对话处理方法,如图2所示。图2示出了根据本公开实施例的对话处理方法200的示例性流程图。图2中的对话处理方法可以应用于对话系统,也称聊天信息系统、口语对话系统、交谈代理、聊天者机器人(chatter robot)、聊天者机器人程序(chatterbot)、聊天机器人程序(chatbot)、聊天代理、数字个人助理和自动化在线助理等。该对话系统可以使用自然语言与人交互以模拟智能交谈,并向用户提供个性化的协助。对 话系统可以基于强化学习模型而实现。可选地,图2所示的方法中所基于的强化学习模型可以适用于教育、法律咨询、购物餐饮查询、航班查询、导航等诸多领域。
在步骤S201中,获取对话信息。
在本步骤中,所获取的对话信息可以例如为自然语言文本。可选地,可以基于所述自然语言文本进行理解,并从中通过例如分词、语义分析等各种操作而提取需处理的有效对话信息,以供后续的对话处理流程使用。
在步骤S202中,基于强化学习模型生成回复信息。
本步骤中,可以根据所获取的对话信息,基于例如DQN的强化学习模型来生成需要进行反馈的回复信息。其中,所述强化学习模型可以根据前述的强化学习模型训练方法训练得到。
图3示出了根据本公开实施例的对话处理方法中所使用的强化学习模型的训练方法的示例性流程图。可选地,图3所涉及的强化学习模型可以适用于教育、法律咨询、购物餐饮查询、航班查询、导航等诸多领域。
在步骤S2021中获取用于训练所述强化学习模型的未标注数据和标注数据。
本步骤中,所获取的用于训练强化学习模型的数据包括标注数据。可选地,标注数据可以是从与所需训练的强化学习模型所在的领域相关的数据库中所获取的数据。在一个示例中,可以从标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息作为例如用户的目标信息(也称user goal)进行保存。所提取的来自标注数据的目标信息可以用于对强化学习模型的直接训练,以向智能体提供反馈,并引导训练过程。可选地,在标注数据中所提取的目标信息可以包括分别与结果(result)、分类标签(tag)等对应的信息。
进一步地,所获取的用于训练强化学习模型的数据还可以包括未标注数据。可选地,未标注数据可以通过各种途径进行获取,这些途径可以包括与所需训练的强化学习模型所在的领域相关的未经标注的网页、论坛、聊天记录、数据库等。可选地,未标注数据可以为对话数据。在一个示例中,也可以从未标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息用于后续生成用于训练强化学习模型的经验池。
可选地,在本公开实施例的方法应用于医疗领域时,所述标注数据可以为从例如电子病历等所获取的医疗病例数据,而提取出的目标信息可以包括疾病、症状分类、症状属性等各种信息。相应地,所述未标注数据可以为例如从互联网获取的医疗对话数据,提取出的未标注数据的训练信息则可以包括对话时间、对话对象、对话内容、诊断结果等各种信息。当然,上述内容仅为示例,在本公开实施例中,所述训练方法还可以应用于教育、法律咨询、购物餐饮查询、航班查询、导航等各个其他领域,在此不做限制。
在步骤S2022中,基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池。
在本步骤中,可选地,可以基于未标注数据所提取的有效的训练信息,以标注数据所 提取的目标信息作为训练目标,来生成所述经验池。可选地,经验池可以包括由第一状态(s)、动作(a)、奖励分数(r)和第二状态(s’)构成的一个或多个序列,并且可以表示为四元组<s,a,r,s’>。在一个示例中,可以基于未标注数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,当未标注数据为对话数据时,所述动作可以是基于所述对话数据获取的其中任一对话动作;所述第一状态可以包括所述对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述标注数据作为目标信息的引导下所做出的反馈。
可选地,构建所述经验池的四元组中的奖励分数还可以进一步包括所述动作的可信度(c或c’)。也就是说,在已知所需训练的强化学习模型所在的领域时,可以基于所述动作(a)计算其在该领域的关键信息集合中相应的出现概率和特异性,从而得到所述动作的可信度c,并经平滑和归一化处理得到处理后的可信度c’。随后,可以根据四元组<s,a,c’r,s’>来构建所述经验池。
可选地,在本公开实施例的方法应用于医疗领域时,可以基于医疗对话数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,所述动作可以是基于所述医疗对话数据获取的其中任一对话动作,例如,所述动作包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等;所述第一状态可以包括所述医疗对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述医疗病例数据作为目标信息的引导下所做出的反馈。可选地,此时奖励分数中包括的所述动作的可信度可以通过前述公式(1)-(3)进行计算。
其中,D={D i}可以为疾病的集合,疾病的集合可以包括例如使用ICD-10编码的若干个(疾病,例如,D i可以表示第i个疾病(i为0或正整数);每个疾病D i可以包括若干个医疗对话数据,例如,
Figure PCTCN2020089394-appb-000005
可以表示针对疾病D i的第j个对话数据(j为0或正整数),则AF可以表示动作a在医疗对话数据
Figure PCTCN2020089394-appb-000006
中所出现的概率,IDF则表示动作a出现在特定疾病下的特异性。从而,可信度AF-IDF可以通过AF和IDF二者的乘积而获取,以反映某个动作a的可信度c。在计算出可信度c之后,可以将其进行平滑和归一化处理,得到处理后的可信度c’,以避免因某些未采集到的疾病影响训练结果。最后,可以根据计算得到的c’形成四元组<s,a,c’r,s’>,以构建所述经验池。
在步骤S2023中,利用所述经验池训练所述强化学习模型。
在本公开实施例中,在根据未标注数据和标注数据形成所述经验池之后,可以利用所 述经验池辅助训练所述强化学习模型。可选地,智能体(例如DQN神经网络)和环境(例如可以为用户模拟器(user simulator))可以进行交互,在交互过程中,以经验池中所包含的四元组(<s,a,r,s’>或<s,a,c’r,s’>)辅助训练,并以标注数据或其提取的目标信息作为训练的目标,以通过不断的模拟和迭代来更新DQN中的参数,从而得到最终的训练结果。可选地,在训练过程中,可以利用训练过程中得到的四元组不断更新所述经验池,也就是说,可以将训练过程中得到的新的四元组加入所述经验池。从而,利用所述经验池训练所述强化学习模型还可以包括:在训练所述强化学习模型的过程中,根据训练结果更新所述经验池;利用更新的所述经验池对所述强化学习模型进行训练。在训练所述强化学习模型的过程中,所形成的四元组中的动作(a)可以由DQN发起,作用于环境。在例如医疗领域中,此时的动作也可以包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等。
可选地,在本公开实施例中,还可以在训练所述强化学习模型时,额外引入外部知识以辅助决策,在这种情况下,可以同时考虑强化学习模型的训练结果和外部知识的内容,以做出最终决策,实现进一步改善强化学习模型的训练效果的目的。在一个示例中,这里的外部指示可以为与强化学习模型相关的数据库,例如知识图谱等。
回到图2,在步骤S203中,基于所述回复信息对所述对话信息进行响应。
在本步骤中,可以将由完成训练的DQN所生成的回复信息转化为自然语言并输出,以对所述对话信息进行响应。
根据本公开实施例的对话处理方法,能够基于未标注数据和标注数据共同训练强化学习模型,从而有效减少了在训练所述强化学习模型时对标注数据的需求,提高了强化学习模型训练的可行性和稳定性,改善了强化学习模型的训练结果。
第一示例
在本公开实施例的第一示例中,提供了一种用于医疗对话系统的强化学习模型的训练方法,如图4所示。图4示出了根据本公开实施例的用于医疗对话系统的强化学习模型的训练方法400的示例性流程图。
在步骤S401中,获取用于训练所述强化学习模型的医疗病例数据和医疗对话数据。
本步骤中,所获取的用于训练强化学习模型的数据可以包括从电子病例获取的医疗病例数据。在此基础上,可以从所述医疗病例数据中提取作为user goal的目标信息,例如可以包括疾病、症状分类、症状属性等各种信息。
例如,可以将目标信息提取为如图5所示的格式。在图5中,疾病表示为“disease_tag”:缺血性心脏病(Ischaemic heart diseases)。由病人主动报告的症状可以记录在“explicit_symptoms”中,例如,在图5中,所述症状包括“心悸”,频率为“偶发”,以及“出汗”,条件为“运动后”。经医生询问,从后续对话中所获取的症状可以记录在“implicit_symptoms”中,例如,在图5中,所述症状包括“胸闷”,表现为“加剧”,以及“呕吐”,发生在“数周前”,然而并未“发热”。在图5中,目标信息的其余标签可以为未知,表示为“UNK”。
相应地,所述未标注数据可以为例如从互联网获取的医疗对话数据,从中所提取的未标注数据的训练信息可以包括对话时间、对话对象、对话内容、诊断结果等各种信息,可以采用例如JSON文件进行保存。
在步骤S402中,基于所述医疗对话数据,以所述医疗病例数据作为目标,生成用于训练所述强化学习模型的经验池。
图6示出了根据本公开的第一示例所采集的数据以及对DQN的训练流程示意图。如图6所示,左侧为未标记数据,即医疗对话数据,表现为网络对话的格式;右侧为标记数据,即医疗病例数据,作为目标信息user goal作用于后续DQN的训练过程。在图6中,可以首先基于医疗对话数据提取有效的训练信息,并以标注数据所提取的目标信息作为训练目标,通过与环境(用户模拟器)的交互,来生成所述经验池。
具体地,可以将基于所述医疗对话数据获取的其中任一对话动作,作为动作(a),例如,所述动作(a)包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等;并将所述医疗对话数据中在所述动作(a)之前的所有信息和动作共同组成历史信息以形成第一状态(s)。相应地,所述第二状态可以是在所述用户模拟器处于第一状态(s)的情况下,被施加所述动作(a)后,所述环境迁移到的状态(s’);所述奖励分数可以包括在所述用户模拟器处于第一状态(s)的情况下,被施加所述动作(a)后,在所述医疗病例数据作为目标信息的引导下做出的反馈(r)。此时,可以根据四元组<s,a,r,s’>来构建所述经验池。
具体地,如图6所示,可以根据医疗对话数据形成多个四元组,例如<s 1,a 1,r 1,s 1’>至<s n,a n,r n,s n’>,以用于后续构建经验池。在构建经验池的过程中,可以对之前所形成的多个四元组进行评估和筛选,可选地,可以将例如<s i,a i,r i,s i’>、<s j,a j,r j,s j’>至<s k,a k,r k,s k’>这些四元组用以构建经验池。当然,上述经验池的构建方式仅为示例,可选地,也可以将所有第1-n个四元组均置于所述经验池中。
此外,在另一示例中,还可以对所述动作利用可信度进行进一步的评估。也就是说,所述奖励分数还可以包括所述动作(a)的可信度(c’)。也就是说,可以基于所述动作(a)计算其在医疗领域的疾病的集合D={D i}中相应的出现概率和特异性,从而得到所述动作的可信度c,并经平滑和归一化处理得到处理后的可信度c’。随后,也可以根据四元组<s,a,c’r,s’>来构建所述经验池,具体的示例图6中暂未示出。
在步骤S403中,利用所述经验池训练所述强化学习模型。
在本公开的第一示例中,在根据医疗对话数据和医疗病例数据形成所述经验池之后,可以利用所述经验池辅助训练所述强化学习模型。可选地,图6中的DQN和用户模拟器可以进行交互,在交互过程中,以经验池中所包含的四元组进行辅助训练,并以目标信息user goal作为训练的目标,从而通过不断的模拟和迭代更新DQN中的参数,并得到最终的训练结果。可选地,在训练过程中,可以利用训练过程中得到的四元组不断更新所述经验池,也就是说,可以将训练过程中得到的新的四元组加入所述经验池,并可以利用更新的所述经验池对所述强化学习模型进行进一步的训练。在训练所述强化学习模型的过程中, 所形成的四元组中的动作(a)可以由DQN发起,作用于环境。在例如医疗领域中,此时的动作也可以包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等。
可选地,在本公开第一示例中,还可以在训练所述强化学习模型时,额外引入知识图谱等外部知识以辅助决策,在这种情况下,可以同时考虑强化学习模型的训练结果和知识图谱的内容,以做出最终决策,实现进一步改善强化学习模型的训练效果的目的。
第二示例
在本公开实施例的第二示例中,提供了一种用于法律咨询领域的对话处理方法700的示例性流程图,如图7所示。
在步骤S701中,获取法律咨询相关的对话信息。
在本步骤中,所获取的法律咨询相关的对话信息可以例如为法律咨询相关的自然语言文本。可选地,可以基于所述自然语言文本进行理解,并从中提取需处理的有效对话信息。
在步骤S702中,基于强化学习模型生成回复信息。
本步骤中,可以根据所获取的法律咨询相关的对话信息,基于DQN的强化学习模型来生成需要进行反馈的回复信息。其中,所述强化学习模型可以根据下述强化学习模型训练方法训练得到:
首先,可以获取用于训练所述强化学习模型的法律条款数据(作为标注数据)和法律咨询对话数据(作为未标注数据)。本步骤中,所获取的用于训练强化学习模型的数据可以包括从电子法律条款中获取的法律条款数据,并且可以进一步从所述法律条款数据中提取作为user goal的目标信息,例如可以包括法条名称、行为类型、行为表现等各种信息。相应地,所述法律咨询对话数据可以为例如从互联网获取的法律咨询对话数据,从中所提取的法律咨询对话数据的训练信息可以包括对话时间、对话对象、对话内容、法条适用结果等各种信息,可以采用例如json文件的形式进行保存。
随后,基于所述法律咨询对话数据,可以以所述法律条款数据作为目标,生成用于训练所述强化学习模型的经验池。例如,同样可以首先基于法律咨询对话数据提取有效的训练信息,并以法律条款数据所提取的目标信息作为训练目标,通过与环境(用户模拟器)的交互,来生成所述经验池。经验池可以包括一个或多个四元组<s,a,r,s’>或包括置信度c’的<s,a,c’r,s’>,在此不再赘述。
最后,可以利用所述经验池训练所述强化学习模型。例如,可以利用DQN和用户模拟器可以进行交互,在交互过程中,以经验池中所包含的四元组辅助训练,并以目标信息user goal作为训练的目标,以通过不断的模拟和迭代更新DQN中的参数,从而得到最终的训练结果。
在步骤S703中,基于所述回复信息对所述对话信息进行响应。
在本步骤中,可以将所生成的回复信息转化为自然语言并输出,以对所述对话信息进行响应。
下面,参照图8来描述根据本公开实施例的强化学习模型的训练装置。图8示出了根 据本公开实施例的强化学习模型的训练装置800的框图。如图8所示,强化学习模型的训练装置800包括获取单元810、生成单元820和训练单元830。除了这些单元以外,强化学习模型的训练装置800还可以包括其他部件,然而,由于这些部件与本公开实施例的内容无关,因此在这里省略其图示和描述。此外,由于根据本公开实施例的强化学习模型的训练装置800执行的下述操作的具体细节与在上文中参照图1描述的细节相同,因此在这里为了避免重复而省略对相同细节的重复描述。
图8中的强化学习模型的训练装置800的获取单元810获取用于训练所述强化学习模型的未标注数据和标注数据。
获取单元810所获取的用于训练强化学习模型的数据包括标注数据。可选地,标注数据可以是从与所需训练的强化学习模型所在的领域相关的数据库中所获取的数据。在一个示例中,可以从标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息作为例如用户的目标信息(也称user goal)进行保存。所提取的来自标注数据的目标信息可以用于对强化学习模型的直接训练,以向智能体提供反馈,并引导训练过程。可选地,在标注数据中所提取的目标信息可以包括分别与结果(result)、分类标签(tag)等对应的信息。
进一步地,获取单元810所获取的用于训练强化学习模型的数据还可以包括未标注数据。可选地,未标注数据可以通过各种途径进行获取,这些途径可以包括与所需训练的强化学习模型所在的领域相关的未经标注的网页、论坛、聊天记录、数据库等。可选地,未标注数据可以为对话数据。在一个示例中,也可以从未标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息用于后续生成用于训练强化学习模型的经验池。
可选地,在本公开实施例的装置应用于医疗领域时,所述标注数据可以为从例如电子病历等所获取的医疗病例数据,而提取出的目标信息可以包括疾病、症状分类、症状属性等各种信息。相应地,所述未标注数据可以为例如从互联网获取的医疗对话数据,提取出的未标注数据的训练信息则可以包括对话时间、对话对象、对话内容、诊断结果等各种信息。当然,上述内容仅为示例,在本公开实施例中,所述训练装置还可以应用于教育、法律咨询、购物餐饮查询、航班查询、导航等各个其他领域,在此不做限制。
生成单元820基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池。
可选地,生成单元820可以基于未标注数据所提取的有效的训练信息,以标注数据所提取的目标信息作为训练目标,来生成所述经验池。可选地,经验池可以包括由第一状态(s)、动作(a)、奖励分数(r)和第二状态(s’)构成的一个或多个序列,并且可以表示为四元组<s,a,r,s’>。在一个示例中,可以基于未标注数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,当未标注数据为对话数据时,所述动作可以是基于所述对话数据获取的其中任一对话动作;所述第一状态可以包括所述对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动 作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述标注数据作为目标信息的引导下所做出的反馈。
可选地,构建所述经验池的四元组中的奖励分数还可以进一步包括所述动作的可信度(c或c’)。也就是说,在已知所需训练的强化学习模型所在的领域时,可以基于所述动作(a)计算其在该领域的关键信息集合中相应的出现概率和特异性,从而得到所述动作的可信度c,并经平滑和归一化处理得到处理后的可信度c’。随后,可以根据四元组<s,a,c’r,s’>来构建所述经验池。
可选地,在本公开实施例的装置应用于医疗领域时,生成单元820可以基于医疗对话数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,所述动作可以是基于所述医疗对话数据获取的其中任一对话动作,例如,所述动作包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等;所述第一状态可以包括所述医疗对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述医疗病例数据作为目标信息的引导下所做出的反馈。可选地,此时奖励分数中包括的所述动作的可信度可以通过前述公式(1)-(3)进行计算。
其中,D={D i}可以为疾病的集合,疾病的集合可以包括例如使用ICD-10编码的若干个疾病,例如,D i可以表示第i个疾病(i为0或正整数);每个疾病D i可以包括若干个医疗对话数据,例如,
Figure PCTCN2020089394-appb-000007
可以表示针对疾病D i的第j个对话数据(j为0或正整数),则AF可以表示动作a在医疗对话数据
Figure PCTCN2020089394-appb-000008
中所出现的概率,IDF则表示动作a出现在特定疾病下的特异性。从而,可信度AF-IDF可以通过AF和IDF二者的乘积而获取,以反映某个动作a的可信度c。在计算出可信度c之后,可以将其进行平滑和归一化处理,得到处理后的可信度c’,以避免因某些未采集到的疾病影响训练结果。最后,可以根据计算得到的c’形成四元组<s,a,c’r,s’>,以构建所述经验池。
训练单元830利用所述经验池训练所述强化学习模型。
在本公开实施例中,在根据未标注数据和标注数据形成所述经验池之后,训练单元830可以利用所述经验池辅助训练所述强化学习模型。可选地,智能体(例如DQN神经网络)和环境(例如可以为用户模拟器(user simulator))可以进行交互,在交互过程中,以经验池中所包含的四元组(<s,a,r,s’>或<s,a,c’r,s’>)辅助训练,并以标注数据或其提取的目标信息作为训练的目标,以通过不断的模拟和迭代来更新DQN中的参数,从而得到最终的训练结果。可选地,在训练过程中,可以利用训练过程中得到的四元组不断更新所述经验池,也就是说,训练单元830可以将训练过程中得到的新的四元组加入所述经验池,并 利用更新的所述经验池对所述强化学习模型进行训练。在训练所述强化学习模型的过程中,所形成的四元组中的动作(a)可以由DQN发起并作用于环境,而并不取自未标注数据。在例如医疗领域中,此时的动作也可以包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等。
可选地,在本公开实施例中,训练单元830还可以在训练所述强化学习模型时,额外引入外部知识以辅助决策,在这种情况下,可以同时考虑强化学习模型的训练结果和外部知识的内容,以做出最终决策,实现进一步改善强化学习模型的训练效果的目的。在一个示例中,这里的外部指示可以为与强化学习模型相关的数据库,例如知识图谱等。
在本公开实施例中,可选地,所述强化学习模型的训练装置可以用于训练用于对话系统的强化学习模型。其中,根据对话系统的任务类型不同,对话系统可以分为任务导向性系统(Task-oriented Dialogue System)和非任务导向性系统(Non-Task-Oriented Dialogue System)。其中,任务导向型对话系统是指以根据与用户的交流,帮助用户完成特定领域内的任务为目标的一类对话系统。在一个示例中,所述强化学习模型的训练装置可以用于训练用于任务导向性对话系统的强化学习模型,例如,可以用于训练用于医疗对话系统的强化学习模型。当然,上述内容仅为示例,在本公开实施例中,所述训练装置还可以应用于教育、法律咨询、购物餐饮查询、航班查询、导航等各个其他领域相关的对话系统,在此不做限制。
根据本公开实施例的强化学习模型训练装置,能够基于未标注数据和标注数据共同训练强化学习模型,从而有效减少了在训练所述强化学习模型时对标注数据的需求,提高了强化学习模型训练的可行性和稳定性,改善了强化学习模型的训练结果。
下面,参照图9来描述根据本公开实施例的强化学习模型的训练装置900。图9示出了根据本公开实施例的强化学习模型的训练装置900的框图。如图9所示,该装置900可以是计算机或服务器。
如图9所示,强化学习模型的训练装置900包括一个或多个处理器910以及存储器920,当然,除此之外,强化学习模型的训练装置900还可能包括输入装置、输出装置(未示出)等,这些组件可以通过总线系统和/或其它形式的连接机构互连。应当注意,图9所示的强化学习模型的训练装置900的组件和结构只是示例性的,而非限制性的,根据需要,强化学习模型的训练装置900也可以具有其他组件和结构。
处理器910可以是中央处理单元(CPU)或者现场可编程逻辑阵列(FPGA)或者单片机(MCU)或者数字信号处理器(DSP)或者专用集成电路(ASIC)等具有数据处理能力和/或程序执行能力的逻辑运算器件。
处理器910可以利用存储器920中所存储的计算机程序指令以执行期望的功能,可以包括:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
所述的计算机程序指令包括了一个或多个由对应于处理器的指令集架构定义的处理器操作,这些计算机指令可以被一个或多个计算机程序在逻辑上包含和表示。
存储器920可以包括一个或多个计算机程序产品,所述计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器910可以运行所述程序指令,以实现上文所述的本公开实施例的强化学习模型的训练装置的功能以及/或者其它期望的功能,并且/或者可以执行根据本公开实施例的强化学习模型的训练方法。在所述计算机可读存储介质中还可以存储各种应用程序和各种数据。
下面,描述根据本公开实施例的计算机可读存储介质,其上存储有计算机程序指令,其中,所述计算机程序指令被处理器执行时实现以下步骤:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
下面,参照图10来描述根据本公开实施例的对话系统。图10示出了根据本公开实施例的对话系统1000的框图。如图10所示,对话系统1000包括获取单元1010、生成单元1020和响应单元1030。除了这些单元以外,对话系统1000还可以包括其他部件,然而,由于这些部件与本公开实施例的内容无关,因此在这里省略其图示和描述。此外,由于根据本公开实施例的对话系统1000执行的下述操作的具体细节与在上文中参照图2-图3描述的细节相同,因此在这里为了避免重复而省略对相同细节的重复描述。此外,图10所述的对话系统1000也可称聊天信息系统、口语对话系统、交谈代理、聊天者机器人(chatter robot)、聊天者机器人程序(chatterbot)、聊天机器人程序(chatbot)、聊天代理、数字个人助理和自动化在线助理等。对话系统1000可以使用自然语言与人交互以模拟智能交谈,并向用户提供个性化的协助。对话系统可以基于强化学习模型而实现。可选地,图10所示的系统中所基于的强化学习模型可以适用于教育、法律咨询、购物餐饮查询、航班查询、导航等诸多领域。
图10中的对话系统1000的获取单元1010获取对话信息。
获取单元1010所获取的对话信息可以例如为自然语言文本。可选地,可以基于所述自然语言文本进行理解,并从中通过例如分词、语义分析等各种操作而提取需处理的有效对话信息,以供后续的对话处理流程使用。
生成单元1020基于强化学习模型生成回复信息。
生成单元1020可以根据所获取的对话信息,基于例如DQN的强化学习模型来生成需要进行反馈的回复信息。其中,所述强化学习模型可以根据前述的强化学习模型训练方法或训练装置训练得到。
利用前述强化学习模型训练方法训练所述强化学习模型的流程如图3所示。图3示出了 根据本公开实施例的对话系统中所使用的强化学习模型的训练方法的示例性流程图。可选地,图3所涉及的强化学习模型可以适用于教育、法律咨询、购物餐饮查询、航班查询、导航等诸多领域。
在步骤S2021中,获取用于训练所述强化学习模型的未标注数据和标注数据。
本步骤中,所获取的用于训练强化学习模型的数据包括标注数据。可选地,标注数据可以是从与所需训练的强化学习模型所在的领域相关的数据库中所获取的数据。在一个示例中,可以从标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息作为例如用户的目标信息(也称user goal)进行保存。所提取的来自标注数据的目标信息可以用于对强化学习模型的直接训练,以向智能体提供反馈,并引导训练过程。可选地,在标注数据中所提取的目标信息可以包括分别与结果(result)、分类标签(tag)等对应的信息。
进一步地,所获取的用于训练强化学习模型的数据还可以包括未标注数据。可选地,未标注数据可以通过各种途径进行获取,这些途径可以包括与所需训练的强化学习模型所在的领域相关的未经标注的网页、论坛、聊天记录、数据库等。可选地,未标注数据可以为对话数据。在一个示例中,也可以从未标注数据中提取与强化学习模型相关的训练信息,并将提取出的训练信息用于后续生成用于训练强化学习模型的经验池。
可选地,在本公开实施例的方法应用于医疗领域时,所述标注数据可以为从例如电子病历等所获取的医疗病例数据,而提取出的目标信息可以包括疾病、症状分类、症状属性等各种信息。相应地,所述未标注数据可以为例如从互联网获取的医疗对话数据,提取出的未标注数据的训练信息则可以包括对话时间、对话对象、对话内容、诊断结果等各种信息。当然,上述内容仅为示例,在本公开实施例中,所述训练方法还可以应用于教育、法律咨询、购物餐饮查询、航班查询、导航等各个其他领域,在此不做限制。
在步骤S2022中,基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池。
在本步骤中,可选地,可以基于未标注数据所提取的有效的训练信息,以标注数据所提取的目标信息作为训练目标,来生成所述经验池。可选地,经验池可以包括由第一状态(s)、动作(a)、奖励分数(r)和第二状态(s’)构成的一个或多个序列,并且可以表示为四元组<s,a,r,s’>。在一个示例中,可以基于未标注数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,当未标注数据为对话数据时,所述动作可以是基于所述对话数据获取的其中任一对话动作;所述第一状态可以包括所述对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述标注数据作为目标信息的引导下所做出的反馈。
可选地,构建所述经验池的四元组中的奖励分数还可以进一步包括所述动作的可信度(c或c’)。也就是说,在已知所需训练的强化学习模型所在的领域时,可以基于所述动作(a)计算其在该领域的关键信息集合中相应的出现概率和特异性,从而得到所述动作的可信度c,并经平滑和归一化处理得到处理后的可信度c’。随后,可以根据四元组<s,a,c’r,s’>来构建所述经验池。
可选地,在本公开实施例的方法应用于医疗领域时,可以基于医疗对话数据获取动作和当前的第一状态,并通过与环境的交互,获取第二状态和奖励分数。其中,所述动作可以是基于所述医疗对话数据获取的其中任一对话动作,例如,所述动作包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病等;所述第一状态可以包括所述医疗对话数据中在所获取的所述对话动作之前的所有历史信息,而所述历史信息可以由在该对话动作之前的所有信息和动作共同组成。相应地,所述第二状态可以是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态;所述奖励分数可以包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述医疗病例数据作为目标信息的引导下所做出的反馈。可选地,此时奖励分数中包括的所述动作的可信度可以通过前述公式(1)-(3)进行计算。
其中,D={D i}可以为疾病的集合,疾病的集合可以包括例如使用ICD-10编码的若干个疾病,例如,D i可以表示第i个疾病(i为0或正整数);每个疾病D i可以包括若干个医疗对话数据,例如,
Figure PCTCN2020089394-appb-000009
可以表示针对疾病D i的第j个对话数据(j为0或正整数),则AF可以表示动作a在医疗对话数据
Figure PCTCN2020089394-appb-000010
中所出现的概率,IDF则表示动作a出现在特定疾病下的特异性。从而,可信度AF-IDF可以通过AF和IDF二者的乘积而获取,以反映某个动作a的可信度c。在计算出可信度c之后,可以将其进行平滑和归一化处理,得到处理后的可信度c’,以避免因某些未采集到的疾病影响训练结果。最后,可以根据计算得到的c’形成四元组<s,a,c’r,s’>,以构建所述经验池。
在步骤S2023中,利用所述经验池训练所述强化学习模型。
在本公开实施例中,在根据未标注数据和标注数据形成所述经验池之后,可以利用所述经验池辅助训练所述强化学习模型。可选地,智能体(例如DQN神经网络)和环境(例如可以为用户模拟器(user simulator))可以进行交互,在交互过程中,以经验池中所包含的四元组(<s,a,r,s’>或<s,a,c’r,s’>)辅助训练,并以标注数据或其提取的目标信息作为训练的目标,以通过不断的模拟和迭代来更新DQN中的参数,从而得到最终的训练结果。可选地,在训练过程中,可以利用训练过程中得到的四元组不断更新所述经验池,也就是说,可以将训练过程中得到的新的四元组加入所述经验池。从而,利用所述经验池训练所述强化学习模型还可以包括:在训练所述强化学习模型的过程中,根据训练结果更新所述经验池;利用更新的所述经验池对所述强化学习模型进行训练。在训练所述强化学习模型的过程中,所形成的四元组中的动作(a)可以由DQN发起,作用于环境。在例如医疗领域中,此时的动作也可以包括但不限于:开始对话、结束对话、请求症状信息、确诊疾病 等。
可选地,在本公开实施例中,还可以在训练所述强化学习模型时,额外引入外部知识以辅助决策,在这种情况下,可以同时考虑强化学习模型的训练结果和外部知识的内容,以做出最终决策,实现进一步改善强化学习模型的训练效果的目的。在一个示例中,这里的外部指示可以为与强化学习模型相关的数据库,例如知识图谱等。
回到图10,响应单元1030基于所述回复信息对所述对话信息进行响应。
响应单元1030可以将所生成的回复信息转化为自然语言并输出,以对所述对话信息进行响应。
根据本公开实施例的对话系统,能够基于未标注数据和标注数据共同训练强化学习模型,从而有效减少了在训练所述强化学习模型时对标注数据的需求,提高了强化学习模型训练的可行性和稳定性,改善了强化学习模型的训练结果。
下面,参照图11来描述根据本公开实施例的对话系统1100。图11示出了根据本公开实施例的对话系统1100的框图。如图11所示,该装置1100可以是计算机或服务器。
如图11所示,对话系统1100包括一个或多个处理器1110以及存储器1120,当然,除此之外,对话系统1100还可能包括输入装置、输出装置(未示出)等,这些组件可以通过总线系统和/或其它形式的连接机构互连。应当注意,图11所示的对话系统1100的组件和结构只是示例性的,而非限制性的,根据需要,对话系统1100也可以具有其他组件和结构。
处理器1110可以是中央处理单元(CPU)或者现场可编程逻辑阵列(FPGA)或者单片机(MCU)或者数字信号处理器(DSP)或者专用集成电路(ASIC)等具有数据处理能力和/或程序执行能力的逻辑运算器件,并且可以利用存储器1120中所存储的计算机程序指令以执行期望的功能,可以包括:获取对话信息;基于强化学习模型生成回复信息;基于所述回复信息对所述对话信息进行响应;其中,所述强化学习模型是通过如下方法训练得到的:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
存储器1120可以包括一个或多个计算机程序产品,所述计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。在所述计算机可读存储介质上可以存储一个或多个计算机程序指令,处理器1110可以运行所述程序指令,以实现上文所述的本公开实施例的强化学习模型的训练装置的功能以及/或者其它期望的功能,并且/或者可以执行根据本公开实施例的对话处理方法。在所述计算机可读存储介质中还可以存储各种应用程序和各种数据。
下面,描述根据本公开实施例的计算机可读存储介质,其上存储有计算机程序指令,其中,所述计算机程序指令被处理器执行时实现以下步骤:获取对话信息;基于强化学习 模型生成回复信息;基于所述回复信息对所述对话信息进行响应;其中,所述强化学习模型是通过如下方法训练得到的:获取用于训练所述强化学习模型的未标注数据和标注数据;基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;利用所述经验池训练所述强化学习模型。
如前面所述,本公开实施例的对话系统特别适用于医疗领域。与其他领域相比,医疗领域的标注数据特别少,因为对数据进行标注的要求比较高,即,需要专业性较高且经验较丰富的医生进行标注,以提高专业性和准确性。通过采用本申请的强化学习模型的训练方法,能够基于未标注数据和标注数据共同训练强化学习模型,从而降低了对医生的专业以及经验性的依赖和要求,有效减少了在训练所述强化学习模型时对标注数据的需求。
图12示出了根据本公开实施例的医疗对话系统的用户界面1200的示意图。
可选地,该医疗对话系统中涉及的模型可以采用前文所描述的强化学习模型的训练方法来进行训练而得到。该模型可以以计算机程序指令集的形式存储。该医疗对话系统可以包括:用户界面;处理器;存储器,其上存储有计算机程序指令,在所述计算机程序指令被所述处理器运行时,使得所述处理器执行以下步骤。
如图12所示,首先医疗对话系统从用户接收自然语言输入信息,并在用户界面1200上(例如靠右侧)显示该自然语言输入信息。自然语言输入信息可以通过语音或文字输入。例如,如框1201所示,用户用文字输入“我有点头晕,吃饭的时候恶心”的自然语言输入信息。
可选地,医疗对话系统对所述自然语言输入信息执行命名实体识别处理以提取症状信息。例如,医疗对话系统从自然语言输入信息“我有点头晕,吃饭的时候恶心”中提取到“头晕”和/或“恶心”的症状信息,以下以“头晕”为例。
然后,医疗对话系统在所述用户界面1200上(例如靠左侧)显示与所述自然语言输入信息中提及的症状相关联的一个或多个问题,以实现多轮问答,并且针对每个问题:从用户接收针对该问题的答案,并在所述用户界面上靠右侧显示所述答案。
具体地,在提取了“头晕”的症状信息后,医疗对话系统在用户界面上(在框1201下方靠左侧)显示询问何时出现头晕的问题(框1202)。该问题可以与用户可以在其中进行选择的多个答案选项一起给出。用户针对该问题给出答案并且该答案将在用户界面上靠右侧(在框1202下方)显示,例如用户在回答问题(1202)时选择了“近几天”的选项,则在框1203显示文本“近几天”。然后,进行下一轮问答,医疗对话系统在用户界面上(在框1203下方靠左侧)显示询问关于头晕的频率的问题(框1204)。类似地,该问题可以与用户可以在其中进行选择的多个答案选项一起给出。用户针对该问题给出答案并且该答案将在用户界面上(在框1204下方靠右侧)显示,例如用户在回答问题(1204)时选择了“每周发作三次及以上”的选项,则在框1205显示文本“每周发作三次及以上”。以此类推地完成多轮问答。虽然图中只示出了两轮问答,但是根据该医疗对话系统的强化学习模型的训练方式,可以有更多轮的问题,本公开对此不作限制。
最后,医疗对话系统在多轮问答结束之后,生成并在用户界面上显示针对所述症状的诊断结果,例如框1206所示。
可选地,诊断结果包括以下各项中的至少一个:可能的疾病类型、可能的疾病类型具有的症状、适用于可能的疾病类型的推荐药物、推荐药物所针对的症状、了解推荐药物更多信息的链接等等。
可选地,诊断结果还可以包括该症状可能对应的各种疾病类型的概率。
可选地,所述诊断结果以自然语言的形式输出并显示在所述用户界面上,例如框1206所示。
在上述实施例中,为了便于理解和描述,使用了与所要执行的功能对应的功能单元的描述方式,容易理解,这些功能单元是功能实体,不一定必须与物理或逻辑上独立的实体相对应。可以通过通用处理器运行对应功能的软件以执行计算机指令的形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中可编程地实现这些功能实体,或设计为专门执行对应功能的集成电路来实现这些功能实体。
例如,通用处理器可以是中央处理器(CPU)、单片机(MCU)、数字信号处理器(DSP)等。
例如,可编程的集成电路可以是现场可编程逻辑电路(FPGA)。
例如,专门的集成电路可以是专用集成电路(ASIC),如Tensor Processing Unit(TPU)。
技术中的程序部分可以被认为是以可执行的代码和/或相关数据的形式而存在的“产品”或“制品”,通过计算机可读的介质所参与或实现的。有形的、永久的储存介质可以包括任何计算机、处理器、或类似设备或相关的模块所用到的内存或存储器。例如,各种半导体存储器、磁带驱动器、磁盘驱动器或者类似任何能够为软件提供存储功能的设备。
所有软件或其中的一部分有时可能会通过网络进行通信,如互联网或其他通信网络。此类通信可以将软件从一个计算机设备或处理器加载到另一个。例如:从图像检索设备的一个服务器或主机计算机加载至一个计算机环境的硬件平台,或其他实现系统的计算机环境,或与提供图像检索所需要的信息相关的类似功能的系统。因此,另一种能够传递软件元素的介质也可以被用作局部设备之间的物理连接,例如光波、电波、电磁波等,通过电缆、光缆或者空气等实现传播。用来载波的物理介质如电缆、无线连接或光缆等类似设备,也可以被认为是承载软件的介质。在这里的用法除非限制了有形的“储存”介质,其他表示计算机或机器“可读介质”的术语都表示在处理器执行任何指令的过程中参与的介质。
本申请使用了特定词语来描述本申请的实施例。如“第一/第二实施例”、“一实施例”、和/或“一些实施例”意指与本申请至少一个实施例相关的某一特征、结构或特点。因此,应强调并注意的是,本说明书中在不同位置两次或多次提及的“一实施例”或“一个实施例”或“一替代性实施例”并不一定是指同一实施例。此外,本申请的一个或多个实施例中的某些特征、结构或特点可以进行适当的组合。
此外,本领域技术人员可以理解,本申请的各方面可以通过若干具有可专利性的种类 或情况进行说明和描述,包括任何新的和有用的工序、机器、产品或物质的组合,或对他们的任何新的和有用的改进。相应地,本申请的各个方面可以完全由硬件执行、可以完全由软件(包括固件、常驻软件、微码等)执行、也可以由硬件和软件组合执行。以上硬件或软件均可被称为“数据块”、“模块”、“引擎”、“单元”、“组件”或“系统”。此外,本申请的各方面可能表现为位于一个或多个计算机可读介质中的计算机产品,该产品包括计算机可读程序编码。
除非另有定义,这里使用的所有术语(包括技术和科学术语)具有与本公开所属领域的普通技术人员共同理解的相同含义。还应当理解,诸如在通常字典里定义的那些术语应当被解释为具有与它们在相关技术的上下文中的含义相一致的含义,而不应用理想化或极度形式化的意义来解释,除非这里明确地这样定义。
上面是对本公开的说明,而不应被认为是对其的限制。尽管描述了本公开的若干示例性实施例,但本领域技术人员将容易地理解,在不背离本公开的新颖教学和优点的前提下可以对示例性实施例进行许多修改。因此,所有这些修改都意图包含在权利要求书所限定的本公开范围内。应当理解,上面是对本公开的说明,而不应被认为是限于所公开的特定实施例,并且对所公开的实施例以及其他实施例的修改意图包含在所附权利要求书的范围内。本公开由权利要求书及其等效物限定。

Claims (30)

  1. 一种强化学习模型的训练方法(100),包括:
    获取用于训练所述强化学习模型的未标注数据和标注数据(S101);
    基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池(S102);
    利用所述经验池训练所述强化学习模型(S103)。
  2. 如权利要求1所述的方法(100),其中,所述基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池(S102)包括:
    基于所述未标注数据,通过与环境进行交互生成所述经验池。
  3. 如权利要求2所述的方法(100),其中,
    所述经验池包括由第一状态、动作、奖励分数和第二状态构成的序列;
    其中,所述第一状态和动作是基于所述未标注数据获取的;所述第二状态是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态。
  4. 如权利要求3所述的方法(100),其中,
    所述奖励分数包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述标注数据的引导下做出的反馈。
  5. 如权利要求3所述的方法(100),其中,
    所述奖励分数还包括所述动作的可信度。
  6. 如权利要求1所述的方法(100),其中,所述利用所述经验池训练所述强化学习模型(S103)还包括:
    在训练所述强化学习模型的过程中,根据训练结果更新所述经验池;
    利用更新的所述经验池对所述强化学习模型进行训练。
  7. 如权利要求3所述的方法(100),其中,
    所述未标注数据为医疗对话数据;和/或
    所述标注数据为医疗病例数据。
  8. 如权利要求7所述的方法(100),其中,
    所述动作是基于所述医疗对话数据获取的任一对话动作;
    所述第一状态是所述医疗对话数据中在所获取的所述对话动作之前的所有历史信息。
  9. 如权利要求1-8任一项所述的方法(100),其中,所述训练方法用于训练用于医疗对话系统的强化学习模型。
  10. 一种对话处理方法(200),包括:
    获取对话信息(S201);
    基于强化学习模型生成回复信息(S202);
    基于所述回复信息对所述对话信息进行响应(S203);
    其中,所述强化学习模型是通过如下方法训练得到的:
    获取用于训练所述强化学习模型的未标注数据和标注数据(S2021);
    基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池(S2022);
    利用所述经验池训练所述强化学习模型(S2023)。
  11. 如权利要求10所述的方法(200),其中,所述基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池(S2022)包括:
    基于所述未标注数据,通过与环境进行交互生成所述经验池。
  12. 如权利要求11所述的方法(200),其中,
    所述经验池包括由第一状态、动作、奖励分数和第二状态构成的序列;
    其中,所述第一状态和动作是基于所述未标注数据获取的;所述第二状态是在所述环境处于第一状态的情况下,被施加所述动作后,所述环境迁移到的状态。
  13. 如权利要求12所述的方法(200),其中,
    所述奖励分数包括在所述环境处于第一状态的情况下,被施加所述动作后,在所述标注数据的引导下做出的反馈。
  14. 如权利要求12所述的方法(200),其中,
    所述奖励分数还包括所述动作的可信度。
  15. 如权利要求12所述的方法(200),其中,
    所述未标注数据为医疗对话数据;和/或
    所述标注数据为医疗病例数据。
  16. 如权利要求15所述的方法(200),其中,
    所述动作是基于所述医疗对话数据获取的任一对话动作;
    所述第一状态是所述医疗对话数据中在所述对话动作之前的所有历史信息。
  17. 如权利要求10-16中任一项所述的方法(200),其中,所述基于所述回复信息对所述对话信息进行响应(S203)包括:
    将所述回复信息转化为自然语言并输出。
  18. 一种强化学习模型的训练装置(800),包括:
    获取单元(810),配置为获取用于训练所述强化学习模型的未标注数据和标注数据;
    生成单元(820),配置为基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;
    训练单元(830),配置为利用所述经验池训练所述强化学习模型。
  19. 一种强化学习模型的训练装置(900),包括:
    处理器(910);
    存储器(920);和
    存储在所述存储器中的计算机程序指令,在所述计算机程序指令被所述处理器运行时,使得所述处理器执行以下步骤:
    获取用于训练所述强化学习模型的未标注数据和标注数据;
    基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;
    利用所述经验池训练所述强化学习模型。
  20. 一种计算机可读存储介质,其上存储有计算机可读的指令,当利用计算机执行所述指令时,执行权利要求1-9中任一项所述的方法。
  21. 一种对话系统(1000),包括:
    获取单元(1010),配置为获取对话信息;
    生成单元(1020),配置为基于强化学习模型生成回复信息;
    响应单元(1030),配置为基于所述回复信息对所述对话信息进行响应;
    其中,所述强化学习模型是通过如下方法训练得到的:
    获取用于训练所述强化学习模型的未标注数据和标注数据;
    基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;
    利用所述经验池训练所述强化学习模型。
  22. 一种对话系统(1100),包括:
    处理器(1110);
    存储器(1120);和
    存储在所述存储器中的计算机程序指令,在所述计算机程序指令被所述处理器运行时,使得所述处理器执行以下步骤:
    获取对话信息;
    基于强化学习模型生成回复信息;
    基于所述回复信息对所述对话信息进行响应;
    其中,所述强化学习模型是通过如下方法训练得到的:
    获取用于训练所述强化学习模型的未标注数据和标注数据;
    基于所述未标注数据,参考所述标注数据生成用于训练所述强化学习模型的经验池;
    利用所述经验池训练所述强化学习模型。
  23. 一种计算机可读存储介质,其上存储有计算机可读的指令,当利用计算机执行所述指令时,执行权利要求10-17中任一项所述的方法。
  24. 一种医疗对话系统,包括:
    用户界面;
    处理器;
    存储器,其上存储有计算机程序指令,在所述计算机程序指令被所述处理器运行时, 使得所述处理器执行以下步骤:
    从用户接收自然语言输入信息,并在所述用户界面上显示所述自然语言输入信息;
    在所述用户界面上显示与所述自然语言输入信息中提及的症状相关联的一个或多个问题;
    针对每个问题:从用户接收针对该问题的答案,并在所述用户界面上显示所述答案,以及
    在问答结束之后,生成并在用户界面上显示针对所述症状的诊断结果。
  25. 根据权利要求24所述的医疗对话系统,其中,所述诊断结果包括以下各项中的至少一个:可能的疾病类型、可能的疾病类型具有的症状、适用于可能的疾病类型的推荐药物、推荐药物所针对的症状、了解推荐药物更多信息的链接。
  26. 根据权利要求25所述的医疗对话系统,其中,所述诊断结果以自然语言的形式输出并显示在所述用户界面上。
  27. 根据权利要求24所述的医疗对话系统,其中,所述问题包括多个选项,使得用户选择多个选项中的一个作为所述答案。
  28. 根据权利要求24所述的医疗对话系统,其中,所述指令还使得所述处理器对所述自然语言输入信息执行命名实体识别处理以提取症状信息。
  29. 根据权利要求24所述的医疗对话系统,其中,所述计算机程序指令还包括强化学习模型的指令集,其中所述强化学习模型是根据权利要求1-8中任一项所述的训练方法来进行训练的。
  30. 根据权利要求29所述的医疗对话系统,其中,所述医疗对话系统还基于知识图谱来生成所述诊断结果,
    其中,所述知识图谱包括M种疾病和N种症状的节点以及各种疾病与各种症状之间的对应关系,其中M和N为大于等于1的整数,以及针对每种疾病的推荐药物、预防手段、治疗方案、和病因。
PCT/CN2020/089394 2019-05-10 2020-05-09 训练方法和装置、对话处理方法和系统及介质 WO2020228636A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/413,373 US20220092441A1 (en) 2019-05-10 2020-05-09 Training method and apparatus, dialogue processing method and system, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910390546.5A CN111914069A (zh) 2019-05-10 2019-05-10 训练方法和装置、对话处理方法和系统及介质
CN201910390546.5 2019-05-10

Publications (1)

Publication Number Publication Date
WO2020228636A1 true WO2020228636A1 (zh) 2020-11-19

Family

ID=73242293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089394 WO2020228636A1 (zh) 2019-05-10 2020-05-09 训练方法和装置、对话处理方法和系统及介质

Country Status (3)

Country Link
US (1) US20220092441A1 (zh)
CN (1) CN111914069A (zh)
WO (1) WO2020228636A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360618A (zh) * 2021-06-07 2021-09-07 暨南大学 一种基于离线强化学习的智能机器人对话方法及系统

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11645498B2 (en) * 2019-09-25 2023-05-09 International Business Machines Corporation Semi-supervised reinforcement learning
CN112434756A (zh) * 2020-12-15 2021-03-02 杭州依图医疗技术有限公司 医学数据的训练方法、处理方法、装置及存储介质
CN112702423B (zh) * 2020-12-23 2022-05-03 杭州比脉科技有限公司 一种基于物联网互动娱乐模式的机器人学习系统
US11790168B2 (en) * 2021-01-29 2023-10-17 Ncr Corporation Natural language and messaging system integrated group assistant
CN117931488A (zh) * 2022-10-17 2024-04-26 戴尔产品有限公司 对故障诊断的方法、设备和计算机程序产品

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107342078A (zh) * 2017-06-23 2017-11-10 上海交通大学 对话策略优化的冷启动系统和方法
CN107911299A (zh) * 2017-10-24 2018-04-13 浙江工商大学 一种基于深度q学习的路由规划方法
CN108600379A (zh) * 2018-04-28 2018-09-28 中国科学院软件研究所 一种基于深度确定性策略梯度的异构多智能体协同决策方法
CN109710741A (zh) * 2018-12-27 2019-05-03 中山大学 一种面向在线问答平台的基于深度强化学习的问题标注方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176800B2 (en) * 2017-02-10 2019-01-08 International Business Machines Corporation Procedure dialogs using reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107342078A (zh) * 2017-06-23 2017-11-10 上海交通大学 对话策略优化的冷启动系统和方法
CN107911299A (zh) * 2017-10-24 2018-04-13 浙江工商大学 一种基于深度q学习的路由规划方法
CN108600379A (zh) * 2018-04-28 2018-09-28 中国科学院软件研究所 一种基于深度确定性策略梯度的异构多智能体协同决策方法
CN109710741A (zh) * 2018-12-27 2019-05-03 中山大学 一种面向在线问答平台的基于深度强化学习的问题标注方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360618A (zh) * 2021-06-07 2021-09-07 暨南大学 一种基于离线强化学习的智能机器人对话方法及系统
CN113360618B (zh) * 2021-06-07 2022-03-11 暨南大学 一种基于离线强化学习的智能机器人对话方法及系统

Also Published As

Publication number Publication date
CN111914069A (zh) 2020-11-10
US20220092441A1 (en) 2022-03-24

Similar Documents

Publication Publication Date Title
WO2020228636A1 (zh) 训练方法和装置、对话处理方法和系统及介质
Dharwadkar et al. A medical chatbot
US20230105969A1 (en) Computer-based system for applying machine learning models to healthcare related data
US20190311814A1 (en) Systems and methods for responding to healthcare inquiries
US10331659B2 (en) Automatic detection and cleansing of erroneous concepts in an aggregated knowledge base
US20230229898A1 (en) Data processing method and related device
CN111666477B (zh) 一种数据处理方法、装置、智能设备及介质
US10984024B2 (en) Automatic processing of ambiguously labeled data
US11295861B2 (en) Extracted concept normalization using external evidence
Li et al. Extracting medical knowledge from crowdsourced question answering website
Liu et al. Augmented LSTM framework to construct medical self-diagnosis android
US20210406640A1 (en) Neural Network Architecture for Performing Medical Coding
CN112052318A (zh) 一种语义识别方法、装置、计算机设备和存储介质
WO2021114635A1 (zh) 患者分群模型构建方法、患者分群方法及相关设备
WO2022068160A1 (zh) 基于人工智能的重症问诊数据识别方法、装置、设备及介质
WO2021151356A1 (zh) 分诊数据处理方法、装置、计算机设备及存储介质
JP2022500713A (ja) 機械支援対話システム、ならびに病状問診装置およびその方法
CN113707299A (zh) 基于问诊会话的辅助诊断方法、装置及计算机设备
WO2020016103A1 (en) Simulating patients for developing artificial intelligence based medical conditions
Kim et al. Constructing novel datasets for intent detection and ner in a korean healthcare advice system: guidelines and empirical results
US20240046127A1 (en) Dynamic causal discovery in imitation learning
KS et al. Conversational Chatbot Builder–Smarter Virtual Assistance with Domain Specific AI
Luo et al. Knowledge grounded conversational symptom detection with graph memory networks
Hwang et al. End-to-end dialogue system with multi languages for hospital receptionist robot
Song et al. Building Conversational Diagnosis Systems for Fine-Grained Diseases Using Few Annotated Data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20805527

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20805527

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20805527

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 220722)

122 Ep: pct application non-entry in european phase

Ref document number: 20805527

Country of ref document: EP

Kind code of ref document: A1