CN115438149A - End-to-end model training method and device, computer equipment and storage medium - Google Patents

End-to-end model training method and device, computer equipment and storage medium

Publication number
CN115438149A
Authority
CN
China
Prior art keywords
training
model
entity
data
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210981217.XA
Other languages
Chinese (zh)
Inventor
刘佳瑞
王世朋
姚海申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210981217.XA
Publication of CN115438149A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The embodiments of the present application belong to the technical field of natural language processing in artificial intelligence, and relate to an end-to-end model training method and device suitable for multiple Chinese medical language processing tasks, computer equipment, and a storage medium. The application also relates to blockchain technology: the user's target sequence model can be stored in a blockchain. The application creates an initial sequence model from the mT5-small model of the Seq2Seq framework and pre-trains it on an entity recognition task and a tail prediction task using a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, which effectively improves accuracy on multiple Chinese medical language processing tasks.

Description

An end-to-end model training method, device, computer equipment and storage medium

Technical field

The present application relates to the technical field of natural language processing in artificial intelligence, and in particular to an end-to-end model training method, device, computer equipment and storage medium suitable for multiple Chinese medical language processing tasks.

Background

For the different NLP tasks that involve medical knowledge, existing training schemes train a separate model for each task: NLU tasks are trained with BERT-like models, text generation tasks are trained with GPT, and NER tasks are trained with LSTM-based models.

However, the applicant has found that this traditional approach cannot use information from other tasks in the same domain and therefore cannot learn the medical knowledge hidden in those tasks. In addition, for tasks with few samples, large models such as BERT are prone to overfitting. As a result, traditional models for multiple Chinese medical language processing tasks have low prediction accuracy.

Summary of the invention

The purpose of the embodiments of the present application is to provide an end-to-end model training method, device, computer equipment and storage medium suitable for multiple Chinese medical language processing tasks, so as to solve the problem that traditional models for multiple Chinese medical language processing tasks have low prediction accuracy.

To solve the above technical problem, an embodiment of the present application provides an end-to-end model training method suitable for multiple Chinese medical language processing tasks, which adopts the following technical solution:

obtaining medical corpus data corresponding to the medical field;

performing a preprocessing operation on the medical corpus data to obtain training corpus data;

performing an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity;

creating an initial sequence model from the mT5-small model of the Seq2seq framework;

constructing entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt;

performing an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entities as label information;

constructing tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt;

performing a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information;

taking the original sequence model, after it has completed the entity recognition training operation and the tail prediction training operation, as the target sequence model.

To solve the above technical problem, an embodiment of the present application further provides an end-to-end model training device suitable for multiple Chinese medical language processing tasks, which adopts the following technical solution:

a data acquisition module, configured to obtain medical corpus data corresponding to the medical field;

a preprocessing module, configured to perform a preprocessing operation on the medical corpus data to obtain training corpus data;

an entity matching module, configured to perform an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity;

a model creation module, configured to create an initial sequence model from the mT5-small model of the Seq2seq framework;

an entity recognition data construction module, configured to construct entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt;

an entity recognition training module, configured to perform an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entities as label information;

a tail prediction data construction module, configured to construct tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt;

a tail prediction training module, configured to perform a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information;

a model confirmation module, configured to take the original sequence model, after it has completed the entity recognition training operation and the tail prediction training operation, as the target sequence model.

To solve the above technical problem, an embodiment of the present application further provides computer equipment, which adopts the following technical solution:

The computer equipment comprises a memory and a processor. Computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the steps of the above end-to-end model training method suitable for multiple Chinese medical language processing tasks are implemented.

To solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:

Computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the steps of the above end-to-end model training method suitable for multiple Chinese medical language processing tasks are implemented.

The present application provides an end-to-end model training method suitable for multiple Chinese medical language processing tasks, comprising: obtaining medical corpus data corresponding to the medical field; performing a preprocessing operation on the medical corpus data to obtain training corpus data; performing an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity; creating an initial sequence model from the mT5-small model of the Seq2seq framework; constructing entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt; performing an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entities as label information; constructing tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt; performing a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information; and taking the original sequence model, after it has completed the entity recognition training operation and the tail prediction training operation, as the target sequence model. Compared with the prior art, the present application creates an initial sequence model from the mT5-small model of the Seq2seq framework and pre-trains it on the entity recognition task and the tail prediction task using a large amount of medical corpus data, so that the pre-trained sequence model can learn the medical knowledge hidden in other tasks, which effectively improves accuracy on multiple Chinese medical language processing tasks.

Brief description of the drawings

To explain the solutions in the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;

FIG. 2 is an implementation flowchart of the end-to-end model training method suitable for multiple Chinese medical language processing tasks provided by Embodiment 1 of the present application;

FIG. 3 is a flowchart of another specific implementation of the end-to-end model training method provided by Embodiment 1 of the present application;

FIG. 4 is a flowchart of a specific implementation of step S202 in FIG. 2;

FIG. 5 is a flowchart of yet another specific implementation of the end-to-end model training method provided by Embodiment 1 of the present application;

FIG. 6 is a flowchart of a specific implementation of step S202 in FIG. 5;

FIG. 7 is a flowchart of a specific implementation of obtaining the semantic analysis model provided by Embodiment 1 of the present application;

FIG. 8 is a schematic structural diagram of the end-to-end model training device suitable for multiple Chinese medical language processing tasks provided by Embodiment 2 of the present application;

FIG. 9 is a schematic structural diagram of another specific implementation of the end-to-end model training device suitable for multiple Chinese medical language processing tasks provided by Embodiment 2 of the present application;

FIG. 10 is a schematic structural diagram of an embodiment of the computer equipment according to the present application.

Detailed description

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit the present application. The terms "comprising" and "having", and any variations of them, in the specification, the claims, and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, the claims, or the above drawings are used to distinguish different objects, not to describe a specific order.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Occurrences of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

To help those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications can be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.

The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The server 105 may be a server that provides various services, for example a background server that supports the pages displayed on the terminal devices 101, 102, 103.

It should be noted that the end-to-end model training method suitable for multiple Chinese medical language processing tasks provided by the embodiments of the present application is generally executed by the server/terminal device; correspondingly, the end-to-end model training device suitable for multiple Chinese medical language processing tasks is generally arranged in the server/terminal device.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are only illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.

Embodiment 1

Continuing to refer to FIG. 2, an implementation flowchart of the end-to-end model training method suitable for multiple Chinese medical language processing tasks provided by Embodiment 1 of the present application is shown. For ease of description, only the parts relevant to the present application are shown.

The above end-to-end model training method suitable for multiple Chinese medical language processing tasks includes the following steps:

Step S201: obtain medical corpus data corresponding to the medical field;

Step S202: perform a preprocessing operation on the medical corpus data to obtain training corpus data;

Step S203: perform an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity;

Step S204: create an initial sequence model from the mT5-small model of the Seq2seq framework;

Step S205: construct entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt;

In the embodiments of the present application, the above training data is created by using a special token as the soft prompt and a task description as the hard prompt. A hard prompt is a human-readable prompt composed of concrete Chinese or English words.

Step S206: perform an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entities as label information;

Step S207: construct tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt;

Step S208: perform a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information;

Step S209: take the original sequence model, after it has completed the entity recognition training operation and the tail prediction training operation, as the target sequence model.
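As an illustration of steps S205 and S207, the following Python sketch builds text-to-text training pairs from a soft prompt, a hard prompt, and the task payload. The token strings (`<ner>`, `<tail>`), the wording of the hard prompts, the `|` separator, and the serialization of the label are all assumptions made for this example; the application does not specify them.

```python
from dataclasses import dataclass

# Hypothetical soft prompts: one reserved token per task. In practice such
# tokens would be added to the tokenizer so their embeddings can be learned.
NER_SOFT_PROMPT = "<ner>"
TAIL_SOFT_PROMPT = "<tail>"

# Hypothetical hard prompts: human-readable task descriptions.
NER_HARD_PROMPT = "Identify the medical entities in the text:"
TAIL_HARD_PROMPT = "Predict the tail entity for the head entity and relation:"


@dataclass(frozen=True)
class Seq2SeqExample:
    source: str  # input data fed to the encoder
    target: str  # label information the decoder should generate


def build_ner_example(text, triple):
    """S205: soft prompt + hard prompt + corpus text; the matched triple is the label."""
    source = f"{NER_SOFT_PROMPT} {NER_HARD_PROMPT} {text}"
    target = " | ".join(triple)  # assumed label format: head | relation | tail
    return Seq2SeqExample(source, target)


def build_tail_example(head, relation, tail):
    """S207: soft prompt + hard prompt + head entity + relation; the tail is the label."""
    source = f"{TAIL_SOFT_PROMPT} {TAIL_HARD_PROMPT} {head} | {relation}"
    return Seq2SeqExample(source, tail)
```

Because both tasks share the same source/target text format, the same sequence model can be trained on either kind of pair without task-specific output layers.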

In some optional implementations of this embodiment, external knowledge is introduced to improve the accuracy of the model when generating medical text, and two steps are designed to strengthen the model's ability to use that knowledge. 1: Knowledge selection: the model is trained so that its input is text and its output is the triples in the knowledge graph (KG) that are related to the text. 2: Knowledge infusion (knowledge indorsation): the related knowledge obtained from the knowledge graph is concatenated with the dialogue, and the result is used as the model input to generate the reply.
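The two steps can be sketched as follows. In the application, knowledge selection is performed by the trained model itself; the substring match below is only a stand-in so the example runs on its own, and the `knowledge:` / `dialogue:` field markers are hypothetical.

```python
def select_triples(text, knowledge_graph):
    """Stand-in for knowledge selection: keep triples whose head entity occurs in the text.

    knowledge_graph is a list of (head, relation, tail) tuples.
    """
    return [triple for triple in knowledge_graph if triple[0] in text]


def infuse_knowledge(dialogue, triples):
    """Knowledge infusion: concatenate the selected triples with the dialogue as model input."""
    knowledge = " ; ".join(f"{head} {relation} {tail}" for head, relation, tail in triples)
    return f"knowledge: {knowledge} dialogue: {dialogue}"
```

For a dialogue turn such as "Can I take aspirin?", only triples headed by "aspirin" would be prepended to the model input under this simplified matching rule.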

The embodiments of the present application thus provide an end-to-end model training method suitable for multiple Chinese medical language processing tasks, comprising: obtaining medical corpus data corresponding to the medical field; performing a preprocessing operation on the medical corpus data to obtain training corpus data; performing an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity; creating an initial sequence model from the mT5-small model of the Seq2seq framework; constructing entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt; performing an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entities as label information; constructing tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt; performing a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information; and taking the original sequence model, after it has completed the entity recognition training operation and the tail prediction training operation, as the target sequence model. Compared with the prior art, the present application creates an initial sequence model from the mT5-small model of the Seq2seq framework and pre-trains it on the entity recognition task and the tail prediction task using a large amount of medical corpus data, so that the pre-trained sequence model can learn the medical knowledge hidden in other tasks, which effectively improves accuracy on multiple Chinese medical language processing tasks.

Continuing to refer to FIG. 3, a flowchart of another specific implementation of the end-to-end model training method provided by Embodiment 1 of the present application is shown. For ease of description, only the parts relevant to the present application are shown.

In some optional implementations of this embodiment, the method further includes steps S301 and S302 after step S204 and before step S209, and step S209 includes step S303.

Step S301: construct article summary training data from the article content, an article summary soft prompt, and an article summary hard prompt.

Step S302: perform an article summary training operation on the initial sequence model, with the article summary training data as input data and the article title as label information.

Step S303: take the original sequence model, after it has completed the entity recognition training operation, the tail prediction training operation, and the article summary training operation, as the target sequence model.

Continuing to refer to FIG. 4, a flowchart of a specific implementation of step S202 in FIG. 2 is shown. For ease of description, only the parts relevant to the present application are shown.

In some optional implementations of this embodiment, step S202 specifically includes step S401 and/or step S402, wherein:

Step S401: perform a similar-text deduplication operation on the medical corpus data according to the Jaccard similarity algorithm.

In the embodiments of the present application, the Jaccard similarity algorithm is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the more similar the samples.
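For two sets A and B the Jaccard coefficient is |A ∩ B| / |A ∪ B|. The sketch below applies it to deduplication as in step S401. Representing each text as a set of character bigrams, the 0.8 threshold, and the greedy keep-first strategy are assumptions for illustration; the application does not state how texts are turned into sets or which threshold is used.

```python
def char_ngrams(text, n=2):
    """Represent a text as the set of its character n-grams."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.8, n=2):
    """Keep a text only if it is below the similarity threshold against every kept text."""
    kept, kept_sets = [], []
    for text in texts:
        grams = char_ngrams(text, n)
        if all(jaccard(grams, other) < threshold for other in kept_sets):
            kept.append(text)
            kept_sets.append(grams)
    return kept
```

This pairwise comparison is quadratic in the number of texts; for a large corpus an approximate scheme such as MinHash would likely be needed, which is outside what the application describes.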

Step S402: delete the noisy text in the medical corpus data according to a regular-expression matching algorithm to obtain the training corpus data.
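A minimal sketch of step S402 follows. The application does not say which patterns count as noise, so the three patterns below (URLs, leftover HTML tags, and long runs of one repeated character) are hypothetical examples.

```python
import re

# Hypothetical noise patterns; the application does not list the ones it uses.
NOISE_PATTERNS = [
    re.compile(r"https?://\S+"),  # URLs
    re.compile(r"<[^>]+>"),       # leftover HTML tags
    re.compile(r"(.)\1{5,}"),     # the same character six or more times in a row
]


def is_noisy(text):
    """A text is treated as noisy if any noise pattern matches anywhere in it."""
    return any(pattern.search(text) for pattern in NOISE_PATTERNS)


def remove_noisy_texts(texts):
    """Delete noisy texts and return the remaining texts in their original order."""
    return [text for text in texts if not is_noisy(text)]
```

Dropping the whole text, rather than deleting only the matched span, follows the wording of the step; stripping only the matched span would be an alternative design.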

Continuing to refer to FIG. 5, a flowchart of yet another specific implementation of the end-to-end model training method provided by Embodiment 1 of the present application is shown. For ease of description, only the parts relevant to the present application are shown.

In some optional implementations of this embodiment, the medical corpus data includes medical question-and-answer information that carries medical question information and medical answer information. The method further includes steps S501 and S502 after step S204 and before step S209, and step S209 includes step S503.

Step S501: construct medical question-and-answer training data from the medical question information, a medical question-and-answer soft prompt, and a medical question-and-answer hard prompt.

Step S502: perform a medical question-and-answer training operation on the initial sequence model, with the medical question-and-answer training data as input data and the medical answer information as label information.

Step S503: take the original sequence model, after it has completed the entity recognition training operation, the tail prediction training operation, and the medical question-and-answer training operation, as the target sequence model.
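Since every task described so far is cast as text in and text out, examples from several tasks can be fed to the same sequence model. The sketch below formats examples for four tasks and shuffles them into one stream. The prompt strings and the shuffling are assumptions: the application does not state whether the tasks are trained one after another or interleaved.

```python
import random

# Hypothetical (soft prompt, hard prompt) pair for each task.
TASK_PROMPTS = {
    "entity_recognition": ("<ner>", "Identify the medical entities in the text:"),
    "tail_prediction": ("<tail>", "Predict the tail entity:"),
    "article_summary": ("<sum>", "Write a title for the article:"),
    "medical_qa": ("<qa>", "Answer the medical question:"),
}


def format_example(task, payload, label):
    """Prefix the task payload with the task's soft and hard prompts."""
    soft, hard = TASK_PROMPTS[task]
    return (f"{soft} {hard} {payload}", label)


def mix_tasks(examples_by_task, seed=0):
    """Merge per-task (payload, label) pairs into one reproducibly shuffled list."""
    mixed = [
        format_example(task, payload, label)
        for task, pairs in examples_by_task.items()
        for payload, label in pairs
    ]
    random.Random(seed).shuffle(mixed)
    return mixed
```

For the medical question-and-answer task the payload is the question and the label is the answer; for the article summary task the payload is the article content and the label is the title.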

Continuing to refer to FIG. 6, a flowchart of a specific implementation of step S202 in FIG. 5 is shown. For ease of description, only the parts relevant to the present application are shown.

In some optional implementations of this embodiment, step S202 specifically includes step S601, step S602, step S603, step S604, and step S605, wherein:

Step S601: Determine whether the medical question-and-answer information contains an ambiguous word;

Step S602: If no ambiguous word exists, use the medical corpus data as the training corpus data;

Step S603: If an ambiguous word exists, obtain the associated text information related to the context of the ambiguous word;

Step S604: Input the associated text information into the semantic analysis model to perform a word sense recognition operation, and obtain the true word sense information of the ambiguous word;

Step S605: Replace the ambiguous word in the medical question-and-answer information with the true word sense information to obtain the training corpus data.
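The control flow of steps S601 to S605 can be sketched as follows. The ambiguity lexicon, the context radius, and the rule-based `toy_resolver` are hypothetical stand-ins; in the application the word sense comes from the trained semantic analysis model, not from a hand-written rule.

```python
from typing import Callable, List, Set


def find_ambiguous(text: str, lexicon: Set[str]) -> List[str]:
    """S601: list the ambiguous words (from a given lexicon) found in the text."""
    return [word for word in sorted(lexicon) if word in text]


def context_window(text: str, word: str, radius: int = 20) -> str:
    """S603: take the characters surrounding the first occurrence of the word."""
    start = text.find(word)
    return text[max(0, start - radius): start + len(word) + radius]


def preprocess(text: str, lexicon: Set[str],
               resolve_sense: Callable[[str, str], str]) -> str:
    hits = find_ambiguous(text, lexicon)
    if not hits:                          # S602: nothing to disambiguate
        return text
    for word in hits:
        context = context_window(text, word)
        sense = resolve_sense(word, context)   # S604: stand-in for the model
        text = text.replace(word, sense)       # S605: substitute the true sense
    return text


def toy_resolver(word: str, context: str) -> str:
    # Hypothetical rule replacing the semantic analysis model.
    return "common cold" if "fever" in context else "low temperature"


print(preprocess("I caught a cold and have a fever", {"cold"}, toy_resolver))
```

Note that `str.replace` substitutes every occurrence with the same sense; a fuller implementation would resolve each occurrence against its own context.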

Continuing to refer to FIG. 7, a flowchart of a specific implementation of obtaining the semantic analysis model provided in Embodiment 1 of the present application is shown. For ease of description, only the parts relevant to the present application are shown.

In some optional implementations of this embodiment, before step S604, the method further includes step S701, step S702, step S703, step S704, step S705, and step S706, wherein:

Step S701: Obtain a sample text from a local database, and determine each word segment contained in the sample text.

In the embodiment of the present application, a plurality of texts may first be obtained from the local database to form a training set; each text in the training set may then be used as a sample text.

In the embodiment of the present application, to determine the word segments contained in the sample text, word segmentation may first be performed on the sample text to obtain each word segment it contains. Any word segmentation method may be used; alternatively, each character in the sample text may be treated as one word segment. It should be understood that these examples of word segmentation are given only for ease of understanding and are not intended to limit the present application.
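The two segmentation options mentioned above can be illustrated with a small sketch. The function name `segment` and its `mode` parameter are assumptions for this example; a production system would more likely rely on a dedicated Chinese word segmenter.

```python
from typing import List


def segment(text: str, mode: str = "char") -> List[str]:
    """Split a sample text into word segments.

    mode="char": every non-space character is one segment (the fallback the
    description explicitly allows). mode="whitespace": split on whitespace.
    """
    if mode == "char":
        return [ch for ch in text if not ch.isspace()]
    if mode == "whitespace":
        return text.split()
    raise ValueError(f"unknown segmentation mode: {mode}")


print(segment("头痛发热"))
print(segment("chest pain", mode="whitespace"))
```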

Step S702: Determine the word vector corresponding to each word segment based on the semantic analysis model to be trained.

In the embodiment of the present application, the semantic analysis model may include at least four layers: a semantic representation layer, an attribute representation layer, an attribute correlation representation layer, and a classification layer.

In the embodiment of the present application, the semantic representation layer includes at least a sub-model for outputting bidirectional semantic representation vectors, such as a BERT (Bidirectional Encoder Representations from Transformers) model. Each word segment may be input into the semantic representation layer of the semantic analysis model, and the bidirectional semantic representation vector output by that layer for each word segment is used as the word vector of that word segment. It should be understood that models other than BERT may also be used to output bidirectional semantic representation vectors; the example here is given only for ease of understanding and is not intended to limit the present application.

Step S703: Obtain a semantic attribute from the local database, and determine a first feature representation vector of the sample text with respect to the semantic attribute according to the attention matrix corresponding to the semantic attribute contained in the semantic analysis model to be trained and the word vector corresponding to each word segment.

In the embodiment of the present application, the word vector corresponding to each word segment may be input into the attribute representation layer of the semantic analysis model. The attention matrix corresponding to the semantic attribute, contained in the attribute representation layer, is used to apply attention weighting to the word vector of each word segment, and the first feature representation vector of the sample text with respect to the semantic attribute is determined from the attention-weighted word vectors.
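A minimal sketch of the attention weighting in step S703 follows. It assumes that the attribute's attention parameters reduce to a single query vector, that scores are dot products normalised with softmax, and that the first feature representation vector is the weighted sum of word vectors; the application does not fix any of these details.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attribute_representation(word_vectors: List[Vector],
                             attribute_query: Vector) -> Vector:
    """Attention-weighted sum of word vectors for one semantic attribute."""
    weights = softmax([dot(attribute_query, h) for h in word_vectors])
    dim = len(word_vectors[0])
    return [sum(w * h[d] for w, h in zip(weights, word_vectors))
            for d in range(dim)]


# Toy word vectors; a zero query gives every word segment the same weight.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = attribute_representation(word_vectors, [0.0, 0.0])
focused = attribute_representation(word_vectors, [5.0, 0.0])
print(uniform, focused)
```

With a non-zero query the weights shift toward the word segments most aligned with the attribute, which is the effect the attribute representation layer is described as producing.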

Step S704: Determine a second feature representation vector of the sample text with respect to the semantic attribute according to the first feature representation vector and the self-attention matrix, contained in the semantic analysis model to be trained, that represents the correlation between different semantic attributes.

In the embodiment of the present application, the first feature representation vector of the sample text with respect to each semantic attribute may be input into the attribute correlation representation layer of the semantic analysis model. The self-attention matrix contained in that layer is used to apply self-attention weighting to the first feature representation vectors, and the second feature representation vector of the sample text with respect to each semantic attribute is determined from the self-attention-weighted first feature representation vectors.
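The attribute-level self-attention of step S704 might look like the sketch below. It assumes the self-attention matrix is a learned K x K table of raw correlation scores between K semantic attributes, that each row is normalised with softmax, and that each second feature representation vector is the resulting mixture of first feature representation vectors. These are illustrative assumptions, not details stated in the application.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attribute_self_attention(first_vectors: Matrix,
                             correlation: Matrix) -> Matrix:
    """Mix the per-attribute vectors using row-normalised correlation scores."""
    dim = len(first_vectors[0])
    second_vectors = []
    for row in correlation:                  # one row per semantic attribute
        weights = softmax(row)
        second_vectors.append(
            [sum(w * vec[d] for w, vec in zip(weights, first_vectors))
             for d in range(dim)]
        )
    return second_vectors


first = [[1.0, 0.0], [0.0, 1.0]]             # two attributes, 2-d vectors
correlation = [[0.0, 0.0], [0.0, 0.0]]       # equal scores: plain averaging
second = attribute_self_attention(first, correlation)
print(second)
```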

Step S705: Determine the classification result output by the semantic analysis model to be trained according to that model and the second feature representation vector. The classification result includes the semantic attribute to which the sample text belongs and the sentiment polarity corresponding to that semantic attribute.

In the embodiment of the present application, the classification layer includes at least a hidden layer, a fully connected layer, and a softmax layer.

In the embodiment of the present application, the second feature representation vector of the sample text with respect to each semantic attribute may be input in turn into the hidden layer, the fully connected layer, and the softmax layer of the classification layer. The sample text is classified according to each second feature representation vector and the classification parameters corresponding to each semantic attribute contained in these three layers, and the classification result output by the classification layer is obtained.

In the embodiment of the present application, the classification result includes at least the semantic attribute to which the sample text belongs and the sentiment polarity of the sample text on that semantic attribute.

In the embodiment of the present application, the sentiment polarity may be quantified numerically. For example, the closer the value is to 1, the more positive the sentiment polarity; the closer the value is to -1, the more negative the sentiment polarity; a value close to 0 indicates that the sentiment polarity tends to be neutral.
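One way to picture the classification layer and the numeric polarity is sketched below. The tanh activation, the three classes (negative, neutral, positive), and the polarity score computed as P(positive) minus P(negative) are all assumptions chosen so that the score falls in [-1, 1]; the application only states that the layer contains a hidden layer, a fully connected layer, and a softmax layer.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(second_vector: List[float], w_hidden: Matrix,
             w_out: Matrix) -> Tuple[List[float], float]:
    """Hidden layer -> fully connected layer -> softmax for one attribute."""
    hidden = [math.tanh(v) for v in matvec(w_hidden, second_vector)]
    probs = softmax(matvec(w_out, hidden))   # order: negative, neutral, positive
    polarity = probs[2] - probs[0]           # assumed mapping into [-1, 1]
    return probs, polarity


w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_out_neutral = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
w_out_positive = [[-4.0, 0.0], [0.0, 0.0], [4.0, 0.0]]

probs, polarity = classify([1.0, 0.5], w_hidden, w_out_neutral)
_, positive_polarity = classify([1.0, 0.5], w_hidden, w_out_positive)
print(probs, polarity, positive_polarity)
```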

Step S706: Adjust the model parameters of the semantic analysis model according to the classification result and the preset annotation of the sample text to obtain the semantic analysis model.

In the embodiment of the present application, the model parameters to be adjusted include at least the classification parameters described above, and may also include the attention matrix and the self-attention matrix. A conventional training method may be used to adjust the model parameters. That is, the loss corresponding to the classification result (hereinafter referred to as the first loss) is determined directly from the classification result obtained in step S705 and the annotation preset for the sample text, and the model parameters of the semantic analysis model are adjusted with minimization of the first loss as the training objective, thereby completing the training of the semantic analysis model.

In the embodiment of the present application, because a self-attention matrix representing the correlation between different semantic attributes has been added to the semantic analysis model, the semantic analysis model trained with the above conventional method can analyze the semantics of the text to be analyzed more accurately.

It should be emphasized that, to further ensure the privacy and security of the target sequence model, the target sequence model may also be stored in a node of a blockchain.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods. Each data block contains the information of a batch of network transactions, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
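The idea of data blocks linked by cryptographic methods can be shown with a toy hash chain. This is only an in-memory illustration under simplifying assumptions (no consensus, no network, no real blockchain platform), and storing a SHA-256 digest of the serialized model rather than the model itself is a choice made for this example.

```python
import hashlib
import json
from typing import Any, Dict, List

Block = Dict[str, Any]


def block_hash(block: Block) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Block], data: Dict[str, Any]) -> None:
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev_hash, "data": data})


def is_valid(chain: List[Block]) -> bool:
    """Each block must reference the hash of the block before it."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


chain: List[Block] = []
append_block(chain, {"note": "genesis"})
# Placeholder bytes standing in for the serialized target sequence model.
model_digest = hashlib.sha256(b"serialized target sequence model").hexdigest()
append_block(chain, {"model_sha256": model_digest})
print(is_valid(chain))
```

Changing any earlier block alters its hash, so the `prev_hash` stored in the following block no longer matches and validation fails.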

The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. The application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be implemented by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or a random access memory (RAM).

It should be understood that although the steps in the flowcharts of the accompanying drawings are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed one after another, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Embodiment 2

With further reference to FIG. 8, as an implementation of the method shown in FIG. 2, the present application provides an embodiment of an end-to-end model training device applicable to multiple Chinese medical language processing tasks. This device embodiment corresponds to the method embodiment shown in FIG. 2, and the device can be applied to various electronic devices.

As shown in FIG. 8, the end-to-end model training device 200 of this embodiment, applicable to multiple Chinese medical language processing tasks, includes: a data acquisition module 201, a preprocessing module 202, an entity matching module 203, a model creation module 204, an entity recognition data construction module 205, an entity recognition training module 206, a tail prediction data construction module 207, a tail prediction training module 208, and a model confirmation module 209, wherein:

The data acquisition module 201 is configured to acquire medical corpus data corresponding to the medical field;

The preprocessing module 202 is configured to perform a preprocessing operation on the medical corpus data to obtain training corpus data;

The entity matching module 203 is configured to perform an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity;

The model creation module 204 is configured to create an initial sequence model based on the mT5-small model of the Seq2seq framework;

The entity recognition data construction module 205 is configured to construct entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt;

The entity recognition training module 206 is configured to perform an entity recognition training operation on the initial sequence model, using the entity recognition training data as input data and the training corpus entities as label information;

The tail prediction data construction module 207 is configured to construct tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt;

The tail prediction training module 208 is configured to perform a tail prediction training operation on the initial sequence model, using the tail prediction training data as input data and the tail entity as label information;

The model confirmation module 209 is configured to take the original sequence model that has completed the entity recognition training operation and the tail prediction training operation as the target sequence model.

In the embodiment of the present application, the training data described above is created by using a special token as the soft prompt and a task description as the hard prompt, that is, a human-readable prompt composed of concrete Chinese or English words.

In some optional implementations of this embodiment, external knowledge is introduced to improve the accuracy of the model when generating medical text, and two steps are designed to strengthen the model's ability to use that knowledge. 1: Knowledge selection: the model is trained so that its input is a text and its output is the triples in the knowledge graph (KG) that are related to the text. 2: Knowledge infusion (knowledge indorsation): the relevant knowledge obtained from the knowledge graph is concatenated with the dialogue, and the two together serve as the model input for generating the reply.
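The two knowledge steps can be sketched as follows. The toy triples, the `knowledge:` / `dialogue:` input template, and the use of plain substring matching are assumptions for illustration; in the application, knowledge selection is performed by the trained model itself rather than by string matching.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Toy knowledge graph for illustration only.
KNOWLEDGE_GRAPH: List[Triple] = [
    ("anemia", "symptom", "fatigue"),
    ("anemia", "treatment", "iron supplement"),
    ("asthma", "symptom", "wheezing"),
]


def select_knowledge(text: str, graph: List[Triple]) -> List[Triple]:
    """Knowledge selection: keep triples whose head entity occurs in the text."""
    return [triple for triple in graph if triple[0] in text]


def build_model_input(dialogue: str, triples: List[Triple]) -> str:
    """Knowledge infusion: concatenate the selected triples with the dialogue."""
    knowledge = " ; ".join(f"{h} | {r} | {t}" for h, r, t in triples)
    return f"knowledge: {knowledge} dialogue: {dialogue}"


dialogue = "I was diagnosed with anemia. What should I do?"
selected = select_knowledge(dialogue, KNOWLEDGE_GRAPH)
model_input = build_model_input(dialogue, selected)
print(model_input)
```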

In the embodiment of the present application, an end-to-end model training device 200 applicable to multiple Chinese medical language processing tasks is provided, including: a data acquisition module 201, configured to acquire medical corpus data corresponding to the medical field; a preprocessing module 202, configured to perform a preprocessing operation on the medical corpus data to obtain training corpus data; an entity matching module 203, configured to perform an entity matching operation on the training corpus data to obtain training corpus entities, wherein the training corpus entities include a head entity, an entity relation, and a tail entity; a model creation module 204, configured to create an initial sequence model based on the mT5-small model of the Seq2seq framework; an entity recognition data construction module 205, configured to construct entity recognition training data from the training corpus data, an entity recognition soft prompt, and an entity recognition hard prompt; an entity recognition training module 206, configured to perform an entity recognition training operation on the initial sequence model, using the entity recognition training data as input data and the training corpus entities as label information; a tail prediction data construction module 207, configured to construct tail prediction training data from the head entity, the entity relation, a tail entity prediction soft prompt, and a tail entity prediction hard prompt; a tail prediction training module 208, configured to perform a tail prediction training operation on the initial sequence model, using the tail prediction training data as input data and the tail entity as label information; and a model confirmation module 209, configured to take the original sequence model that has completed the entity recognition training operation and the tail prediction training operation as the target sequence model. Compared with the prior art, the present application creates an initial sequence model based on the mT5-small model of the Seq2seq framework and pre-trains it on a large amount of medical corpus data for the entity recognition task and the tail prediction task, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, which effectively improves accuracy on multiple Chinese medical language processing tasks.

Continuing to refer to FIG. 9, a schematic structural diagram of another specific implementation of the end-to-end model training device applicable to multiple Chinese medical language processing tasks, provided in Embodiment 2 of the present application, is shown. For ease of description, only the parts relevant to the present application are shown.

In some optional implementations of this embodiment, the end-to-end model training device 200 applicable to multiple Chinese medical language processing tasks further includes an article summary data construction module 210 and an article summary training module 211, and the model confirmation module 209 includes a first model confirmation submodule 2091, wherein:

The article summary data construction module 210 is configured to construct article summary training data from the article content, an article summary soft prompt, and an article summary hard prompt;

The article summary training module 211 is configured to perform an article summary training operation on the initial sequence model, using the article summary training data as input data and the article title as label information;

The first model confirmation submodule 2091 is configured to take the original sequence model that has completed the entity recognition training operation, the tail prediction training operation, and the article summary training operation as the target sequence model.

In some optional implementations of this embodiment, the preprocessing module 202 includes a deduplication submodule and a deletion submodule, wherein:

The deduplication submodule is configured to remove similar texts from the medical corpus data according to the Jaccard similarity algorithm;

The deletion submodule is configured to delete noisy texts from the medical corpus data according to a regular-expression matching algorithm to obtain the training corpus data.
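The two preprocessing submodules can be sketched as below. The character-set form of Jaccard similarity, the 0.8 threshold, and the noise pattern (URLs and HTML-like tags) are assumptions for this example; the application names the algorithms but gives no thresholds or patterns.

```python
import re
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the character sets of two texts."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    for text in texts:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


# Hypothetical noise pattern: URLs and HTML-like tags.
NOISE_PATTERN = re.compile(r"https?://\S+|<[^>]+>")


def drop_noisy(texts: List[str]) -> List[str]:
    return [text for text in texts if not NOISE_PATTERN.search(text)]


corpus = ["头痛怎么办", "头痛怎么办?", "visit http://example.com now", "咳嗽吃什么药"]
training_corpus = drop_noisy(deduplicate(corpus))
print(training_corpus)
```

The pairwise comparison is quadratic in the corpus size; for a large corpus an approximate method such as MinHash would be the usual substitute, which is a different technique from the plain Jaccard comparison shown here.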

In some optional implementations of this embodiment, the end-to-end model training device 200 applicable to multiple Chinese medical language processing tasks further includes a medical question-and-answer data construction module and a medical question-and-answer training module, and the model confirmation module 209 includes a second model confirmation submodule, wherein:

The medical question-and-answer data construction module is configured to construct medical question-and-answer training data from the medical question information, a medical question-and-answer soft prompt, and a medical question-and-answer hard prompt;

The medical question-and-answer training module is configured to perform a medical question-and-answer training operation on the initial sequence model, using the medical question-and-answer training data as input data and the medical answer information as label information;

The second model confirmation submodule is configured to take the original sequence model that has completed the entity recognition training operation, the tail prediction training operation, and the medical question-and-answer training operation as the target sequence model.

In some optional implementations of this embodiment, the preprocessing module 202 includes an ambiguous word determination submodule, an ambiguity denial submodule, an ambiguity confirmation submodule, a true word sense acquisition submodule, and a word replacement submodule, wherein:

The ambiguous word determination submodule is configured to determine whether the medical question-and-answer information contains an ambiguous word;

The ambiguity denial submodule is configured to use the medical corpus data as the training corpus data if no ambiguous word exists;

The ambiguity confirmation submodule is configured to obtain the associated text information related to the context of the ambiguous word if an ambiguous word exists;

The true word sense acquisition submodule is configured to input the associated text information into the semantic analysis model to perform a word sense recognition operation and obtain the true word sense information of the ambiguous word;

The word replacement submodule is configured to replace the ambiguous word in the medical question-and-answer information with the true word sense information to obtain the training corpus data.

In some optional implementations of this embodiment, the preprocessing module 202 further includes a word segment determination module, a word vector determination module, a first feature representation vector determination module, a second feature representation vector determination module, a classification result determination module, and a model acquisition module, wherein:

The word segment determination module is configured to obtain a sample text from a local database and determine each word segment contained in the sample text;

The word vector determination module is configured to determine the word vector corresponding to each word segment based on the semantic analysis model to be trained;

The first feature representation vector determination module is configured to obtain a semantic attribute from the local database, and to determine a first feature representation vector of the sample text with respect to the semantic attribute according to the attention matrix corresponding to the semantic attribute contained in the semantic analysis model to be trained and the word vector corresponding to each word segment;

The second feature representation vector determination module is configured to determine a second feature representation vector of the sample text with respect to the semantic attribute according to the first feature representation vector and the self-attention matrix, contained in the semantic analysis model to be trained, that represents the correlation between different semantic attributes;

The classification result determination module is configured to determine the classification result output by the semantic analysis model to be trained according to that model and the second feature representation vector, the classification result including the semantic attribute to which the sample text belongs and the sentiment polarity corresponding to that semantic attribute;

The model acquisition module is configured to adjust the model parameters of the semantic analysis model according to the classification result and the preset annotation of the sample text to obtain the semantic analysis model.

为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图10,图10为本实施例计算机设备基本结构框图。In order to solve the above technical problems, the embodiment of the present application further provides computer equipment. Please refer to FIG. 10 for details. FIG. 10 is a block diagram of the basic structure of the computer device in this embodiment.

所述计算机设备300包括通过系统总线相互通信连接存储器310、处理器320、网络接口330。需要指出的是,图中仅示出了具有组件310-330的计算机设备300,但是应理解的是,并不要求实施所有示出的组件,可以替代的实施更多或者更少的组件。其中,本技术领域技术人员可以理解,这里的计算机设备是一种能够按照事先设定或存储的指令,自动进行数值计算和/或信息处理的设备,其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 300 includes a memory 310 , a processor 320 , and a network interface 330 connected to each other through a system bus for communication. It should be noted that only computer device 300 is shown with components 310-330, but it should be understood that implementing all of the illustrated components is not required and that more or fewer components may instead be implemented. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, dedicated Integrated circuit (Application Specific Integrated Circuit, ASIC), programmable gate array (Field-Programmable Gate Array, FPGA), digital processor (Digital Signal Processor, DSP), embedded devices, etc.

所述计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer equipment may be computing equipment such as a desktop computer, a notebook, a palmtop computer, and a cloud server. The computer device can perform human-computer interaction with the user through keyboard, mouse, remote controller, touch panel or voice control device.

The memory 310 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 310 may be an internal storage unit of the computer device 300, such as a hard disk or internal memory of the computer device 300. In other embodiments, the memory 310 may be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 300. The memory 310 may also include both an internal storage unit of the computer device 300 and an external storage device thereof. In this embodiment, the memory 310 is generally used to store the operating system and the various application software installed on the computer device 300, for example the computer-readable instructions of the end-to-end model training method applicable to multiple Chinese medical language processing tasks. In addition, the memory 310 can be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 320 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 320 is generally used to control the overall operation of the computer device 300. In this embodiment, the processor 320 is configured to run the computer-readable instructions stored in the memory 310 or to process data, for example to run the computer-readable instructions of the end-to-end model training method applicable to multiple Chinese medical language processing tasks.

The network interface 330 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 300 and other electronic devices.

The computer device provided by the present application creates an initial sequence model according to the mT5-small model of the Seq2Seq framework and pre-trains it on a large amount of medical corpus data for the entity recognition task and the tail prediction task, so that the pre-trained sequence model can learn medical knowledge that is implicit in other tasks, which effectively improves accuracy on multiple Chinese medical language processing tasks.

The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor, so that the at least one processor performs the steps of the above end-to-end model training method applicable to multiple Chinese medical language processing tasks.

The computer-readable storage medium provided by the present application creates an initial sequence model according to the mT5-small model of the Seq2Seq framework and pre-trains it on a large amount of medical corpus data for the entity recognition task and the tail prediction task, so that the pre-trained sequence model can learn medical knowledge that is implicit in other tasks, which effectively improves accuracy on multiple Chinese medical language processing tasks.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.

Obviously, the embodiments described above are only some of the embodiments of the present application, not all of them. The drawings show preferred embodiments of the present application but do not limit its patent scope. The present application can be implemented in many different forms; these embodiments are provided so that the disclosure of the present application is understood more thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements for some of their technical features. Any equivalent structure made using the contents of the description and drawings of the present application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (10)

1. An end-to-end model training method applicable to multiple Chinese medical language processing tasks, characterized by comprising the following steps:
acquiring medical corpus data corresponding to the medical field;
preprocessing the medical corpus data to obtain training corpus data;
performing an entity matching operation on the training corpus data to obtain a training corpus entity, wherein the training corpus entity comprises a head entity, an entity relationship and a tail entity;
creating an initial sequence model according to the mT5-small model of the Seq2Seq framework;
constructing entity recognition training data according to the training corpus data, an entity recognition soft prompt and an entity recognition hard prompt;
performing an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entity as label information;
constructing tail prediction training data from the head entity, the entity relationship, a tail entity prediction soft prompt and a tail entity prediction hard prompt;
performing a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information;
and taking the initial sequence model that has completed the entity recognition training operation and the tail prediction training operation as a target sequence model.
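As a rough illustration of the two data-construction steps in claim 1, the sketch below builds (input, label) text pairs in plain Python. The claim does not specify the prompt wording, the number of soft-prompt tokens, or how entities are serialized, so the placeholder tokens such as `<ner_0>`, the two hard prompt strings, the `Triple` class, and the helper names are all assumptions made for illustration. Soft prompts appear here only as reserved placeholder tokens whose embeddings would be learned during training, and the entity recognition label is assumed to list the head and tail entities only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one matched (head, relation, tail) entity triple."""
    head: str
    relation: str
    tail: str


def soft_prompt(task: str, n_tokens: int = 4) -> str:
    # Assumption: soft prompts are reserved tokens with trainable embeddings.
    return " ".join(f"<{task}_{i}>" for i in range(n_tokens))


# Assumption: hard prompts are fixed natural-language instructions.
NER_HARD_PROMPT = "Identify the medical entities in the text:"
TAIL_HARD_PROMPT = "Predict the tail entity:"


def build_entity_recognition_example(text: str, triples: List[Triple]) -> Tuple[str, str]:
    """Input: soft prompt + hard prompt + corpus text. Label: matched entities."""
    source = f"{soft_prompt('ner')} {NER_HARD_PROMPT} {text}"
    entities: List[str] = []
    for triple in triples:
        for entity in (triple.head, triple.tail):
            if entity not in entities:  # keep triple order, drop repeats
                entities.append(entity)
    return source, "; ".join(entities)


def build_tail_prediction_example(triple: Triple) -> Tuple[str, str]:
    """Input: soft prompt + hard prompt + head and relation. Label: tail entity."""
    source = f"{soft_prompt('tail')} {TAIL_HARD_PROMPT} {triple.head} | {triple.relation}"
    return source, triple.tail
```

Each pair could then be tokenized and fed to a sequence-to-sequence model as source and target text; that step depends on the training framework and is left out here.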
2. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 1, wherein the medical corpus data further includes medical article data carrying article titles and article contents, and wherein after the step of creating an initial sequence model according to the mT5-small model of the Seq2Seq framework, and before the step of taking the initial sequence model that has completed the entity recognition training operation and the tail prediction training operation as a target sequence model, the method further includes the steps of:
constructing article summary training data according to the article content, an article summary soft prompt and an article summary hard prompt;
performing an article summary training operation on the initial sequence model, with the article summary training data as input data and the article titles as label information;
the step of taking the initial sequence model that has completed the entity recognition training operation and the tail prediction training operation as a target sequence model specifically includes the following step:
taking the initial sequence model that has completed the entity recognition training operation, the tail prediction training operation and the article summary training operation as the target sequence model.
3. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 1, wherein the step of preprocessing the medical corpus data to obtain training corpus data specifically comprises the steps of:
performing a similar-text deduplication operation on the medical corpus data according to a Jaccard similarity algorithm;
and deleting texts with heavy noise from the medical corpus data according to a regular expression matching algorithm to obtain the training corpus data.
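The two preprocessing steps of claim 3 can be sketched as follows. The claim names only the Jaccard similarity algorithm and regular expression matching; the character-bigram tokenization, the 0.8 similarity threshold, and the particular noise pattern (URLs, leftover HTML tags, long runs of one character) are assumptions chosen for illustration, not values taken from the patent.

```python
import re
from typing import List, Set

# Assumption: "noisy" text contains URLs, HTML tags, or 6+ repeats of one character.
NOISE_PATTERN = re.compile(r"https?://\S+|<[^>]+>|(.)\1{5,}")


def char_bigrams(text: str) -> Set[str]:
    """Character bigrams with whitespace removed (works without word segmentation)."""
    compact = re.sub(r"\s+", "", text)
    if len(compact) < 2:
        return {compact} if compact else set()
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is below the threshold against every text kept so far."""
    kept: List[str] = []
    kept_sets: List[Set[str]] = []
    for text in texts:
        grams = char_bigrams(text)
        if all(jaccard(grams, seen) < threshold for seen in kept_sets):
            kept.append(text)
            kept_sets.append(grams)
    return kept


def drop_noisy(texts: List[str]) -> List[str]:
    """Delete texts that match the noise pattern."""
    return [t for t in texts if not NOISE_PATTERN.search(t)]


def preprocess(texts: List[str]) -> List[str]:
    return drop_noisy(deduplicate(texts))
```

The pairwise comparison is quadratic in the number of texts; for a large corpus an approximate method such as MinHash would likely be needed, which the claim does not discuss.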
4. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 1, wherein the medical corpus data includes medical question-answer information carrying medical question information and medical answer information, and after the step of creating an initial sequence model according to the mT5-small model of the Seq2Seq framework, and before the step of taking the initial sequence model that has completed the entity recognition training operation and the tail prediction training operation as a target sequence model, the method further comprises the steps of:
constructing medical question-answer training data according to the medical question information, a medical question-answer soft prompt and a medical question-answer hard prompt;
performing a medical question-answer training operation on the initial sequence model, with the medical question-answer training data as input data and the medical answer information as label information;
the step of taking the initial sequence model that has completed the entity recognition training operation and the tail prediction training operation as a target sequence model specifically includes the following step:
taking the initial sequence model that has completed the entity recognition training operation, the tail prediction training operation and the medical question-answer training operation as the target sequence model.
5. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 4, wherein the step of preprocessing the medical corpus data to obtain training corpus data specifically comprises the steps of:
determining whether the medical question-answer information contains an ambiguous word;
if no ambiguous word exists, taking the medical corpus data as the training corpus data;
if an ambiguous word exists, acquiring associated text information from the context of the ambiguous word;
inputting the associated text information into a semantic analysis model to perform a word sense recognition operation, to obtain actual word sense information of the ambiguous word;
and replacing the ambiguous word in the medical question-answer information with the actual word sense information to obtain the training corpus data.
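The control flow of claim 5 can be sketched as below. The claim does not say how ambiguous words are detected or how much context is passed to the semantic analysis model, so the lexicon lookup, the fixed-width character window, and the `toy_sense_model` stand-in (a keyword rule, not a trained model) are hypothetical placeholders.

```python
from typing import Callable, Iterable, List

# Assumption: the semantic analysis model maps (word, context) to a word sense string.
SenseModel = Callable[[str, str], str]


def find_ambiguous(text: str, lexicon: Iterable[str]) -> List[str]:
    """Return the lexicon entries that occur in the text (simple substring lookup)."""
    return [term for term in sorted(lexicon) if term in text]


def associated_context(text: str, term: str, width: int = 40) -> str:
    """Return up to `width` characters on each side of the first occurrence of term."""
    idx = text.find(term)
    start = max(0, idx - width)
    end = min(len(text), idx + len(term) + width)
    return text[start:end]


def resolve_ambiguity(text: str, lexicon: Iterable[str], sense_model: SenseModel) -> str:
    """Replace each ambiguous word with the sense returned by the model."""
    terms = find_ambiguous(text, lexicon)
    if not terms:
        return text  # no ambiguous word: use the text unchanged
    for term in terms:
        context = associated_context(text, term)
        sense = sense_model(term, context)
        text = text.replace(term, sense)
    return text


def toy_sense_model(term: str, context: str) -> str:
    # Hypothetical stand-in for a trained semantic analysis model.
    if term == "cold" and ("cough" in context or "fever" in context):
        return "common cold"
    return term
```

For instance, `resolve_ambiguity("I caught a cold and have a cough", {"cold"}, toy_sense_model)` rewrites "cold" to "common cold", while a text containing no lexicon entry is returned unchanged.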
6. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 5, wherein before the step of inputting the associated text information into a semantic analysis model to perform a word sense recognition operation to obtain actual word sense information of the ambiguous word, the method further comprises:
obtaining a sample text from a local database, and determining each word segment contained in the sample text;
determining a word vector corresponding to each word segment based on a semantic analysis model to be trained;
obtaining semantic attributes from the local database, and determining a first feature representation vector of the sample text related to the semantic attributes according to an attention matrix corresponding to the semantic attributes in the semantic analysis model to be trained and the word vector corresponding to each word segment;
determining a second feature representation vector of the sample text related to the semantic attributes according to the first feature representation vector and a self-attention matrix that is contained in the semantic analysis model to be trained and represents the correlation among different semantic attributes;
determining a classification result output by the semantic analysis model to be trained according to the semantic analysis model to be trained and the second feature representation vector, wherein the classification result comprises the semantic attribute to which the sample text belongs and the sentiment polarity corresponding to that semantic attribute;
and adjusting the model parameters of the semantic analysis model to be trained according to the classification result and the preset annotation of the sample text, to obtain the semantic analysis model.
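One possible reading of the two feature representation steps in claim 6 is sketched below in plain Python. The claim gives no formulas, so the choices here are assumptions: each semantic attribute has one query vector (a row of the attention matrix), attention weights come from a softmax over dot products with the word vectors, and the self-attention matrix is a table of attribute-to-attribute scores that is softmax-normalized per row. The classifier and the parameter update are omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]


def first_representation(word_vectors: List[Vector],
                         attribute_queries: List[Vector]) -> List[Vector]:
    """One vector per semantic attribute: attention-weighted sum of the word vectors."""
    reps = []
    for query in attribute_queries:
        weights = softmax([dot(query, w) for w in word_vectors])
        reps.append(weighted_sum(weights, word_vectors))
    return reps


def second_representation(first_reps: List[Vector],
                          attribute_scores: List[List[float]]) -> List[Vector]:
    """Mix the per-attribute vectors using attribute-to-attribute attention scores."""
    reps = []
    for row in attribute_scores:
        weights = softmax(row)
        reps.append(weighted_sum(weights, first_reps))
    return reps
```

In this reading the first step lets each attribute focus on the words relevant to it, and the second step lets correlated attributes share evidence before classification.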
7. An end-to-end model training device applicable to multiple Chinese medical language processing tasks, comprising:
a data acquisition module, configured to acquire medical corpus data corresponding to the medical field;
a preprocessing module, configured to preprocess the medical corpus data to obtain training corpus data;
an entity matching module, configured to perform an entity matching operation on the training corpus data to obtain a training corpus entity, wherein the training corpus entity comprises a head entity, an entity relationship and a tail entity;
a model creation module, configured to create an initial sequence model according to the mT5-small model of the Seq2Seq framework;
an entity recognition data construction module, configured to construct entity recognition training data according to the training corpus data, an entity recognition soft prompt and an entity recognition hard prompt;
an entity recognition training module, configured to perform an entity recognition training operation on the initial sequence model, with the entity recognition training data as input data and the training corpus entity as label information;
a tail prediction data construction module, configured to construct tail prediction training data from the head entity, the entity relationship, a tail entity prediction soft prompt and a tail entity prediction hard prompt;
a tail prediction training module, configured to perform a tail prediction training operation on the initial sequence model, with the tail prediction training data as input data and the tail entity as label information;
and a model confirmation module, configured to take the initial sequence model that has completed the entity recognition training operation and the tail prediction training operation as a target sequence model.
8. The end-to-end model training device applicable to multiple Chinese medical language processing tasks according to claim 7, further comprising an article summary data construction module and an article summary training module, the model confirmation module comprising a first model confirmation submodule, wherein:
the article summary data construction module is configured to construct article summary training data according to the article content, an article summary soft prompt and an article summary hard prompt;
the article summary training module is configured to perform an article summary training operation on the initial sequence model, with the article summary training data as input data and the article titles as label information;
and the first model confirmation submodule is configured to take the initial sequence model that has completed the entity recognition training operation, the tail prediction training operation and the article summary training operation as the target sequence model.
9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the steps of the end-to-end model training method applicable to multiple Chinese medical language processing tasks according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the end-to-end model training method applicable to multiple Chinese medical language processing tasks according to any one of claims 1 to 6.
CN202210981217.XA 2022-08-16 2022-08-16 End-to-end model training method and device, computer equipment and storage medium Pending CN115438149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210981217.XA CN115438149A (en) 2022-08-16 2022-08-16 End-to-end model training method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210981217.XA CN115438149A (en) 2022-08-16 2022-08-16 End-to-end model training method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115438149A true CN115438149A (en) 2022-12-06

Family

ID=84243294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210981217.XA Pending CN115438149A (en) 2022-08-16 2022-08-16 End-to-end model training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115438149A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge graph construction method, device, server and storage medium
WO2021139247A1 (en) * 2020-08-06 2021-07-15 平安科技(深圳)有限公司 Construction method, apparatus and device for medical domain knowledge map, and storage medium
CN113657105A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Method, Apparatus, Equipment and Medium for Medical Entity Extraction Based on Vocabulary Enhancement
CN114385799A (en) * 2021-12-17 2022-04-22 上海交通大学 Medical automatic question-answering method and system based on common sense fusion
CN114780691A (en) * 2022-06-21 2022-07-22 安徽讯飞医疗股份有限公司 Model pre-training and natural language processing method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUE等: "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer", PROCEEDINGS OF THE 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 30 June 2021 (2021-06-30), pages 483 - 498 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024167677A1 (en) * 2023-02-09 2024-08-15 Google Llc Soft knowledge prompts for language models
US12321706B2 (en) 2023-02-09 2025-06-03 Google Llc Soft knowledge prompts for language models
CN116542308A (en) * 2023-03-30 2023-08-04 华为技术有限公司 Task processing method and related equipment thereof
CN116340778A (en) * 2023-05-25 2023-06-27 智慧眼科技股份有限公司 Method for constructing large medical model based on multimodality and related equipment
CN116340778B (en) * 2023-05-25 2023-10-03 智慧眼科技股份有限公司 Multimodal-based medical large model construction method and related equipment
US12086716B1 (en) 2023-05-25 2024-09-10 AthenaEyes CO., LTD. Method for constructing multimodality-based medical large model, and related device thereof
CN116737895A (en) * 2023-06-01 2023-09-12 华为技术有限公司 A data processing method and related equipment
CN116861087A (en) * 2023-07-06 2023-10-10 广州探迹科技有限公司 Customer intelligent recommendation method, system, equipment and medium based on large language model
CN116861087B (en) * 2023-07-06 2025-11-28 广州探迹科技有限公司 Customer intelligent recommendation method, system, equipment and medium based on large language model
CN117435933A (en) * 2023-12-22 2024-01-23 浙江大学 Transformer equipment health evaluation method integrating pre-trained language model and graph
CN117435933B (en) * 2023-12-22 2024-04-16 浙江大学 Transformer equipment health evaluation method integrating pre-training language model and atlas

Similar Documents

Publication Publication Date Title
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN113505601B (en) A method, device, computer equipment and storage medium for constructing positive and negative sample pairs
CN112685565A (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN111639163A (en) Training method of problem generation model, problem generation method and related equipment
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
CN112231569A (en) News recommendation method and device, computer equipment and storage medium
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
WO2019154411A1 (en) Word vector retrofitting method and device
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
WO2022073341A1 (en) Disease entity matching method and apparatus based on voice semantics, and computer device
CN116610784A (en) Insurance business scene question-answer recommendation method and related equipment thereof
CN115757731A (en) Dialogue question rewriting method, device, computer equipment and storage medium
CN117874234A (en) Semantic-based text classification method, device, computer equipment and storage medium
CN114637831A (en) Data query method and related equipment based on semantic analysis
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN113822040A (en) A kind of subjective question scoring method, device, computer equipment and storage medium
CN115376496A (en) Speech recognition method, device, computer equipment and storage medium
CN114048757A (en) A sign language synthesis method, device, computer equipment and storage medium
CN116701593A (en) Chinese question answering model training method and related equipment based on GraphQL
CN115062136B (en) Event disambiguation method based on graph neural network and related equipment thereof
CN116821298A (en) Keyword automatic identification method applied to application information and related equipment
CN115795310A (en) Data enhancement method, device, equipment and storage medium thereof
CN111639164A (en) Question-answer matching method and device of question-answer system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination