CN115438149A - End-to-end model training method and device, computer equipment and storage medium

Info

Publication number: CN115438149A
Application number: CN202210981217.XA
Authority: CN
Inventors: 刘佳瑞; 王世朋; 姚海申
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2022-08-16
Filing date: 2022-08-16
Publication date: 2022-12-06

Abstract

The embodiments of the application belong to the technical field of natural language processing in artificial intelligence, and relate to an end-to-end model training method and device suitable for multiple Chinese medical language processing tasks, computer equipment and a storage medium. The application also relates to blockchain technology: the target sequence model of a user can be stored in a blockchain. In the application, an initial sequence model is established from the mT5-small model of the Seq2Seq framework and is pre-trained on an entity recognition task and a tail prediction task using a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, which effectively improves the accuracy of multiple Chinese medical language processing tasks.

Description

End-to-end model training method and device, computer equipment and storage medium

Technical Field

The application relates to the technical field of natural language processing in artificial intelligence, and in particular to an end-to-end model training method and device suitable for multiple Chinese medical language processing tasks, computer equipment and a storage medium.

Background

For different NLP tasks involving medical knowledge, existing training schemes train a separate model for each task: for example, an NLU task is trained with a BERT model, a text generation task with GPT, and an NER task with an LSTM-based model.

However, the applicant finds that this traditional training mode cannot use information from other tasks in the same field and therefore cannot learn medical knowledge hidden in those tasks. Moreover, applying a large model such as BERT to a task with a small sample size easily causes overfitting. As a result, the prediction accuracy of traditional models for multiple Chinese medical language processing tasks is low.

Disclosure of Invention

The embodiments of the application aim to provide an end-to-end model training method and device, computer equipment and a storage medium suitable for multiple Chinese medical language processing tasks, so as to solve the problem that traditional models for multiple Chinese medical language processing tasks have low prediction accuracy.

In order to solve the above technical problem, an embodiment of the present application provides an end-to-end model training method suitable for multiple Chinese medical language processing tasks, which adopts the following technical solutions:

acquiring medical corpus data corresponding to the medical field;

preprocessing the medical corpus data to obtain training corpus data;

performing an entity matching operation on the training corpus data to obtain a corpus entity, wherein the corpus entity comprises a head entity, an entity relationship and a tail entity;

establishing an initial sequence model according to a mT5-small model of a Seq2Seq framework;

constructing entity recognition training data according to the training corpus data, the entity recognition soft prompt and the entity recognition hard prompt;

taking the entity recognition training data as input data and the corpus entity as label information to perform entity recognition training operation on the initial sequence model;

constructing tail prediction training data by using the head entity, the entity relation, the tail entity prediction soft prompt and the tail entity prediction hard prompt;

taking the tail prediction training data as input data and the tail entity as label information to carry out tail prediction training operation on the initial sequence model;

and taking the initial sequence model after the entity recognition training operation and the tail prediction training operation as a target sequence model.

In order to solve the above technical problem, an embodiment of the present application further provides an end-to-end model training device suitable for multiple Chinese medical language processing tasks, which adopts the following technical solutions:

the data acquisition module is used for acquiring medical corpus data corresponding to the medical field;

the preprocessing module is used for preprocessing the medical corpus data to obtain training corpus data;

the entity matching module is used for performing an entity matching operation on the training corpus data to obtain a corpus entity, wherein the corpus entity comprises a head entity, an entity relationship and a tail entity;

the model creating module is used for creating an initial sequence model according to the mT5-small model of the Seq2Seq framework;

the entity identification data construction module is used for constructing entity identification training data according to the training corpus data, the entity identification soft prompt and the entity identification hard prompt;

the entity recognition training module is used for performing an entity recognition training operation on the initial sequence model by taking the entity recognition training data as input data and the corpus entity as label information;

the tail prediction data construction module is used for constructing tail prediction training data by the head entity, the entity relation, the tail entity prediction soft prompt and the tail entity prediction hard prompt;

the tail prediction training module is used for carrying out tail prediction training operation on the initial sequence model by taking the tail prediction training data as input data and the tail entity as label information;

and the model confirmation module is used for taking the initial sequence model after the entity recognition training operation and the tail prediction training operation are finished as a target sequence model.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:

comprising a memory and a processor, the memory having computer readable instructions stored therein which, when executed by the processor, implement the steps of the end-to-end model training method suitable for multiple Chinese medical language processing tasks as described above.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:

the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the end-to-end model training method suitable for multiple Chinese medical language processing tasks as described above.

The application provides an end-to-end model training method suitable for multiple Chinese medical language processing tasks, which comprises the following steps: acquiring medical corpus data corresponding to the medical field; preprocessing the medical corpus data to obtain training corpus data; performing an entity matching operation on the training corpus data to obtain a corpus entity, wherein the corpus entity comprises a head entity, an entity relationship and a tail entity; establishing an initial sequence model according to the mT5-small model of the Seq2Seq framework; constructing entity recognition training data according to the training corpus data, the entity recognition soft prompt and the entity recognition hard prompt; performing an entity recognition training operation on the initial sequence model with the entity recognition training data as input data and the corpus entity as label information; constructing tail prediction training data from the head entity, the entity relationship, the tail entity prediction soft prompt and the tail entity prediction hard prompt; performing a tail prediction training operation on the initial sequence model with the tail prediction training data as input data and the tail entity as label information; and taking the initial sequence model after the entity recognition training operation and the tail prediction training operation as a target sequence model. Compared with the prior art, the initial sequence model is established according to the mT5-small model of the Seq2Seq framework and is pre-trained on the entity recognition task and the tail prediction task using a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, which effectively improves the accuracy of multiple Chinese medical language processing tasks.

Drawings

In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flowchart of an implementation of an end-to-end model training method suitable for multiple Chinese medical language processing tasks according to an embodiment of the present application;

FIG. 3 is a flowchart of another specific implementation of an end-to-end model training method provided in an embodiment of the present application;

FIG. 4 is a flowchart of one embodiment of step S202 of FIG. 2;

FIG. 5 is a flowchart of another embodiment of an end-to-end model training method according to an embodiment of the present disclosure;

FIG. 6 is a flowchart of one embodiment of step S202 in FIG. 5;

FIG. 7 is a flowchart of a specific implementation of obtaining a semantic analysis model according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an end-to-end model training apparatus suitable for multiple Chinese medical language processing tasks according to a second embodiment of the present application;

FIG. 9 is a schematic structural diagram of another embodiment of an end-to-end model training apparatus suitable for multiple Chinese medical language processing tasks according to the second embodiment of the present application;

FIG. 10 is a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that the end-to-end model training method suitable for multiple Chinese medical language processing tasks provided in the embodiments of the present application is generally executed by a server or a terminal device, and accordingly, the end-to-end model training apparatus suitable for multiple Chinese medical language processing tasks is generally disposed in the server or the terminal device.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Example one

With continuing reference to fig. 2, a flowchart of an implementation of the end-to-end model training method suitable for multiple Chinese medical language processing tasks provided in an embodiment of the present application is shown; for convenience of illustration, only the portions relevant to the present application are shown.

The end-to-end model training method suitable for multiple Chinese medical language processing tasks comprises the following steps:

Step S201: acquiring medical corpus data corresponding to the medical field;

Step S202: preprocessing the medical corpus data to obtain training corpus data;

Step S203: performing an entity matching operation on the training corpus data to obtain a corpus entity, wherein the corpus entity comprises a head entity, an entity relationship and a tail entity;

Step S204: establishing an initial sequence model according to the mT5-small model of the Seq2Seq framework;

Step S205: constructing entity recognition training data according to the training corpus data, the entity recognition soft prompt and the entity recognition hard prompt;

In the embodiment of the application, a special token is used as the soft prompt, and a task description is used as the hard prompt (that is, a human-readable prompt consisting of specific Chinese or English words) to create the training data.

Step S206: performing an entity recognition training operation on the initial sequence model with the entity recognition training data as input data and the corpus entity as label information;

Step S207: constructing tail prediction training data from the head entity, the entity relationship, the tail entity prediction soft prompt and the tail entity prediction hard prompt;

Step S208: performing a tail prediction training operation on the initial sequence model with the tail prediction training data as input data and the tail entity as label information;

Step S209: taking the initial sequence model after the entity recognition training operation and the tail prediction training operation are finished as a target sequence model.
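As a rough illustration of how the prompt-based training pairs of steps S205 to S208 might be assembled, the following Python sketch builds source/target text pairs for the two tasks. The soft-prompt token spellings, the hard-prompt wording, the separators, and all function names are hypothetical; the application does not specify them. In a real setup the soft tokens would presumably also be registered with the tokenizer so that their embeddings can be trained, which is omitted here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical soft-prompt tokens and hard-prompt texts; the source does not
# state their spelling or how many soft tokens each task uses.
SOFT_PROMPTS = {
    "entity_recognition": ["<ner_0>", "<ner_1>"],
    "tail_prediction": ["<tail_0>", "<tail_1>"],
}
HARD_PROMPTS = {
    "entity_recognition": "Identify the medical entities in the text:",
    "tail_prediction": "Predict the tail entity for the head entity and relation:",
}


@dataclass
class Example:
    source: str  # model input: soft prompt + hard prompt + task content
    target: str  # label text the Seq2Seq model is trained to generate


def build_example(task: str, content: str, label: str) -> Example:
    """Prefix the content with the task's soft and hard prompts."""
    soft = "".join(SOFT_PROMPTS[task])
    return Example(source=f"{soft} {HARD_PROMPTS[task]} {content}", target=label)


def entity_recognition_example(text: str, entities: List[str]) -> Example:
    # Input: training corpus text; label: the matched corpus entities.
    return build_example("entity_recognition", text, "; ".join(entities))


def tail_prediction_example(head: str, relation: str, tail: str) -> Example:
    # Input: head entity and entity relationship; label: tail entity.
    return build_example("tail_prediction", f"{head} | {relation}", tail)
```

Because both tasks are reduced to text-to-text pairs, the same sequence model can be trained on them without task-specific output heads, which appears to be the point of using a Seq2Seq model here.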

In some optional implementations of this embodiment, external knowledge is introduced to improve the accuracy of the medical text generated by the model, and two steps are designed to strengthen the model's ability to use that knowledge. 1: knowledge selection, in which the model is trained with a text as input and the triples related to that text in a knowledge graph (KG) as output; 2: knowledge infusion, in which the relevant knowledge obtained from the knowledge graph is spliced together with the dialogue as the input of the model to generate a reply.
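The two knowledge steps can be pictured as two kinds of text pairs. The sketch below is only an assumption about the data layout: the toy knowledge graph, the substring matching used to pick related triples (a stand-in for however the labels are actually produced), the `knowledge:` and `dialogue:` markers, and all function names are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Toy knowledge graph; the entries are illustrative only.
KG: List[Triple] = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "contraindicated with", "stomach ulcer"),
    ("insulin", "treats", "diabetes"),
]


def related_triples(text: str, kg: List[Triple]) -> List[Triple]:
    """Substring-matching stand-in for finding triples related to a text."""
    lowered = text.lower()
    return [t for t in kg if t[0] in lowered or t[2] in lowered]


def linearize(triples: List[Triple]) -> str:
    """Flatten triples into a single string a Seq2Seq model can read or emit."""
    return " ; ".join(f"{h} | {r} | {t}" for h, r, t in triples)


def knowledge_selection_pair(text: str, kg: List[Triple]) -> Tuple[str, str]:
    # Step 1: input is the text, output is the related triples.
    return text, linearize(related_triples(text, kg))


def knowledge_infusion_input(dialogue: str, triples: List[Triple]) -> str:
    # Step 2: splice the selected knowledge with the dialogue as model input.
    return f"knowledge: {linearize(triples)} dialogue: {dialogue}"
```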

In an embodiment of the present application, an end-to-end model training method suitable for multiple Chinese medical language processing tasks is provided, including: acquiring medical corpus data corresponding to the medical field; preprocessing the medical corpus data to obtain training corpus data; performing an entity matching operation on the training corpus data to obtain a corpus entity, wherein the corpus entity comprises a head entity, an entity relationship and a tail entity; establishing an initial sequence model according to the mT5-small model of the Seq2Seq framework; constructing entity recognition training data according to the training corpus data, the entity recognition soft prompt and the entity recognition hard prompt; performing an entity recognition training operation on the initial sequence model with the entity recognition training data as input data and the corpus entity as label information; constructing tail prediction training data from the head entity, the entity relationship, the tail entity prediction soft prompt and the tail entity prediction hard prompt; performing a tail prediction training operation on the initial sequence model with the tail prediction training data as input data and the tail entity as label information; and taking the initial sequence model after the entity recognition training operation and the tail prediction training operation as a target sequence model. Compared with the prior art, the initial sequence model is established according to the mT5-small model of the Seq2Seq framework and is pre-trained on the entity recognition task and the tail prediction task using a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, which effectively improves the accuracy of multiple Chinese medical language processing tasks.

Continuing to refer to fig. 3, a flow chart of another specific implementation of the end-to-end model training method provided in the first embodiment of the present application is shown, and for convenience of explanation, only the portion related to the present application is shown.

In some optional implementations of this embodiment, after step S204 and before step S209, the method further includes step S301 and step S302, and step S209 includes step S303.

Step S301: constructing article summary training data according to the article content, the article summary soft prompt and the article summary hard prompt.

Step S302: performing an article summary training operation on the initial sequence model with the article summary training data as input data and the article title as label information.

Step S303: taking the initial sequence model after the entity recognition training operation, the tail prediction training operation and the article summary training operation as a target sequence model.

Continuing to refer to fig. 4, a flowchart of one embodiment of step S202 of fig. 2 is shown, and for ease of illustration, only the portions relevant to the present application are shown.

In some optional implementations of this embodiment, step S202 specifically includes: step S401 and/or step S402, wherein:

Step S401: performing a similar-text deduplication operation on the medical corpus data according to the Jaccard similarity algorithm.

In an embodiment of the present application, the Jaccard similarity algorithm compares the similarity and difference between finite sample sets: for two sets A and B, the Jaccard coefficient is |A ∩ B| / |A ∪ B|. The larger the Jaccard coefficient, the higher the similarity of the samples.

Step S402: deleting noisy text from the medical corpus data according to a regular-expression matching algorithm to obtain the training corpus data.
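A minimal sketch of steps S401 and S402 is given below. It assumes that each text is compared as a set of characters, that texts with a Jaccard coefficient of at least 0.8 count as duplicates, and that URLs, HTML tags and very short strings count as noise; none of these choices is stated in the application, and the function names are hypothetical.

```python
import re
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the character sets of a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    for text in texts:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


# Assumed noise patterns: URLs and leftover HTML tags.
NOISE_PATTERNS = [
    re.compile(r"https?://\S+"),
    re.compile(r"<[^>]+>"),
]


def is_noisy(text: str, min_length: int = 5) -> bool:
    return len(text) < min_length or any(p.search(text) for p in NOISE_PATTERNS)


def preprocess(texts: List[str]) -> List[str]:
    """S401 (deduplication) followed by S402 (noise removal)."""
    return [t for t in deduplicate(texts) if not is_noisy(t)]
```

The pairwise comparison is quadratic in the number of texts, so a corpus of realistic size would likely need an approximate method such as MinHash; that is outside what the application describes.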

Continuing to refer to fig. 5, a flowchart of yet another specific implementation of the end-to-end model training method provided by the embodiment of the present application is shown, and for convenience of illustration, only the portion related to the present application is shown.

In some optional implementations of this embodiment, the medical corpus data includes medical question-answer information carrying medical question information and medical answer information; after step S204 and before step S209, the method further includes step S501 and step S502, and step S209 includes step S503.

Step S501: constructing medical question-answer training data according to the medical question information, the medical question-answer soft prompt and the medical question-answer hard prompt.

Step S502: performing a medical question-answer training operation on the initial sequence model with the medical question-answer training data as input data and the medical answer information as label information.

Step S503: taking the initial sequence model after the entity recognition training operation, the tail prediction training operation and the medical question-answer training operation are finished as a target sequence model.

Continuing to refer to fig. 6, a flowchart of one embodiment of step S202 of fig. 5 is shown, and for ease of illustration, only the portions relevant to the present application are shown.

In some optional implementations of this embodiment, step S202 specifically includes step S601, step S602, step S603, step S604, and step S605, wherein:

Step S601: judging whether the medical question-answer information contains ambiguous words;

Step S602: if no ambiguous word exists, taking the medical corpus data as the training corpus data;

Step S603: if an ambiguous word exists, acquiring associated text information from the context of the ambiguous word;

Step S604: inputting the associated text information into a semantic analysis model to perform a word sense recognition operation, obtaining the real word sense information of the ambiguous word;

Step S605: replacing the ambiguous word in the medical question-answer information with the real word sense information to obtain the training corpus data.
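The control flow of steps S601 to S605 can be sketched as follows. The lexicon lookup used to detect ambiguous words, the fixed-width context window, and the rule-based `toy_sense_model` are all assumptions made for illustration; in the application the word sense comes from the trained semantic analysis model, and the detection method is not specified.

```python
from typing import Callable, List

# Hypothetical lexicon of ambiguous words.
AMBIGUOUS_WORDS: List[str] = ["cold"]


def find_ambiguous(text: str, lexicon: List[str]) -> List[str]:
    """S601: list the ambiguous words that occur in the text."""
    return [w for w in lexicon if w in text]


def context_window(text: str, word: str, width: int = 20) -> str:
    """S603: take the text surrounding the first occurrence of the word."""
    idx = text.find(word)
    start = max(0, idx - width)
    return text[start: idx + len(word) + width]


def disambiguate(
    text: str,
    sense_model: Callable[[str, str], str],
    lexicon: List[str] = AMBIGUOUS_WORDS,
) -> str:
    words = find_ambiguous(text, lexicon)
    if not words:
        return text  # S602: nothing to resolve, use the text as is
    for word in words:
        context = context_window(text, word)
        sense = sense_model(word, context)  # S604: word sense recognition
        text = text.replace(word, sense)    # S605: substitute the real sense
    return text


def toy_sense_model(word: str, context: str) -> str:
    """Rule-based stand-in for the semantic analysis model (not trained)."""
    if word == "cold" and ("cough" in context or "fever" in context):
        return "common cold"
    return word
```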

Continuing to refer to fig. 7, a flowchart of a specific implementation of obtaining a semantic analysis model according to an embodiment of the present application is shown, and for convenience of illustration, only the relevant portions of the present application are shown.

In some optional implementations of this embodiment, before step S604, the method further includes step S701, step S702, step S703, step S704, step S705, and step S706, wherein:

Step S701: acquiring a sample text from the local database, and determining each word segment contained in the sample text.

In this embodiment of the present application, a plurality of texts may be obtained from the local database to form a training set, so that each text in the training set may be used as a sample text.

In this embodiment of the present application, the sample text may first be subjected to word segmentation to obtain each word segment it contains. Any word segmentation method may be adopted; each character in the sample text may also be treated as a word segment. It should be understood that this example of word segmentation is given only for convenience of understanding and does not limit the present application.

Step S702: determining a word vector corresponding to each word segment based on the semantic analysis model to be trained.

In the embodiment of the present application, the semantic analysis model may include at least four layers: a semantic representation layer, an attribute characterization layer, an attribute correlation representation layer and a classification layer.

In the embodiment of the present application, the semantic representation layer includes at least a sub-model for outputting bidirectional semantic representation vectors, such as a BERT (Bidirectional Encoder Representations from Transformers) model. Each word segment can be input into the semantic representation layer of the semantic analysis model, and the bidirectional semantic representation vector output by the semantic representation layer for each word segment is used as the word vector of that word segment. It should be understood that models other than BERT can also output bidirectional semantic representation vectors; the example is given only for convenience of understanding and does not limit the present application.

Step S703: obtaining semantic attributes from the local database, and determining a first feature representation vector of the sample text related to each semantic attribute according to the attention matrix corresponding to the semantic attribute in the semantic analysis model to be trained and the word vector corresponding to each word segment.

In this embodiment of the present application, the word vector corresponding to each word segment may be input to the attribute characterization layer of the semantic analysis model. The attention matrix corresponding to the semantic attribute, which is contained in the attribute characterization layer, is used to attention-weight the word vector of each word segment, and the first feature representation vector of the sample text related to the semantic attribute is determined from the attention-weighted word vectors.

Step S704: determining a second feature representation vector of the sample text related to each semantic attribute according to the first feature representation vector and a self-attention matrix that is contained in the semantic analysis model to be trained and represents the correlation among different semantic attributes.

In the embodiment of the present application, the first feature representation vector of the sample text related to each semantic attribute may be input to the attribute correlation representation layer of the semantic analysis model. The self-attention matrix contained in that layer is used to self-attention-weight the first feature representation vectors, and the second feature representation vector of the sample text related to each semantic attribute is determined from the self-attention-weighted first feature representation vectors.
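The two attention stages of steps S703 and S704 can be illustrated with a small pure-Python sketch. The exact weighting scheme is an assumption, since the application gives no formulas: here each row of the attention matrix acts as a query vector for one semantic attribute, the word vectors are pooled with softmax weights, and each row of the self-attention matrix holds raw attribute-to-attribute scores that are softmax-normalized before mixing the first feature vectors. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def first_features(word_vectors: List[Vector], attention_matrix: List[Vector]) -> List[Vector]:
    """S703 (assumed form): one attention-pooled vector per semantic attribute."""
    features = []
    for attribute_query in attention_matrix:
        weights = softmax([dot(attribute_query, w) for w in word_vectors])
        features.append(weighted_sum(weights, word_vectors))
    return features


def second_features(first: List[Vector], self_attention_matrix: List[Vector]) -> List[Vector]:
    """S704 (assumed form): mix attribute vectors by attribute-to-attribute scores."""
    return [weighted_sum(softmax(row), first) for row in self_attention_matrix]
```

With K semantic attributes, `first_features` returns K vectors and `second_features` expects a K-by-K score matrix, so each second feature vector is a convex combination of all first feature vectors.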

Step S705: determining, according to the semantic analysis model to be trained and the second feature representation vector, a classification result output by the semantic analysis model to be trained, wherein the classification result comprises the semantic attribute to which the sample text belongs and the emotion polarity corresponding to that semantic attribute.

In the embodiment of the application, the classification layer at least comprises a hidden layer, a fully connected layer and a softmax layer.

In the embodiment of the application, the second feature representation vectors of the sample text related to each semantic attribute can be input in turn into the hidden layer, the fully connected layer and the softmax layer of the classification layer. The sample text is classified according to each second feature representation vector and the classification parameters corresponding to each semantic attribute contained in the hidden layer, the fully connected layer and the softmax layer, so that the classification result output by the classification layer is obtained.

In the embodiment of the present application, the classification result at least includes the semantic attribute to which the sample text belongs and the emotion polarity corresponding to that semantic attribute.

In the embodiment of the present application, the emotion polarity may be quantified by a numerical value: the closer the value is to 1, the more positive the emotion polarity; the closer the value is to -1, the more negative the emotion polarity; and the closer the value is to 0, the more neutral the emotion polarity.

Step S706: adjusting the model parameters in the semantic analysis model according to the classification result and the label preset for the sample text to obtain the semantic analysis model.

In the embodiment of the present application, the model parameters to be adjusted at least include the classification parameters described above, and may further include the attention matrix and the self-attention matrix described above. The model parameters in the semantic analysis model can be adjusted with a traditional training method. That is, the loss corresponding to the classification result (hereinafter referred to as the first loss) is determined directly from the classification result obtained in step S705 and the label preset for the sample text, and the model parameters in the semantic analysis model are adjusted with minimization of the first loss as the training target, so as to complete the training of the semantic analysis model.

In the embodiment of the application, because the self-attention matrix for representing the correlation between different semantic attributes is added to the semantic analysis model, the semantic analysis model obtained by training by adopting the traditional training method can analyze the semantics of the text to be analyzed more accurately.
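To make the "first loss" concrete, the sketch below assumes it is a softmax cross-entropy between the classification output and the preset label, which is a common choice but is not stated in the application. For brevity the gradient step is applied directly to the logits; in actual training the gradient would be propagated to the classification parameters, the attention matrix and the self-attention matrix. The three-class polarity layout and all function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability assigned to the labelled class."""
    return -math.log(softmax(logits)[label])


def first_loss(batch_logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over a batch of sample texts."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def grad_wrt_logits(logits: List[float], label: int) -> List[float]:
    """Gradient of the cross-entropy: softmax(logits) minus the one-hot label."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def descent_step(logits: List[float], label: int, lr: float = 0.5) -> List[float]:
    """One gradient-descent step on the logits, for illustration only."""
    grad = grad_wrt_logits(logits, label)
    return [x - lr * g for x, g in zip(logits, grad)]
```

Starting from uniform logits over three assumed polarity classes (negative, neutral, positive), one call to `descent_step` raises the logit of the labelled class and lowers the others, so the loss decreases.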

It is emphasized that, in order to further ensure the privacy and security of the target sequence model, the target sequence model may also be stored in a node of a blockchain.

The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
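The idea of data blocks linked by cryptographic methods can be shown with a toy hash chain. This sketch covers only the linking and tamper detection; it has no consensus mechanism or peer-to-peer network, and recording a SHA-256 digest of the serialized model rather than the model itself is an assumption, as are all names used.

```python
import hashlib
import json
from typing import Any, Dict, List

Block = Dict[str, Any]


def block_hash(block: Block) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Block], data: Dict[str, Any]) -> Block:
    """Add a block that records the hash of its predecessor."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    block = {"index": len(chain), "prev_hash": prev_hash, "data": data}
    chain.append(block)
    return block


def is_valid(chain: List[Block]) -> bool:
    """Check that every block still matches the hash stored by its successor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


def model_digest(model_bytes: bytes) -> str:
    """Fingerprint of a serialized model, suitable for recording in a block."""
    return hashlib.sha256(model_bytes).hexdigest()
```

Changing the data of any earlier block changes its hash, so the `prev_hash` stored in the following block no longer matches and `is_valid` returns `False`.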

The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by computer readable instructions directing the relevant hardware; the instructions can be stored in a computer readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Example two

With further reference to fig. 8, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an end-to-end model training apparatus suitable for multiple Chinese medical language processing tasks. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus can be applied to various electronic devices.

As shown in fig. 8, the end-to-end model training apparatus 200 suitable for multiple Chinese medical language processing tasks of the present embodiment includes: a data acquisition module 201, a preprocessing module 202, an entity matching module 203, a model creation module 204, an entity identification data construction module 205, an entity identification training module 206, a tail prediction data construction module 207, a tail prediction training module 208, and a model confirmation module 209. Wherein:

a data acquisition module 201, configured to acquire medical corpus data corresponding to a medical field;

a preprocessing module 202, configured to perform a preprocessing operation on the medical corpus data to obtain training corpus data;

an entity matching module 203, configured to perform an entity matching operation on the training corpus data to obtain a corpus entity, where the corpus entity includes a head entity, an entity relationship, and a tail entity;

a model creation module 204, configured to create an initial sequence model according to the mT5-small model of the Seq2Seq framework;

an entity identification data construction module 205, configured to construct entity identification training data according to the training corpus data, the entity identification soft prompt, and the entity identification hard prompt;

an entity identification training module 206, configured to perform an entity recognition training operation on the initial sequence model by using the entity identification training data as input data and the corpus entity as label information;

a tail prediction data construction module 207, configured to construct tail prediction training data from the head entity, the entity relationship, the tail entity prediction soft prompt, and the tail entity prediction hard prompt;

a tail prediction training module 208, configured to perform a tail prediction training operation on the initial sequence model by using the tail prediction training data as input data and the tail entity as label information;

and a model confirmation module 209, configured to use the initial sequence model, after the entity recognition training operation and the tail prediction training operation are completed, as the target sequence model.

In the embodiment of the present application, a special token is used as the soft prompt, and a task description is used as the hard prompt (that is, a human-readable prompt composed of specific Chinese or English words).
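The construction of training data from a soft prompt and a hard prompt can be sketched as follows. This is only an illustration under stated assumptions: the special tokens `<ner>` and `<tail>`, the English task descriptions, the `|` separator, and the way the label is serialized as a `; `-joined string are all hypothetical, since the application does not specify them.

```python
# Hypothetical soft prompts: the application only states that a special token is used.
SOFT_PROMPTS = {"ner": "<ner>", "tail": "<tail>"}

# Hypothetical hard prompts: human-readable task descriptions.
HARD_PROMPTS = {
    "ner": "Identify the medical entities in the text:",
    "tail": "Predict the tail entity for the head entity and relation:",
}


def build_ner_example(text, corpus_entities):
    """Entity recognition: prompts + corpus text as input, entities as label."""
    source = f"{SOFT_PROMPTS['ner']} {HARD_PROMPTS['ner']} {text}"
    # Assumption: the label is serialized as a '; '-joined string.
    return {"input": source, "label": "; ".join(corpus_entities)}


def build_tail_example(head, relation, tail):
    """Tail prediction: prompts + head entity and relation as input, tail as label."""
    source = f"{SOFT_PROMPTS['tail']} {HARD_PROMPTS['tail']} {head} | {relation}"
    return {"input": source, "label": tail}


ner = build_ner_example("Aspirin relieves headache.", ["Aspirin", "headache"])
tail = build_tail_example("Aspirin", "treats", "headache")
```

Because both tasks are expressed as text-to-text pairs, the same Seq2Seq model can be trained on them in turn without task-specific output heads.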

In an embodiment of the present application, an end-to-end model training apparatus 200 suitable for multiple Chinese medical language processing tasks is provided, including: a data acquisition module 201, configured to acquire medical corpus data corresponding to a medical field; a preprocessing module 202, configured to perform a preprocessing operation on the medical corpus data to obtain training corpus data; an entity matching module 203, configured to perform an entity matching operation on the training corpus data to obtain a corpus entity, where the corpus entity includes a head entity, an entity relationship, and a tail entity; a model creation module 204, configured to create an initial sequence model according to the mT5-small model of the Seq2Seq framework; an entity identification data construction module 205, configured to construct entity identification training data according to the training corpus data, the entity identification soft prompt, and the entity identification hard prompt; an entity identification training module 206, configured to perform an entity recognition training operation on the initial sequence model by using the entity identification training data as input data and the corpus entity as label information; a tail prediction data construction module 207, configured to construct tail prediction training data from the head entity, the entity relationship, the tail entity prediction soft prompt, and the tail entity prediction hard prompt; a tail prediction training module 208, configured to perform a tail prediction training operation on the initial sequence model by using the tail prediction training data as input data and the tail entity as label information; and a model confirmation module 209, configured to use the initial sequence model, after the entity recognition training operation and the tail prediction training operation are completed, as the target sequence model.
Compared with the prior art, the initial sequence model is established according to the mT5-small model of the Seq2Seq framework, and pre-training is carried out on the entity recognition task and the tail prediction task through a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, and the accuracy of multiple Chinese medical language processing tasks is effectively improved.

Continuing to refer to fig. 9, a schematic structural diagram of another specific implementation of the end-to-end model training apparatus suitable for multiple Chinese medical language processing tasks according to the second embodiment of the present application is shown. For convenience of description, only the portions relevant to the present application are shown.

In some optional implementations of the present embodiment, the above end-to-end model training apparatus 200 suitable for multiple Chinese medical language processing tasks further includes an article summary data construction module 210 and an article summary training module 211, and the model confirmation module 209 includes a first model confirmation submodule 2091, wherein:

an article summary data construction module 210, configured to construct article summary training data according to article content, an article summary soft prompt, and an article summary hard prompt;

the article summary training module 211 is configured to perform an article summary training operation on the initial sequence model by using the article summary training data as input data and the article title as label information;

the first model confirmation submodule 2091 is configured to use the initial sequence model, after the entity recognition training operation, the tail prediction training operation, and the article summary training operation are completed, as the target sequence model.

In some optional implementations of this embodiment, the preprocessing module 202 includes a deduplication submodule and a deletion submodule, wherein:

the deduplication submodule is used for performing a similar-text deduplication operation on the medical corpus data according to the Jaccard similarity algorithm;

and the deletion submodule is used for deleting high-noise text from the medical corpus data according to a regular-expression matching algorithm to obtain the training corpus data.
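The two preprocessing submodules above can be sketched as follows. This is a minimal illustration, not the application's implementation: the use of character bigrams as the Jaccard set elements, the 0.8 similarity threshold, the noise pattern (URLs and HTML tags), and the 0.3 noise-ratio cutoff are all assumptions chosen for the example.

```python
import re

# Hypothetical noise pattern: URLs and HTML tags count as noise.
NOISE_PATTERN = re.compile(r"https?://\S+|<[^>]+>")


def char_ngrams(text, n=2):
    """Set of character n-grams of the text with whitespace removed."""
    text = re.sub(r"\s+", "", text)
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over character n-gram sets."""
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb)


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text already kept."""
    kept = []
    for text in texts:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


def drop_noisy(texts, max_noise_ratio=0.3):
    """Delete texts in which regex-matched noise exceeds the allowed ratio."""
    clean = []
    for text in texts:
        noise = sum(len(m.group(0)) for m in NOISE_PATTERN.finditer(text))
        if text and noise / len(text) <= max_noise_ratio:
            clean.append(text)
    return clean


def preprocess(medical_corpus):
    """Deduplicate similar texts, then remove high-noise texts."""
    return drop_noisy(deduplicate(medical_corpus))


corpus = [
    "Patients with hypertension should eat a low-salt diet.",
    "Patients with hypertension should eat a low-salt diet",
    "<div><a href='x'>click</a></div>",
    "Diabetes requires blood glucose control.",
]
training_corpus = preprocess(corpus)
```

The pairwise comparison in `deduplicate` is quadratic in the number of texts; for a large corpus, an approximate method such as MinHash would be a common substitute, although the application does not mention one.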

In some optional implementations of the present embodiment, the above end-to-end model training apparatus 200 suitable for multiple Chinese medical language processing tasks further includes a medical question-answer data construction module and a medical question-answer training module, and the model confirmation module 209 includes a second model confirmation submodule, wherein:

the medical question-answer data construction module is used for constructing medical question-answer training data according to the medical question-answer information, the medical question-answer soft prompt and the medical question-answer hard prompt;

the medical question-answer training module is used for performing medical question-answer training operation on the initial sequence model by taking the medical question-answer training data as input data and the medical answer information as label information;

and the second model confirmation submodule is used for taking the initial sequence model, after the entity recognition training operation, the tail prediction training operation, and the medical question-answer training operation are completed, as the target sequence model.

In some optional implementations of this embodiment, the preprocessing module 202 includes: an ambiguous vocabulary judging submodule, an ambiguity denial submodule, an ambiguity confirmation submodule, a real word sense acquisition submodule, and a vocabulary replacement submodule, wherein:

the ambiguous vocabulary judging submodule is used for judging whether the medical question-answer information contains ambiguous vocabulary;

the ambiguity denial submodule is used for taking the medical corpus data as the training corpus data if no ambiguous vocabulary exists;

the ambiguity confirmation submodule is used for acquiring associated text information associated with the context of the ambiguous vocabulary if ambiguous vocabulary exists;

the real word sense acquisition submodule is used for inputting the associated text information into a semantic analysis model to perform a word sense recognition operation so as to obtain real word sense information of the ambiguous vocabulary;

and the vocabulary replacement submodule is used for replacing the ambiguous vocabulary in the medical question-answer information with the real word sense information to obtain the training corpus data.
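The branching described above (no ambiguous vocabulary: keep the text; ambiguous vocabulary: fetch context, resolve the sense, replace) can be sketched as follows. The function names, the fixed character window used as "associated text information", and the toy `toy_sense_model` are hypothetical; the toy function only stands in for the trained semantic analysis model, whose interface the application does not define.

```python
def resolve_ambiguity(qa_text, ambiguous_terms, sense_model, window=20):
    """Replace each ambiguous term in qa_text with its resolved word sense."""
    found = [term for term in ambiguous_terms if term in qa_text]
    if not found:
        # No ambiguous vocabulary: the text is used as training corpus data as-is.
        return qa_text
    for term in found:
        pos = qa_text.find(term)
        if pos < 0:
            # The term disappeared after an earlier replacement.
            continue
        # Associated text: a character window around the first occurrence.
        context = qa_text[max(0, pos - window): pos + len(term) + window]
        real_sense = sense_model(term, context)
        # Simplification: every occurrence receives the same resolved sense.
        qa_text = qa_text.replace(term, real_sense)
    return qa_text


def toy_sense_model(term, context):
    """Hypothetical stand-in for the semantic analysis model."""
    if term == "cold" and "fever" in context:
        return "common cold"
    return term


resolved = resolve_ambiguity(
    "I caught a cold and have a fever.", ["cold"], toy_sense_model
)
unchanged = resolve_ambiguity("My blood pressure is high.", ["cold"], toy_sense_model)
```

Passing the model in as a callable keeps the preprocessing flow independent of how the semantic analysis model is actually trained and served.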

In some optional implementations of this embodiment, the preprocessing module 202 further includes: a word segmentation determining module, a word vector determining module, a first feature representation vector determining module, a second feature representation vector determining module, a classification result determining module, and a model acquisition module. Wherein:

the word segmentation determining module is used for acquiring a sample text from a local database and determining each word segment contained in the sample text;

the word vector determining module is used for determining a word vector corresponding to each word segment based on the semantic analysis model to be trained;

the first feature representation vector determining module is used for acquiring semantic attributes from the local database, and determining a first feature representation vector of the sample text related to the semantic attributes according to an attention matrix corresponding to the semantic attributes in the semantic analysis model to be trained and the word vector corresponding to each word segment;

the second feature representation vector determining module is used for determining a second feature representation vector of the sample text related to the semantic attributes according to the first feature representation vector and a self-attention matrix that is contained in the semantic analysis model to be trained and represents the correlation among different semantic attributes;

the classification result determining module is used for determining a classification result output by the semantic analysis model to be trained according to the semantic analysis model to be trained and the second feature representation vector, where the classification result comprises the semantic attribute to which the sample text belongs and the emotion polarity corresponding to that semantic attribute;

and the model acquisition module is used for adjusting model parameters in the semantic analysis model according to the classification result and the preset label of the sample text to obtain the semantic analysis model.
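The two feature-representation steps can be illustrated with a small pure-Python sketch. It rests on assumptions the application does not state: each semantic attribute's attention is modeled as one query vector scored against the word vectors by dot product, the self-attention matrix is modeled as a table of attribute-to-attribute scores normalized by softmax, and the classification and parameter-update steps are omitted. All names and numbers are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights, computed per dimension."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def first_feature_vectors(word_vectors, attribute_queries):
    """One vector per semantic attribute: attention-weighted sum of word vectors."""
    return [
        weighted_sum(softmax([dot(q, w) for w in word_vectors]), word_vectors)
        for q in attribute_queries
    ]


def second_feature_vectors(first_features, attribute_scores):
    """Mix attribute features using softmax-normalized attribute correlations."""
    return [weighted_sum(softmax(row), first_features) for row in attribute_scores]


# Toy data: two word segments, two semantic attributes (all values hypothetical).
word_vectors = [[1.0, 0.0], [0.0, 1.0]]
attribute_queries = [[5.0, 0.0], [0.0, 5.0]]
attribute_scores = [[0.0, 0.0], [0.0, 0.0]]  # equal scores: uniform mixing

first = first_feature_vectors(word_vectors, attribute_queries)
second = second_feature_vectors(first, attribute_scores)
```

In this toy setup each attribute attends mostly to one word segment in the first step, and the second step blends the attribute features according to their correlation scores before classification.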

In order to solve the technical problem, the embodiment of the application further provides computer equipment. Referring to fig. 10, fig. 10 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 300 includes a memory 310, a processor 320, and a network interface 330 communicatively coupled to each other via a system bus. It is noted that only a computer device 300 having components 310-330 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.

The memory 310 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 310 may be an internal storage unit of the computer device 300, such as a hard disk or internal memory of the computer device 300. In other embodiments, the memory 310 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 300. Of course, the memory 310 may also include both an internal storage unit and an external storage device of the computer device 300. In this embodiment, the memory 310 is generally used to store the operating system and the various types of application software installed on the computer device 300, such as the computer readable instructions of the end-to-end model training method suitable for multiple Chinese medical language processing tasks. In addition, the memory 310 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 320 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 320 is generally used to control the overall operation of the computer device 300. In this embodiment, the processor 320 is configured to execute the computer readable instructions stored in the memory 310 or to process data, for example, to execute the computer readable instructions of the end-to-end model training method suitable for multiple Chinese medical language processing tasks.

The network interface 330 may include a wireless network interface or a wired network interface, and the network interface 330 is generally used to establish a communication connection between the computer device 300 and other electronic devices.

According to the computer device, the initial sequence model is established according to the mT5-small model of the Seq2Seq framework, and pre-training is carried out on the entity recognition task and the tail prediction task through a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, and the accuracy of multiple Chinese medical language processing tasks is effectively improved.

The present application further provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the end-to-end model training method suitable for multiple Chinese medical language processing tasks as described above.

According to the computer-readable storage medium, an initial sequence model is established according to the mT5-small model of the Seq2Seq framework, and pre-training is carried out on the entity recognition task and the tail prediction task through a large amount of medical corpus data, so that the pre-trained sequence model can learn medical knowledge hidden in other tasks, and the accuracy of multiple Chinese medical language processing tasks is effectively improved.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, and an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It should be understood that the above-described embodiments are only some, and not all, of the embodiments of the present application, and that the drawings show preferred embodiments without limiting the scope of the appended claims. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of the present application.

Claims

1. An end-to-end model training method suitable for multiple Chinese medical language processing tasks is characterized by comprising the following steps:

acquiring medical corpus data corresponding to the medical field;

preprocessing the medical corpus data to obtain training corpus data;

2. The method of claim 1, wherein the corpus data further includes medical article data carrying article titles and article contents, and wherein after the step of creating an initial sequence model according to the mT5-small model of the Seq2Seq framework, and before the step of using the initial sequence model after the entity recognition training operation and the tail prediction training operation as a target sequence model, the method further includes the steps of:

constructing article summary training data according to the article content, the article summary soft prompt and the article summary hard prompt;

performing an article summary training operation on the initial sequence model by using the article summary training data as input data and the article titles as label information;

the step of using the initial sequence model after the entity recognition training operation and the tail prediction training operation are completed as a target sequence model specifically includes the following step:

and taking the initial sequence model after the entity recognition training operation, the tail prediction training operation and the article summary training operation are completed as the target sequence model.

3. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 1, wherein the step of preprocessing the medical corpus data to obtain training corpus data specifically comprises the steps of:

performing a similar-text deduplication operation on the medical corpus data according to a Jaccard similarity algorithm;

and deleting high-noise text from the medical corpus data according to a regular-expression matching algorithm to obtain the training corpus data.

4. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 1, wherein the medical corpus data includes medical question-answer information carrying medical question information and medical answer information, and after the step of creating an initial sequence model according to the mT5-small model of the Seq2Seq framework, and before the step of using the initial sequence model after the entity recognition training operation and the tail prediction training operation as a target sequence model, the method further comprises the steps of:

constructing medical question-answer training data according to the medical question-answer information, the medical question-answer soft prompt and the medical question-answer hard prompt;

taking the medical question-answer training data as input data and the medical answer information as label information to carry out medical question-answer training operation on the initial sequence model;

and taking the initial sequence model after the entity recognition training operation, the tail prediction training operation and the medical question-answer training operation are completed as the target sequence model.

5. The end-to-end model training method applicable to multiple Chinese medical language processing tasks according to claim 4, wherein the step of preprocessing the medical corpus data to obtain training corpus data specifically comprises the steps of:

judging whether the medical question-answer information has ambiguous vocabularies or not;

if the ambiguous vocabulary does not exist, taking the medical corpus data as the training corpus data;

if the ambiguous vocabulary exists, acquiring associated text information associated with the context of the ambiguous vocabulary;

inputting the associated text information into a semantic analysis model to perform a word sense recognition operation to obtain real word sense information of the ambiguous vocabulary;

and replacing the ambiguous vocabulary in the medical question-answer information with the real word sense information to obtain the training corpus data.

6. The method of claim 5, wherein before the step of inputting the associated text information into a semantic analysis model for word sense recognition to obtain the real word sense information of the ambiguous vocabulary, the method further comprises:

obtaining a sample text from a local database, and determining each word segment contained in the sample text;

determining a word vector corresponding to each word segment based on a semantic analysis model to be trained;

obtaining semantic attributes from the local database, and determining a first feature representation vector of the sample text related to the semantic attributes according to an attention matrix corresponding to the semantic attributes in the semantic analysis model to be trained and the word vector corresponding to each word segment;

determining a second feature representation vector of the sample text related to the semantic attributes according to the first feature representation vector and a self-attention matrix that is contained in the semantic analysis model to be trained and represents the correlation among different semantic attributes;

determining a classification result output by the semantic analysis model to be trained according to the semantic analysis model to be trained and the second feature representation vector, wherein the classification result comprises the semantic attribute to which the sample text belongs and the emotion polarity corresponding to that semantic attribute;

and adjusting model parameters in the semantic analysis model according to the classification result and the preset label of the sample text to obtain the semantic analysis model.

7. An end-to-end model training device suitable for multiple Chinese medical language processing tasks, comprising:

8. The end-to-end model training device suitable for multiple Chinese medical language processing tasks according to claim 7, further comprising an article summary data construction module and an article summary training module, wherein the model confirmation module comprises a first model confirmation submodule, wherein:

the article summary data construction module is used for constructing article summary training data according to the article content, the article summary soft prompt and the article summary hard prompt;

the article summary training module is used for performing an article summary training operation on the initial sequence model by using the article summary training data as input data and the article titles as label information;

and the first model confirmation submodule is used for taking the initial sequence model after the entity recognition training operation, the tail prediction training operation and the article summary training operation are completed as the target sequence model.

9. A computer device comprising a memory and a processor, the memory having computer readable instructions stored therein, wherein the processor, when executing the computer readable instructions, implements the steps of the end-to-end model training method for multiple Chinese medical language processing tasks according to any one of claims 1 to 6.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the end-to-end model training method for multiple Chinese medical language processing tasks according to any one of claims 1 to 6.