CN114528944B - Medical text coding method, device, equipment and readable storage medium - Google Patents

Medical text coding method, device, equipment and readable storage medium

Info

Publication number
CN114528944B
Authority
CN
China
Prior art keywords
icd
classifier
document
word embedding
clinical
Prior art date
Legal status
Active
Application number
CN202210169875.9A
Other languages
Chinese (zh)
Other versions
CN114528944A (en)
Inventor
滕飞
周晓敏
张恩铭
马征
黄路非
李暄
Current Assignee
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202210169875.9A priority Critical patent/CN114528944B/en
Publication of CN114528944A publication Critical patent/CN114528944A/en
Application granted granted Critical
Publication of CN114528944B publication Critical patent/CN114528944B/en

Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a medical text coding method, a device, equipment and a readable storage medium, wherein the method comprises the following steps: acquiring a first document set; generating a word embedding matrix by using a word embedding technology based on the clinical documents; obtaining upper-layer sequence vectors based on the clinical documents, the word embedding matrix and a convolutional neural network; obtaining a sentence vector corresponding to each clinical document based on the upper-layer sequence vectors and the word embedding matrix; obtaining a preliminary classifier based on the sentence vectors corresponding to the clinical documents; and obtaining a final classifier based on the first document set and the sentence vectors corresponding to the clinical documents, and performing ICD coding on the clinical documents by using the final classifier. The invention focuses on rare-disease coding and helps draw coders' attention to rare diseases. It extracts features automatically without relying on manual feature engineering, mitigates the effect of different doctors' writing styles, and can reduce research time and matching errors.

Description

Medical text coding method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of data technologies, and in particular, to a medical text encoding method, apparatus, device, and readable storage medium.
Background
Currently, codes that occur frequently in clinic (which we call frequent codes) account for only a small fraction of the total number of codes, while codes that occur rarely in clinic (which we call few-sample codes) account for a large fraction. For example, after removing the codes that never occur from the medical dataset MIMIC-III, 8,922 codes remain; of these, 5,386 codes occur only 1 to 10 times. The reason for this phenomenon is that there are many rare diseases in clinic, such as childhood progeria, whose incidence is very low. This places a great demand on the coder's knowledge reserves, and the coder also needs to consult relevant reference material to complete the coding, which greatly reduces coding efficiency; meanwhile, the long-tail distribution of ICD codes means that automatic coding is also a significant challenge.
Disclosure of Invention
It is an object of the present invention to provide a medical text encoding method, apparatus, device and readable storage medium to ameliorate the above problems.
In order to achieve the above purpose, the embodiment of the present application provides the following technical solutions:
in one aspect, an embodiment of the present application provides a medical text encoding method, including:
acquiring a first document set, wherein the first document set comprises at least one clinical document;
generating a word embedding matrix by using a word embedding technology based on the clinical document;
obtaining an upper layer sequence vector based on the clinical document, the word embedding matrix and a convolutional neural network;
obtaining sentence vectors corresponding to each clinical document based on the upper-layer sequence vectors and the word embedding matrix;
obtaining a preliminary classifier based on sentence vectors corresponding to each clinical document, wherein the preliminary classifier comprises classifier weights;
and obtaining new classifier weights based on sentence vectors corresponding to the first document set and the clinical documents, replacing the classifier weights with the new classifier weights to obtain a final classifier, and carrying out ICD coding on the clinical documents by using the final classifier.
In a second aspect, an embodiment of the present application provides a medical text encoding device, where the medical text encoding device includes an acquisition module, a first calculation module, a second calculation module, a third calculation module, a fourth calculation module, and a replacement module.
The acquisition module is used for acquiring a first document set, wherein the first document set comprises at least one clinical document;
the first calculation module is used for generating a word embedding matrix by using a word embedding technology based on the clinical document;
the second calculation module is used for obtaining an upper-layer sequence vector based on the clinical document, the word embedding matrix and the convolutional neural network;
the third calculation module is used for obtaining sentence vectors corresponding to each clinical document based on the upper-layer sequence vectors and the word embedding matrix;
a fourth calculation module, configured to obtain a preliminary classifier based on a sentence vector corresponding to each clinical document, where the preliminary classifier includes a classifier weight;
and the replacement module is used for obtaining new classifier weights based on sentence vectors corresponding to the first document set and the clinical documents, replacing the classifier weights with the new classifier weights to obtain a final classifier, and carrying out ICD coding on the clinical documents by using the final classifier.
In a third aspect, embodiments of the present application provide a medical text encoding device comprising a memory and a processor. The memory is used for storing a computer program; the processor is configured to implement the steps of the medical text encoding method described above when executing the computer program.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the medical text encoding method described above.
The beneficial effects of the invention are as follows:
1. In the invention, a meta-network is used to transfer meta-knowledge from data-rich frequent ICD codes to data-poor few-sample ICD codes, which addresses the scarcity of labeled examples for few-sample ICD codes and improves their coding performance. Meanwhile, for the ICD coding task, although a convolutional neural network can learn text-related semantics, clinical documents are extremely long texts that contain not only potentially useful information but also a large amount of irrelevant noise; the invention therefore also adopts a label attention mechanism to capture the parts of the medical record text that are closely related to ICD coding. In addition, the feature representations are used to mitigate the problem of doctors' differing writing styles.
2. The invention extracts features automatically without relying on manual feature engineering, mitigates the effect of different doctors' writing styles, and can reduce research time and matching errors.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a medical text encoding method according to an embodiment of the invention;
FIG. 2 is a schematic view of a medical text encoding device according to an embodiment of the present invention;
fig. 3 is a schematic structural view of a medical text encoding apparatus according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that: like reference numerals or letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Example 1
As shown in fig. 1, the present embodiment provides a medical text encoding method, which includes steps S1, S2, S3, S4, S5, and S6.
S1, acquiring a first document set, wherein the first document set comprises at least one clinical document;
S2, generating a word embedding matrix by using a word embedding technology based on the clinical document;
S3, obtaining an upper-layer sequence vector based on the clinical document, the word embedding matrix and a convolutional neural network;
S4, obtaining sentence vectors corresponding to each clinical document based on the upper-layer sequence vectors and the word embedding matrix;
S5, obtaining a preliminary classifier based on sentence vectors corresponding to each clinical document, wherein the preliminary classifier comprises classifier weights;
S6, obtaining new classifier weights based on sentence vectors corresponding to the first document set and the clinical documents, replacing the classifier weights with the new classifier weights to obtain a final classifier, and carrying out ICD coding on the clinical documents by using the final classifier.
The aim of this embodiment is to address the scarcity of labeled examples for few-sample ICD codes and the high error rate of automatic ICD coding caused by different doctors' differing writing styles in case records, while avoiding dependence on manual features;
therefore, in this embodiment, a meta-network is used to transfer meta-knowledge from data-rich frequent ICD codes to data-poor few-sample ICD codes, which addresses the scarcity of labeled examples for few-sample ICD codes and improves their coding performance. Meanwhile, for the ICD coding task, although a convolutional neural network can learn text-related semantics, clinical documents are extremely long texts that contain not only potentially useful information but also a large amount of irrelevant noise; this embodiment therefore also adopts a label attention mechanism to capture the parts of the medical record text that are closely related to ICD coding. In addition, the feature representations are used to mitigate the problem of doctors' differing writing styles. Furthermore, this embodiment extracts features automatically without relying on manual feature engineering, and can reduce research time and matching errors.
In a specific embodiment of the disclosure, the step S2 may further include a step S21, a step S22, a step S23, and a step S24.
Step S21, acquiring a word embedding dimension d and a preset word, wherein d is a positive integer between 100 and 300;
Step S22, extracting all words in the clinical document and removing duplicates to obtain a first vocabulary;
Step S23, replacing all words that do not appear in the first vocabulary with the preset word to obtain a second vocabulary;
Step S24, randomly initializing a d-dimensional vector for each word in the second vocabulary to obtain the word embedding matrix.
In this embodiment, the word embedding dimension d may be set in a user-defined manner according to the requirement of the user; in this embodiment, the preset word may be "UNK".
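Steps S21 to S24 can be sketched as follows. This is a minimal illustration, assuming random uniform initialisation and toy sample documents; the function name, initialisation range and example text are not taken from the invention:

```python
import numpy as np

def build_embedding_matrix(documents, d=100, unk_token="UNK", seed=0):
    """Steps S21-S24 sketch: deduplicate the words of the clinical documents,
    reserve a preset UNK slot for out-of-vocabulary words, and randomly
    initialise a d-dimensional vector for every word."""
    vocab = sorted({w for doc in documents for w in doc.split()})  # first vocabulary
    vocab.append(unk_token)  # second vocabulary: unseen words map to UNK
    rng = np.random.default_rng(seed)
    embedding = rng.uniform(-0.25, 0.25, size=(len(vocab), d))  # word embedding matrix
    word2id = {w: i for i, w in enumerate(vocab)}
    return word2id, embedding

docs = ["patient reports chest pain", "chest x ray normal"]
word2id, E = build_embedding_matrix(docs, d=100)
print(E.shape)  # (8, 100): 7 unique words plus UNK, each a 100-dim vector
```

At lookup time, any word absent from `word2id` would be routed to the UNK row, mirroring step S23.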
In a specific embodiment of the disclosure, the step S3 may further include a step S31, a step S32, and a step S33.
Step S31, converting each word in the clinical document into a low-dimensional vector to obtain an input feature matrix, wherein the vector of each word in the clinical document is represented by the vector of the corresponding word in the word embedding matrix;
step S32, setting word embedding dimension, filter width and filter output size in a convolutional neural network;
and step S33, learning semantic information of the input feature matrix by using the set convolutional neural network to obtain the upper layer sequence vector.
In this embodiment, representing the vector of each word in the clinical document by the vector of the corresponding word in the word embedding matrix can be understood as follows: for example, if the word "I" appears in the clinical document, the vector corresponding to the word "I" is looked up in the word embedding matrix and then used as the vector for the word "I" in the clinical document;
the word embedding dimension in this embodiment is the same as the word embedding dimension d in step S21;
clinical documents are typically plain-text unstructured data with very long text lengths and a large amount of information irrelevant to ICD coding; a convolutional neural network can extract the local core features of sentences more accurately, and because its convolution kernels are shared, it handles high-dimensional data without undue computational cost.
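Steps S31 to S33 can be sketched as below. This is an illustrative NumPy stand-in for a convolutional layer, assuming a filter width of 3 and 50 output channels (the invention does not fix these values); a real implementation would use a deep-learning framework's Conv1d:

```python
import numpy as np

def conv1d_features(X, k=3, out_channels=50, seed=0):
    """Steps S31-S33 sketch: X is the input feature matrix (n_words x d)
    built by embedding lookup; a width-k filter bank slides over the word
    windows and a non-linearity yields the upper-layer sequence vectors."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((out_channels, k * d)) * 0.01   # shared filter weights
    windows = np.stack([X[i:i + k].ravel() for i in range(n - k + 1)])
    return np.tanh(windows @ W.T)  # one feature vector per k-word window

X = np.random.default_rng(1).standard_normal((20, 100))  # a 20-word document, d=100
H = conv1d_features(X)
print(H.shape)  # (18, 50): 20 - 3 + 1 windows, 50 channels each
```

Because the same filter weights `W` are applied at every position, the parameter count is independent of the document length, which is the kernel-sharing property the paragraph above refers to.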
In a specific embodiment of the disclosure, the step S4 may further include a step S41, a step S42, a step S43, and a step S44.
Step S41, obtaining each ICD code description;
Step S42, lowercasing all words in the ICD code description and deleting stop words to obtain the deleted ICD code description;
Step S43, searching the word embedding matrix for the word embedding vectors corresponding to all words in the deleted ICD code description, and averaging them to obtain an ICD code description vector corresponding to each ICD code description;
Step S44, passing the upper-layer sequence vector and all ICD code description vectors together through a label attention mechanism to obtain a sentence vector corresponding to each clinical document, wherein the sentence vector contains a feature representation of each ICD code.
In this embodiment, the ICD code descriptions are used so that the semantic information of each ICD code can be better extracted; meanwhile, because clinical texts are very long, each document carries multiple labels, and the information relevant to each label may be scattered throughout the document, this embodiment adopts a label attention mechanism, by which the text closely related to the ICD coding information can be further extracted.
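The label attention of step S44 can be sketched as follows, assuming the common dot-product form of label attention (the invention does not spell out the scoring function, so this form is an assumption): each ICD description vector scores every sequence position, a softmax normalises the scores, and the weighted sum gives one label-specific feature per code.

```python
import numpy as np

def label_attention(H, V):
    """Step S44 sketch. H: upper-layer sequence vectors (n positions x dc).
    V: ICD code description vectors (L codes x dc). Returns one attended
    feature vector per ICD code (L x dc)."""
    scores = V @ H.T                                   # (L, n): relevance of each position to each code
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                  # per-label softmax attention weights
    return A @ H                                       # (L, dc): label-specific document features

rng = np.random.default_rng(0)
H = rng.standard_normal((18, 50))   # 18 positions from the CNN, 50-dim
V = rng.standard_normal((5, 50))    # 5 ICD code description vectors
S = label_attention(H, V)
print(S.shape)  # (5, 50)
```

The stacked rows of `S` are what the text calls the sentence vector containing a feature representation of each ICD code.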
In a specific embodiment of the disclosure, the step S5 may further include a step S51, a step S52, and a step S53.
Step S51, obtaining a true value of each ICD code corresponding to the clinical document;
step S52, sentence vectors corresponding to the clinical documents are sequentially subjected to a full connection layer and a sigmoid activation function, so that a predicted value of each ICD code corresponding to the clinical documents is obtained;
and step S53, taking the binary cross entropy of the true value and the predicted value as a target loss function, and minimizing the target loss function based on all the true value and the predicted value to obtain the preliminary classifier, wherein the preliminary classifier comprises the classifier weight, and the classifier weight consists of the classifier weight of each frequent ICD code and the classifier weight of each few-sample ICD code.
In this embodiment, the true value of each ICD code corresponding to a clinical document may be input directly by the user. It can be understood as follows: for a given clinical document, if its ICD codes are a first ICD code and a second ICD code (each being one of the ICD codes), then the values corresponding to the first and second ICD codes are set to 1, and the values corresponding to all other ICD codes are set to 0;
in this embodiment, W = [W_f ; W_s], where W represents the classifier weights, W_f contains the classifier weight of each frequent ICD code, and W_s contains the classifier weights of all the few-sample ICD codes;
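Steps S51 to S53 can be sketched as follows. This is a minimal NumPy illustration of the per-label fully connected layer, sigmoid, and binary cross entropy; the weight shapes and initialisation are assumptions for the sketch, and a real implementation would train `W` and `b` by minimising the loss:

```python
import numpy as np

def predict(S, W, b):
    """Step S52 sketch: per-label linear score from each label-specific
    feature vector, then sigmoid. S: (L, dc), W: (L, dc), b: (L,)."""
    z = (S * W).sum(axis=1) + b
    return 1.0 / (1.0 + np.exp(-z))  # predicted probability per ICD code

def bce_loss(y_true, y_pred, eps=1e-9):
    """Step S53 sketch: binary cross entropy of true vs predicted values."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

rng = np.random.default_rng(0)
S = rng.standard_normal((5, 50))                  # label-specific features, 5 codes
W, b = rng.standard_normal((5, 50)) * 0.01, np.zeros(5)
y = np.array([1, 0, 0, 1, 0], dtype=float)        # multi-hot true values (step S51)
p = predict(S, W, b)
loss = bce_loss(y, p)
print(p.shape, loss > 0)
```

The trained rows of `W` are the classifier weights that the meta-network later partitions into the frequent and few-sample parts.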
in a specific embodiment of the disclosure, the step S6 may further include a step S61, a step S62, a step S63, and a step S64.
Step S61, obtaining the characteristic representation after the average of each frequent ICD code and the characteristic representation after the average of each few-sample ICD code based on sentence vectors corresponding to the first document set and the clinical document;
step S62, mapping the characteristic representation after each frequent ICD code average to the corresponding classifier weight, and obtaining the meta-knowledge through a minimization formula (1), wherein the formula (1) is as follows:
φ* = argmin_φ Σ_{r=1}^{R} L(w_r, φ(f̄_r))    (1)
In formula (1), φ is the meta-knowledge; r is the sequence number of a frequent ICD code; R is the total number of frequent ICD codes; w_r is the classifier weight of the r-th frequent ICD code; f̄_r is the averaged feature representation of the r-th frequent ICD code; and L is the loss function output value;
step S63, based on the meta-knowledge, calculating to obtain new classifier weights of all the ICD codes with less samples through a formula (2), wherein the formula (2) is as follows:
W_s* = φ*(F̄_s)    (2)
In formula (2), W_s* denotes the new classifier weights of all the few-sample ICD codes; φ* is the meta-knowledge; and F̄_s is the averaged feature representation of all the few-sample ICD codes, which includes the averaged feature representation of each few-sample ICD code;
step S64, calculating a new classifier weight through a formula (3) based on the classifier weights of all frequent ICD codes and the new classifier weights of all the few-sample ICD codes, wherein the formula (3) is as follows:
W_new = [W_f ; W_s*]    (3)
In formula (3), W_new denotes the new classifier weights; W_f denotes the classifier weights of all frequent ICD codes; and W_s* denotes the new classifier weights of all the few-sample ICD codes.
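Formulas (1) to (3) can be sketched as follows. The invention does not specify the architecture of the meta-network φ, so this sketch assumes a linear φ with a squared loss, under which minimising formula (1) reduces to a least-squares fit; all data here are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
dc, R, n_few = 50, 40, 10        # feature dim, frequent codes, few-sample codes

# Synthetic stand-ins for the averaged feature representations and the
# frequent-code classifier weights learned in step S5 (illustrative values).
F_freq = rng.standard_normal((R, dc))                    # stacked f-bar_r
W_freq = F_freq @ (rng.standard_normal((dc, dc)) * 0.1)  # stacked w_r
F_few = rng.standard_normal((n_few, dc))                 # averaged few-sample features

# Formula (1): fit phi by minimising sum_r L(w_r, phi(f-bar_r));
# with linear phi and squared loss this is exactly a least-squares problem.
Phi, *_ = np.linalg.lstsq(F_freq, W_freq, rcond=None)

# Formula (2): new few-sample weights from their averaged features.
W_few_new = F_few @ Phi

# Formula (3): final weights keep the frequent part, replace the few-sample part.
W_new = np.vstack([W_freq, W_few_new])
print(W_new.shape)  # (50, 50): R + n_few rows, one weight vector per code
```

Note how the frequent-code weights are untouched by the transfer, matching the claim that few-sample performance improves without affecting frequent-code performance.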
In this embodiment, the meta-knowledge is transferred through a meta-network from data-rich frequent ICD codes to data-poor few-sample ICD codes, improving the performance of the few-sample ICD codes without affecting that of the frequent ICD codes. Through the knowledge transfer of the meta-network, the lack of labeled examples for few-sample codes in automatic coding can be addressed;
in this embodiment, without depending on data external to the task, the rich knowledge of frequent ICD codes is used to improve the classification performance of few-sample ICD codes and thus the performance of automatic ICD coding; meanwhile, the method of this embodiment can be extended to assisted-coding tasks in hospitals.
In a specific embodiment of the disclosure, the step S61 may further include a step S611, a step S612, and a step S613.
Step S611, collecting clinical documents containing the same frequent ICD codes to obtain a second document set;
Step S612, selecting a preset number of clinical documents from the second document set to obtain a third document set;
Step S613, averaging the feature representation of the frequent ICD code contained in the sentence vectors corresponding to all the clinical documents in the third document set, to obtain the averaged feature representation of that frequent ICD code.
The scheme in this embodiment can be understood as follows: suppose that, among all clinical documents, the ICD codes of document 1, document 2, document 3 and document 4 contain a first frequent ICD code (one of the frequent ICD codes). Those four documents are collected, and some of them are selected, for example document 2, document 3 and document 4. The feature representations of the first frequent ICD code contained in the sentence vectors corresponding to document 2, document 3 and document 4 are then averaged, giving the averaged feature representation of the first frequent ICD code.
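The example above can be sketched as follows; the function name, the support-set size and the toy feature vectors are illustrative assumptions (the same routine serves steps S614 to S616 for few-sample codes):

```python
import numpy as np

def averaged_representation(doc_features, code, doc_codes, n_support=3):
    """Steps S611-S613 sketch: gather the documents whose label set contains
    `code` (second document set), keep up to n_support of them (third set),
    and average that code's feature vector across them."""
    idx = [i for i, codes in enumerate(doc_codes) if code in codes][:n_support]
    return np.mean([doc_features[i] for i in idx], axis=0)

# Toy data: one feature vector of code "A" per document, and each document's label set.
feats = [np.full(4, float(i)) for i in range(5)]
codes = [{"A"}, {"A", "B"}, {"A"}, {"B"}, {"A"}]
fbar = averaged_representation(feats, "A", codes)
print(fbar)  # mean of documents 0, 1, 2 -> [1. 1. 1. 1.]
```

These averaged representations are the f̄_r inputs to the meta-network fitting step.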
In a specific embodiment of the disclosure, the step S61 may further include a step S614, a step S615, and a step S616.
Step S614, collecting the clinical documents containing the same few-sample ICD code to obtain a fourth document set;
Step S615, selecting a preset number of clinical documents from the fourth document set to obtain a fifth document set;
Step S616, averaging the feature representation of the few-sample ICD code contained in the sentence vectors corresponding to all the clinical documents in the fifth document set, to obtain the averaged feature representation of that few-sample ICD code.
Example 2
As shown in fig. 2, the present embodiment provides a medical text encoding apparatus, which includes an acquisition module 701, a first calculation module 702, a second calculation module 703, a third calculation module 704, a fourth calculation module 705, and a replacement module 706.
An obtaining module 701, configured to obtain a first document set, where the first document set includes at least one clinical document;
a first computing module 702 for generating a word embedding matrix using a word embedding technique based on the clinical document;
a second calculation module 703, configured to obtain an upper layer sequence vector based on the clinical document, the word embedding matrix, and a convolutional neural network;
a third calculation module 704, configured to obtain a sentence vector corresponding to each clinical document based on the upper layer sequence vector and the word embedding matrix;
a fourth calculation module 705, configured to obtain a preliminary classifier based on the sentence vector corresponding to each clinical document, where the preliminary classifier includes a classifier weight;
and a replacing module 706, configured to obtain a new classifier weight based on the sentence vector corresponding to the first document set and the clinical document, replace the classifier weight with the new classifier weight, obtain a final classifier, and use the final classifier to perform ICD encoding on the clinical document.
In this embodiment, a meta-network is used to transfer meta-knowledge from data-rich frequent ICD codes to data-poor few-sample ICD codes, which addresses the scarcity of labeled examples for few-sample ICD codes and improves their coding performance. Meanwhile, for the ICD coding task, although a convolutional neural network can learn text-related semantics, clinical documents are extremely long texts that contain not only potentially useful information but also a large amount of irrelevant noise; this embodiment therefore also adopts a label attention mechanism to capture the parts of the medical record text that are closely related to ICD coding. In addition, the feature representations are used to mitigate the problem of doctors' differing writing styles. Furthermore, this embodiment extracts features automatically without relying on manual feature engineering, and can reduce research time and matching errors.
In a specific embodiment of the disclosure, the first computing module 702 further includes a first obtaining unit 7021, an extracting unit 7022, a replacing unit 7023, and an initializing unit 7024.
The first obtaining unit 7021 is configured to obtain a word embedding dimension d and a preset word, where d is a positive integer between 100 and 300;
the extracting unit 7022 is used for extracting all words in the clinical document and removing duplication to obtain a first word list;
a replacing unit 7023, configured to replace all words that do not appear in the first vocabulary with the preset words, so as to obtain a second vocabulary;
an initializing unit 7024, configured to randomly initialize a d-dimension vector for each word in the second vocabulary, so as to obtain the word embedding matrix.
In a specific embodiment of the disclosure, the second computing module 703 further includes a conversion unit 7031, a setting unit 7032, and a learning unit 7033.
A conversion unit 7031, configured to convert each word in the clinical document into a low-dimensional vector, so as to obtain an input feature matrix, where the vector of each word in the clinical document is represented by the vector of the corresponding word in the word embedding matrix;
a setting unit 7032 for setting a word embedding dimension, a filter width, and a filter output size in a convolutional neural network;
and a learning unit 7033, configured to learn semantic information of the input feature matrix by using the set convolutional neural network, so as to obtain the upper layer sequence vector.
In a specific embodiment of the disclosure, the third computing module 704 further includes a second obtaining unit 7041, a deleting unit 7042, a first computing unit 7043, and a second computing unit 7044.
A second obtaining unit 7041, configured to obtain each ICD code description;
a deletion unit 7042, configured to lowercase all words in the ICD code description and delete stop words, so as to obtain a deleted ICD code description;
a first calculating unit 7043, configured to search the word embedding matrix for word embedding vectors corresponding to all words in the deleted ICD coding descriptions, and perform average calculation after searching to obtain an ICD coding description vector corresponding to each ICD coding description;
and a second calculating unit 7044, configured to obtain a sentence vector corresponding to each clinical document by using the upper layer sequence vector and all ICD code description vectors together through a tag attention mechanism, where the sentence vector includes a feature representation of each ICD code.
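The label attention step of unit 7044 can be sketched as below. The patent does not spell out the attention formula, so a standard softmax label attention is assumed, and the ICD description vectors are assumed to have already been projected to the same dimension as the convolved word features.

```python
import numpy as np

def label_attention(H, V):
    # H: (n, c) per-word features (the upper-layer sequence vector);
    # V: (L, c) one description vector per ICD code. For each code,
    # softmax-weight the words by their relevance to that code and pool.
    scores = V @ H.T                                  # (L, n) word-code relevance
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                 # attention over words
    return A @ H                                      # (L, c): one vector per code

rng = np.random.default_rng(1)
H = rng.normal(size=(30, 50))                         # 30 words, 50 features
V = rng.normal(size=(7, 50))                          # 7 ICD code descriptions
S = label_attention(H, V)                             # sentence vector per code
```

Each row of `S` is the feature representation of one ICD code for this document, which is what the classifier consumes next.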
In a specific embodiment of the disclosure, the fourth computing module 705 further includes a third obtaining unit 7051, a third computing unit 7052, and a fourth computing unit 7053.
A third obtaining unit 7051, configured to obtain a true value of each ICD code corresponding to the clinical document;
a third computing unit 7052, configured to pass the sentence vector corresponding to each clinical document sequentially through a fully connected layer and a sigmoid activation function, so as to obtain a predicted value of each ICD code corresponding to each clinical document;
the fourth calculating unit 7053 is configured to take a binary cross entropy of the true value and the predicted value as a target loss function, minimize the target loss function based on all the true values and the predicted values, and obtain the preliminary classifier, where the preliminary classifier includes the classifier weights, and the classifier weights are composed of the classifier weights encoded by each frequent ICD and the classifier weights encoded by each few-sample ICD.
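Units 7052-7053 amount to a per-code dot product, a sigmoid, and a binary cross entropy objective. The sketch below assumes one weight vector per ICD code and a scalar bias; variable names are illustrative.

```python
import numpy as np

def predict_and_loss(S, W, b, y):
    # S: (L, c) per-code sentence vectors; W: (L, c) per-code classifier
    # weights; y: (L,) 0/1 ground-truth codes for the document.
    logits = (S * W).sum(axis=1) + b        # full connection layer, per code
    p = 1.0 / (1.0 + np.exp(-logits))       # sigmoid activation -> predicted values
    eps = 1e-12                             # numerical guard for log(0)
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return p, bce

rng = np.random.default_rng(2)
S = rng.normal(size=(9, 50))
W = rng.normal(scale=0.1, size=(9, 50))
y = (rng.random(9) < 0.3).astype(float)     # true ICD codes for one document
p, bce = predict_and_loss(S, W, 0.0, y)
```

Minimising `bce` over all documents with respect to `W` (and the rest of the network) yields the preliminary classifier described above.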
In a specific embodiment of the disclosure, the replacing module 706 further includes a fifth calculating unit 7061, a sixth calculating unit 7062, a seventh calculating unit 7063, and an eighth calculating unit 7064.
A fifth calculating unit 7061, configured to obtain, based on sentence vectors corresponding to the first document set and the clinical document, an averaged feature representation of each frequent ICD code and an averaged feature representation of each few-sample ICD code;
a sixth calculating unit 7062, configured to map the feature representation averaged by each frequent ICD code to its corresponding classifier weight, and obtain the meta-knowledge by minimizing equation (1), where equation (1) is:
W_t = argmin_{W_t} Σ_{r=1}^{L_m} L(W_t · p_r, w_r)      (1)

in the formula (1), W_t is the meta-knowledge; r is the sequence number of the frequent ICD codes; L_m is the total number of frequent ICD codes; w_r is the classifier weight of the r-th frequent ICD code; p_r is the averaged feature representation of the r-th frequent ICD code; L is the loss function output value;
a seventh calculating unit 7063, configured to calculate, based on the meta knowledge, new classifier weights of all ICD codes with fewer samples according to a formula (2), where formula (2) is:
W_few = W_t · p_few      (2)

in the formula (2), W_few is the new classifier weights of all the few-sample ICD codes; W_t is the meta-knowledge; p_few is the averaged feature representation of all the few-sample ICD codes;
an eighth calculating unit 7064, configured to calculate, based on all the classifier weights of frequent ICD codes and all the new classifier weights of few-sample ICD codes, a new classifier weight according to a formula (3), where formula (3) is:
W_new = [W_frequent ; W_few]      (3)

in the formula (3), W_new is the new classifier weights; W_frequent is the classifier weights of all the frequent ICD codes; W_few is the new classifier weights of all the few-sample ICD codes, and [· ; ·] denotes combining the two groups of weights into one classifier.
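Units 7061-7064 can be sketched end to end as follows. The patent does not fix the loss function or optimiser for equation (1), so a squared loss with a small ridge term (which has a closed-form solution) is assumed here, and equation (3) is realised as stacking the two weight groups; both are assumptions of this sketch.

```python
import numpy as np

def learn_meta_knowledge(P_freq, W_freq, lam=1e-3):
    # Equation (1): find W_t that maps the averaged feature representation
    # p_r of each frequent code to its trained classifier weight w_r,
    # by minimising sum_r ||W_t p_r - w_r||^2 + lam ||W_t||^2.
    c = P_freq.shape[1]
    A = P_freq.T @ P_freq + lam * np.eye(c)
    return np.linalg.solve(A, P_freq.T @ W_freq).T         # (c, c) meta-knowledge

def transfer_weights(W_t, P_few):
    # Equation (2): new few-sample classifier weights from the meta-knowledge.
    return P_few @ W_t.T

rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 8))                           # hidden ground-truth map
P_freq = rng.normal(size=(40, 8))                          # averaged features, frequent codes
W_freq = P_freq @ W_true.T                                 # their trained classifier weights
W_t = learn_meta_knowledge(P_freq, W_freq)
P_few = rng.normal(size=(5, 8))                            # averaged features, few-sample codes
W_few_new = transfer_weights(W_t, P_few)
# Equation (3): the final classifier stacks frequent and few-sample weights.
W_new = np.vstack([W_freq, W_few_new])
```

Because the toy data is generated by a linear map, the recovered meta-knowledge closely matches it, which illustrates why weights transferred to the few-sample codes can be sensible despite their lack of labeled examples.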
In a specific embodiment of the disclosure, the fifth calculating unit 7061 further includes a first collection subunit 70611, a second collection subunit 70612, and a first calculating subunit 70613.
A first collection subunit 70611, configured to collect the clinical documents that contain the same frequent ICD code, so as to obtain a second document set;
a second collection subunit 70612, configured to select a preset number of clinical documents from the second document set, so as to obtain a third document set;
a first calculating subunit 70613, configured to average the feature representations of that frequent ICD code contained in the sentence vectors corresponding to all the clinical documents in the third document set, so as to obtain the averaged feature representation of the frequent ICD code.
In a specific embodiment of the disclosure, the fifth calculating unit 7061 further includes a third collection subunit 70614, a fourth collection subunit 70615, and a second calculating subunit 70616.
A third collection subunit 70614, configured to collect the clinical documents that contain the same few-sample ICD code, so as to obtain a fourth document set;
a fourth collection subunit 70615, configured to select a preset number of clinical documents from the fourth document set, so as to obtain a fifth document set;
a second calculating subunit 70616, configured to average the feature representations of that few-sample ICD code contained in the sentence vectors corresponding to all the clinical documents in the fifth document set, so as to obtain the averaged feature representation of the few-sample ICD code.
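The collect-subsample-average pattern used for both frequent and few-sample codes can be sketched in one helper. Function and variable names are illustrative; the patent does not specify how the preset number of documents is chosen, so random subsampling is assumed.

```python
import numpy as np

def averaged_code_feature(doc_vectors, doc_labels, code, n_max=None, seed=0):
    # doc_vectors: list of (L, c) arrays, one per clinical document, where
    # row l is the feature representation of ICD code l in that document;
    # doc_labels: list of sets of code indices carried by each document.
    idx = [i for i, labels in enumerate(doc_labels) if code in labels]
    if n_max is not None and len(idx) > n_max:
        # Select a preset number of documents (the third/fifth document set).
        rng = np.random.default_rng(seed)
        idx = list(rng.choice(idx, size=n_max, replace=False))
    feats = np.stack([doc_vectors[i][code] for i in idx])
    return feats.mean(axis=0)               # averaged feature representation

vecs = [np.full((3, 4), float(i)) for i in range(4)]   # toy documents, 3 codes, c = 4
labels = [{0}, {0, 1}, {1}, {1}]
avg0 = averaged_code_feature(vecs, labels, code=0)     # documents 0 and 1
avg1 = averaged_code_feature(vecs, labels, code=1, n_max=2)
```

The same call serves both cases: with a large candidate pool it subsamples (frequent codes), and with a tiny pool it simply averages everything available (few-sample codes).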
It should be noted that, regarding the apparatus in the above embodiments, the specific manner in which the respective modules perform the operations has been described in detail in the embodiments regarding the method, and will not be described in detail herein.
Example 3
Corresponding to the above method embodiments, the present disclosure further provides a medical text encoding apparatus; the medical text encoding apparatus described below and the medical text encoding method described above may be cross-referenced with each other.
Fig. 3 is a block diagram of a medical text encoding device 800, according to an example embodiment. As shown in fig. 3, the medical text encoding apparatus 800 may include: a processor 801, a memory 802. The medical text encoding device 800 may also include one or more of a multimedia component 803, an input/output (I/O) interface 804, and a communication component 805.
Wherein the processor 801 is configured to control the overall operation of the medical text encoding apparatus 800 to perform all or part of the steps of the medical text encoding method described above. The memory 802 is used to store various types of data to support operation at the medical text encoding device 800, which may include, for example, instructions for any application or method operating on the medical text encoding device 800, as well as application-related data, such as contact data, transceived messages, pictures, audio, video, and the like. The Memory 802 may be implemented by any type or combination of volatile or non-volatile Memory devices, such as static random access Memory (Static Random Access Memory, SRAM for short), electrically erasable programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), erasable programmable Read-Only Memory (Erasable Programmable Read-Only Memory, EPROM for short), programmable Read-Only Memory (Programmable Read-Only Memory, PROM for short), read-Only Memory (ROM for short), magnetic Memory, flash Memory, magnetic disk, or optical disk. The multimedia component 803 may include a screen and an audio component. Wherein the screen may be, for example, a touch screen, the audio component being for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 802 or transmitted through the communication component 805. The audio assembly further comprises at least one speaker for outputting audio signals. The I/O interface 804 provides an interface between the processor 801 and other interface modules, which may be a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. 
The communication component 805 is configured to provide wired or wireless communication between the medical text encoding device 800 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 805 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the medical text encoding device 800 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processor (Digital Signal Processor, DSP), digital signal processing device (Digital Signal Processing Device, DSPD), programmable logic device (Programmable Logic Device, PLD), field programmable gate array (Field Programmable Gate Array, FPGA), controller, microcontroller, microprocessor, or other electronic element for performing the medical text encoding method described above.
In another exemplary embodiment, a computer readable storage medium is also provided comprising program instructions which, when executed by a processor, implement the steps of the medical text encoding method described above. For example, the computer readable storage medium may be the memory 802 described above including program instructions executable by the processor 801 of the medical text encoding apparatus 800 to perform the medical text encoding method described above.
Example 4
Corresponding to the above method embodiments, the present disclosure further provides a readable storage medium; the readable storage medium described below and the medical text encoding method described above may be cross-referenced with each other.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the medical text encoding method of the above method embodiments.
The readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A medical text encoding method, comprising:
acquiring a first document set, wherein the first document set comprises at least one clinical document;
generating a word embedding matrix by using a word embedding technology based on the clinical document;
obtaining an upper layer sequence vector based on the clinical document, the word embedding matrix and a convolutional neural network;
obtaining sentence vectors corresponding to each clinical document based on the upper-layer sequence vectors and the word embedding matrix;
obtaining a preliminary classifier based on sentence vectors corresponding to each clinical document, wherein the preliminary classifier comprises classifier weights;
obtaining new classifier weights based on sentence vectors corresponding to the first document set and the clinical documents, replacing the classifier weights with the new classifier weights to obtain a final classifier, and performing ICD coding on the clinical documents by using the final classifier;
obtaining a preliminary classifier based on sentence vectors corresponding to each clinical document, wherein the preliminary classifier comprises classifier weights and comprises the following steps:
acquiring a true value of each ICD code corresponding to the clinical document;
sequentially passing sentence vectors corresponding to each clinical document through a full connection layer and a sigmoid activation function to obtain a predicted value of each ICD code corresponding to each clinical document;
taking binary cross entropy of the true value and the predicted value as a target loss function, and minimizing the target loss function based on all the true value and the predicted value to obtain the preliminary classifier, wherein the preliminary classifier comprises the classifier weight which consists of the classifier weight of each frequent ICD code and the classifier weight of each few-sample ICD code;
the method for obtaining the new classifier weight based on the sentence vectors corresponding to the first document set and the clinical document comprises the following steps:
based on sentence vectors corresponding to the first document set and the clinical document, obtaining an average characteristic representation of each frequent ICD code and an average characteristic representation of each few-sample ICD code;
mapping the feature representation averaged by each frequent ICD code to its corresponding classifier weight, and obtaining meta-knowledge by minimizing equation (1), the equation (1) is:
W_t = argmin_{W_t} Σ_{r=1}^{L_m} L(W_t · p_r, w_r)      (1)

in the formula (1), W_t is the meta knowledge; r is the sequence number of the frequent ICD codes; L_m is the total number of frequent ICD codes; w_r is the classifier weight of the r-th frequent ICD code; p_r is the averaged feature representation of the r-th frequent ICD code; L is the loss function output value;
based on the meta-knowledge, calculating to obtain new classifier weights of all the ICD codes with few samples through a formula (2), wherein the formula (2) is as follows:
W_few = W_t · p_few      (2)

in the formula (2), W_few is the new classifier weights of all the few-sample ICD codes; W_t is the meta knowledge; p_few is the average feature representation of all the few-sample ICD codes; wherein the average feature representation of all the few-sample ICD codes includes the feature representation of each few-sample ICD code after being averaged;
based on all the classifier weights of frequent ICD codes and all the new classifier weights of the few-sample ICD codes, calculating to obtain the new classifier weights through a formula (3), wherein the formula (3) is as follows:
W_new = [W_frequent ; W_few]      (3)

in the formula (3), W_new is the new classifier weights; W_frequent is the classifier weights of all the frequent ICD codes; W_few is the new classifier weights of all the few-sample ICD codes.
2. The medical text encoding method of claim 1, wherein generating a word embedding matrix using a word embedding technique based on the clinical document comprises:
acquiring word embedding dimension d and preset words, wherein d is a positive integer between 100 and 300;
extracting all words in the clinical document and removing duplication to obtain a first vocabulary;
replacing all words which do not appear in the first vocabulary with the preset words to obtain a second vocabulary;
and randomly initializing d-dimensional vectors for each word in the second vocabulary to obtain the word embedding matrix.
3. The medical text encoding method according to claim 1, wherein deriving an upper layer sequence vector based on the clinical document, the word embedding matrix, and a convolutional neural network, comprises:
converting each word in the clinical document into a low-dimensional vector to obtain an input feature matrix, wherein the vector of each word in the clinical document is represented by the vector of the corresponding word in the word embedding matrix;
setting word embedding dimension, filter width and filter output size in a convolutional neural network;
and learning semantic information of the input feature matrix by using the set convolutional neural network to obtain the upper-layer sequence vector.
4. The medical text encoding method according to claim 1, wherein obtaining sentence vectors corresponding to each of the clinical documents based on the upper layer sequence vectors and the word embedding matrix, comprises:
acquiring each ICD code description;
lowercase all words in the ICD coding description and delete stop words to obtain deleted ICD coding description;
searching word embedding vectors corresponding to all words in the deleted ICD code description in the word embedding matrix, and carrying out average calculation after searching to obtain ICD code description vectors corresponding to each ICD code description;
and jointly passing the upper-layer sequence vector and all ICD coding description vectors through a tag attention mechanism to obtain sentence vectors corresponding to each clinical document, wherein the sentence vectors contain characteristic representations of each ICD coding.
5. A medical text encoding device, comprising:
the acquisition module is used for acquiring a first document set, wherein the first document set comprises at least one clinical document;
the first calculation module is used for generating a word embedding matrix by using a word embedding technology based on the clinical document;
the second calculation module is used for obtaining an upper-layer sequence vector based on the clinical document, the word embedding matrix and the convolutional neural network;
the third calculation module is used for obtaining sentence vectors corresponding to each clinical document based on the upper-layer sequence vectors and the word embedding matrix;
a fourth calculation module, configured to obtain a preliminary classifier based on a sentence vector corresponding to each clinical document, where the preliminary classifier includes a classifier weight;
a replacement module, configured to obtain a new classifier weight based on sentence vectors corresponding to the first document set and the clinical document, replace the classifier weight with the new classifier weight, obtain a final classifier, and perform ICD encoding on the clinical document using the final classifier;
the fourth computing module further includes:
a third obtaining unit, configured to obtain a true value of each ICD code corresponding to the clinical document;
the third calculation unit is used for sequentially passing the sentence vectors corresponding to each clinical document through the full connection layer and the sigmoid activation function to obtain the predicted value of each ICD code corresponding to each clinical document;
a fourth calculation unit, configured to take a binary cross entropy of the true value and the predicted value as a target loss function, and minimize the target loss function based on all the true value and the predicted value, so as to obtain the preliminary classifier, where the preliminary classifier includes the classifier weight, and the classifier weight is composed of the classifier weight of each frequent ICD code and the classifier weight of each few-sample ICD code;
the replacement module further comprises:
a fifth calculation unit, configured to obtain, based on sentence vectors corresponding to the first document set and the clinical document, an averaged feature representation of each frequent ICD code and an averaged feature representation of each few-sample ICD code;
a sixth calculation unit, configured to map the feature representation averaged by each frequent ICD code to its corresponding classifier weight, and obtain the meta-knowledge by minimizing equation (1), where equation (1) is:
W_t = argmin_{W_t} Σ_{r=1}^{L_m} L(W_t · p_r, w_r)      (1)

in the formula (1), W_t is the meta knowledge; r is the sequence number of the frequent ICD codes; L_m is the total number of frequent ICD codes; w_r is the classifier weight of the r-th frequent ICD code; p_r is the averaged feature representation of the r-th frequent ICD code; L is the loss function output value;
a seventh calculating unit, configured to calculate, based on the meta knowledge, new classifier weights of all the few-sample ICD codes through a formula (2), where formula (2) is:
W_few = W_t · p_few      (2)

in the formula (2), W_few is the new classifier weights of all the few-sample ICD codes; W_t is the meta knowledge; p_few is the average feature representation of all the few-sample ICD codes;
an eighth calculation unit, configured to calculate, based on all the classifier weights of frequent ICD codes and all the new classifier weights of few-sample ICD codes, a new classifier weight by using a formula (3), where the formula (3) is:
W_new = [W_frequent ; W_few]      (3)

in the formula (3), W_new is the new classifier weights; W_frequent is the classifier weights of all the frequent ICD codes; W_few is the new classifier weights of all the few-sample ICD codes.
6. The medical text encoding device of claim 5, wherein the first computing module comprises:
the first acquisition unit is used for acquiring word embedding dimension d and preset words, wherein d is a positive integer between 100 and 300;
the extraction unit is used for extracting all words in the clinical document and removing duplication to obtain a first word list;
a replacing unit, configured to replace all words that do not appear in the first vocabulary with the preset words, so as to obtain a second vocabulary;
and the initializing unit is used for randomly initializing d-dimensional vectors for each word in the second vocabulary to obtain the word embedding matrix.
7. The medical text encoding device of claim 5, wherein the second computing module comprises:
the transformation unit is used for transforming each word in the clinical document into a low-dimensional vector to obtain an input feature matrix, wherein the vector of each word in the clinical document is represented by the vector of the corresponding word in the word embedding matrix;
the setting unit is used for setting word embedding dimension, filter width and filter output size in the convolutional neural network;
and the learning unit is used for learning the semantic information of the input feature matrix by using the set convolutional neural network to obtain the upper-layer sequence vector.
8. The medical text encoding device of claim 5, wherein the third computing module comprises:
a second obtaining unit, configured to obtain each ICD code description;
the deleting unit is used for lowercase all words in the ICD code description and deleting stop words to obtain a deleted ICD code description;
the first calculation unit is used for searching word embedding vectors corresponding to all words in the deleted ICD code description in the word embedding matrix, and carrying out average calculation after searching to obtain ICD code description vectors corresponding to each ICD code description;
and the second calculation unit is used for jointly passing the upper-layer sequence vector and all ICD coding description vectors through a tag attention mechanism to obtain sentence vectors corresponding to each clinical document, wherein the sentence vectors contain characteristic representations of each ICD coding.
9. A medical text encoding device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the medical text encoding method according to any one of claims 1 to 4 when executing said computer program.
10. A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the medical text encoding method according to any of claims 1 to 4.
CN202210169875.9A 2022-02-24 2022-02-24 Medical text coding method, device, equipment and readable storage medium Active CN114528944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210169875.9A CN114528944B (en) 2022-02-24 2022-02-24 Medical text coding method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210169875.9A CN114528944B (en) 2022-02-24 2022-02-24 Medical text coding method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114528944A CN114528944A (en) 2022-05-24
CN114528944B true CN114528944B (en) 2023-08-01

Family

ID=81624415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210169875.9A Active CN114528944B (en) 2022-02-24 2022-02-24 Medical text coding method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114528944B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382272B (en) * 2020-03-09 2022-11-01 西南交通大学 Electronic medical record ICD automatic coding method based on knowledge graph
CN116644719A (en) * 2023-05-29 2023-08-25 南通大学 Element coding method for clinical research literature and application of element coding method in diabetic retinopathy

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918671A (en) * 2019-03-12 2019-06-21 西南交通大学 Electronic health record entity relation extraction method based on convolution loop neural network
CN112579778A (en) * 2020-12-23 2021-03-30 重庆邮电大学 Aspect-level emotion classification method based on multi-level feature attention
CN113779244A (en) * 2021-08-23 2021-12-10 华南师范大学 Document emotion classification method and device, storage medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7403890B2 (en) * 2002-05-13 2008-07-22 Roushar Joseph C Multi-dimensional method and apparatus for automated language interpretation
US20150302436A1 (en) * 2003-08-25 2015-10-22 Thomas J. Reynolds Decision strategy analytics
US10824815B2 (en) * 2019-01-02 2020-11-03 Netapp, Inc. Document classification using attention networks
CN111382272B (en) * 2020-03-09 2022-11-01 西南交通大学 Electronic medical record ICD automatic coding method based on knowledge graph
US20230164336A1 (en) * 2020-04-09 2023-05-25 Nokia Technologies Oy Training a Data Coding System Comprising a Feature Extractor Neural Network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918671A (en) * 2019-03-12 2019-06-21 西南交通大学 Electronic health record entity relation extraction method based on convolution loop neural network
CN112579778A (en) * 2020-12-23 2021-03-30 重庆邮电大学 Aspect-level emotion classification method based on multi-level feature attention
CN113779244A (en) * 2021-08-23 2021-12-10 华南师范大学 Document emotion classification method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN114528944A (en) 2022-05-24

Similar Documents

Publication Publication Date Title
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN108334574B (en) Cross-modal retrieval method based on collaborative matrix decomposition
CN114528944B (en) Medical text coding method, device, equipment and readable storage medium
Wood et al. The sequence memoizer
CN110019794B (en) Text resource classification method and device, storage medium and electronic device
Pan et al. Product quantization with dual codebooks for approximate nearest neighbor search
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN110727765B (en) Problem classification method and system based on multi-attention machine mechanism and storage medium
CN111597341B (en) Document-level relation extraction method, device, equipment and storage medium
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN112434131A (en) Text error detection method and device based on artificial intelligence, and computer equipment
CN112347223A (en) Document retrieval method, document retrieval equipment and computer-readable storage medium
CN111859967A (en) Entity identification method and device and electronic equipment
CN115408495A (en) Social text enhancement method and system based on multi-modal retrieval and keyword extraction
CN111767697A (en) Text processing method and device, computer equipment and storage medium
CN113065349A (en) Named entity recognition method based on conditional random field
CN113806554A (en) Knowledge graph construction method for massive conference texts
CN110705279A (en) Vocabulary selection method and device and computer readable storage medium
CN112598039A (en) Method for acquiring positive sample in NLP classification field and related equipment
CN117422065A (en) Natural language data processing system based on reinforcement learning algorithm
CN109815475B (en) Text matching method and device, computing equipment and system
CN116628192A (en) Text theme representation method based on Seq2Seq-Attention
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
Zhong et al. Deep convolutional hamming ranking network for large scale image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant