CN113033155A - Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists - Google Patents
- Publication number
- CN113033155A (application CN202110597714.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- medical
- vector data
- clinical
- hierarchical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a method and a device for automatically coding medical concepts by combining sequence generation with a hierarchical word list. It addresses the high cost, limited efficiency and low accuracy of the prior-art practice of manually mapping medical concepts in clinical medical texts to standard medical term codes.
Description
Technical Field
The invention relates to the field of medical concept coding, in particular to a medical concept automatic coding method combining sequence generation and a hierarchical word list.
Background
Automatic coding of medical concepts is an important research direction in the field of medical information processing. In a medical information system, the same standard medical term can be expressed in many different ways. This inconsistent and imprecise expression seriously hinders the integration, sharing and use of medical big data, and causes great inconvenience to clinical practice, teaching and scientific research. A medical code is a numeric and alphabetic tagging system that provides a unique and uniform coded representation for each diagnosis, symptom, or combination of symptoms. At present, medical institutions map the medical concepts in clinical medical texts to standard medical term codes by hand. Manual coding requires a large number of professionals with medical knowledge, so it is costly, limited in efficiency, and not very accurate.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies, aiming at solving the problems of high cost, limited efficiency and low accuracy in the prior art that a manual encoding method is adopted to manually map medical concepts in clinical medical texts to standard medical term codes.
The technical scheme adopted by the invention for solving the problems is as follows:
In a first aspect, an embodiment of the present invention provides a method for automatically encoding medical concepts in combination with sequence generation and a hierarchical vocabulary, where the method includes:
acquiring a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text;
acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list;
inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming the encoded data corresponding to the clinical medical text according to the encoded data.
In one embodiment, the obtaining of the clinical medical text and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text includes:
inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
inputting the mapping data into an encoder, and acquiring initial vector data generated by the encoder based on the mapping data.
In one embodiment, the obtaining of pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm, and obtaining standard medical term vector data of the hierarchical vocabulary includes:
acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into parent nodes and child nodes according to the coding information;
acquiring the parent nodes, the child nodes and the parent-child relationship information between them, and constructing hierarchical vocabulary data from the parent nodes, the child nodes and the parent-child relationship information;
inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent nodes, the child nodes and the parent-child relationship information;
and taking the vector data representing the parent nodes, the child nodes and the parent-child relationship information as the standard medical term vector data of the hierarchical vocabulary.
In one embodiment, the coding information includes both letter field information and number field information.
In one embodiment, the acquiring of coding information of standard medical term data in the term dictionary data and dividing the standard medical term data into parent nodes and child nodes according to the coding information includes:
taking each standard medical term data as a node;
taking nodes whose letter field information is the same and whose number field information has the same digits before a preset position as nodes of the same type;
and, within each type, taking the node with the shortest number field information as the parent node and the remaining nodes as child nodes.
In one embodiment, the decoder includes a classifier, the classifier includes tags of a plurality of standard medical terms, the inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder sequentially generates encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data includes:
acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
determining, by the classifier, encoded data output by the decoder at a current time step corresponding to the clinical medical text based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
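The repeat-until-exhausted loop described in this embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the `END` stop label, the `max_steps` safety limit, and the `toy_step` stand-in for the decoder and classifier are all assumptions introduced here, and the codes are illustrative only.

```python
from typing import Callable, List, Sequence

# Hypothetical stop label: the text only says the process repeats
# "until no encoded data can be generated".
END = "<end>"


def generate_codes(
    text_vector: Sequence[float],
    step: Callable[[Sequence[float], List[str]], str],
    max_steps: int = 20,
) -> List[str]:
    """Call `step` repeatedly, feeding back every code produced so far."""
    history: List[str] = []
    for _ in range(max_steps):  # assumed safety limit against endless loops
        code = step(text_vector, history)
        if code == END:
            break
        history.append(code)
    return history


def toy_step(text_vector: Sequence[float], history: List[str]) -> str:
    """Stand-in for decoder plus classifier: emits two fixed codes, then stops."""
    planned = ["J00", "J001"]
    return planned[len(history)] if len(history) < len(planned) else END


print(generate_codes([0.1, 0.2], toy_step))
```

In a real system `step` would wrap the decoder and classifier; passing it in as a callable keeps the loop itself independent of the model.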
In one embodiment, the classifier includes a probability function, and the determining, by the classifier, of the encoded data output by the decoder at the current time step corresponding to the clinical medical text based on the initial vector data and the sequence data includes:
fusing the initial vector data with vector data corresponding to the sequence data to obtain fused vector data;
inputting the fused vector data into the probability function, and acquiring probability values of a plurality of possible encoded data generated by the probability function based on the fused vector data;
and sorting the probability values by magnitude, and taking the encoded data with the largest probability value as the encoded data output by the decoder at the current time step.
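One way to picture the fuse, score and select steps is the sketch below. It rests on several assumptions that the text does not state: fusion is done by concatenation, the probability function is a softmax over linear scores, and the weight table and codes are invented for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def fuse(text_vec: Sequence[float], history_vec: Sequence[float]) -> List[float]:
    # Assumption: "fusing" is modelled as plain concatenation.
    return list(text_vec) + list(history_vec)


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_code(
    fused: Sequence[float], weights: Dict[str, Sequence[float]]
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate code, sort by probability, return the best one."""
    labels = list(weights)
    scores = [sum(w * x for w, x in zip(weights[label], fused)) for label in labels]
    ranked = sorted(zip(labels, softmax(scores)), key=lambda p: p[1], reverse=True)
    return ranked[0][0], dict(ranked)


# Hypothetical per-code weight vectors (one row per classifier label).
weights = {"J00": [2.0, 0.0, 0.0, 0.0], "J001": [0.0, 0.0, 0.0, 1.0], "K12": [0.0, 1.0, 0.0, 0.0]}
best, probs = pick_code(fuse([1.0, 0.0], [0.0, 1.0]), weights)
print(best)
```

Sorting the full list, rather than taking only the maximum, mirrors the wording of the step and leaves the ranked alternatives available for inspection.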
In a second aspect, an embodiment of the present invention further provides an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies, wherein the apparatus includes:
the acquisition module is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder and obtaining initial vector data of the clinical medical text;
the learning module is used for acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm and acquiring standard medical term vector data of the hierarchical word list;
and the encoding module is used for inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
The invention has the following beneficial effects: the embodiment of the invention converts the coding of medical concepts in clinical medical texts into a sequence generation problem and introduces a hierarchical word list to strengthen the relationships among medical terms, so that, during sequence generation, the standard medical terms corresponding to a clinical medical text are accurately determined and automatically coded with the help of the hierarchical word list. This addresses the high cost, limited efficiency and low accuracy of the prior-art practice of manually mapping medical concepts in clinical medical texts to standard medical term codes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flow chart of a method for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a frame of a Seq2Seq model provided in the embodiment of the present invention.
Fig. 3 is a block connection diagram of an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies according to an embodiment of the present invention.
Fig. 4 is a schematic block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back) are involved in the embodiment of the present invention, they are only used to explain the relative positional relationship, movement and the like of the components in a specific posture (as shown in the drawing); if the specific posture changes, the directional indications change accordingly.
With the rapid development and spread of information technologies such as the internet, big data, cloud computing and artificial intelligence, human production and daily life have been affected to an unprecedented degree. In recent years, information technology has been applied to more and more aspects of social life. Across industries, it has changed the way people manage, analyse and apply data, and fields such as the economy and culture now depend on it. Among these application areas, medical care is one of the most important and has enormous potential. The medical field involves a large amount of information processing, and medical information has the following characteristics:
1) the data volume is large and the growth speed is high;
2) the demand for sharing is high.
The greatest advantage of information technology lies in the efficiency of data processing, so it is now widely applied in the medical field, which has given rise to medical information processing as a branch of computer applications. Medical information processing combines computer technology with the needs of the medical and health industry: it meets the requirements of medical institutions and related departments for collecting, organizing, storing and analysing medical and health information, improves the efficiency of the health industry, and satisfies the functional requirements of its users. Medical information processing technology has improved both the efficiency and the accuracy of medical information processing and has taken medical informatization to a new level. How to use this technology to effectively raise the level of medical care and promote medical development has long been one of the most actively studied problems in the field.
Automatic coding of medical concepts is an important research direction in the field of medical information processing. Clinical medical text generally refers to text, written by medical staff in the course of medical activity, that describes a patient's clinical presentation; it may contain several medical concepts. In a medical information system, the same standard medical term may be expressed in many different ways. First, recording styles differ between medical workers and, in the interest of efficiency, their notes often contain synonyms, acronyms, foreign-language terms, or colloquial expressions. As a result, one term frequently corresponds to multiple expressions in clinical medical texts. For example, in Chinese clinical medical text the term "congenital scoliosis" may also be written as "congenital scoliosis deformity"; in English clinical medical text, "heart attack", "MI" and "myocardial attack" may all mean "myocardial infarction". Second, some diagnosis- or symptom-related medical concepts are closely related and easily confused, and the same medical concept may correspond to different standard medical terms depending on context. For example, in Chinese clinical medical text the diagnosis-related medical concept "nasopharyngeal fistula" may correspond to the standard medical term "sinus fistula" or to the standard medical term "pharyngeal fistula", depending on the context information. Such inconsistent and imprecise expression seriously hinders the integration, sharing and use of medical big data and causes great inconvenience to clinical practice, teaching and scientific research in the medical field.
A medical code is a numeric and alphabetic tagging system that provides a unique coded representation for each diagnosis, symptom, or combination of symptoms. Normalizing the medical concepts in clinical medical texts to the codes of the corresponding standard medical terms in a medical coding system is therefore particularly urgent for advancing medical informatization. At present, some medical institutions map the medical concepts in clinical medical texts to standard medical term codes by hand. In this process, coders review the medical concepts and other relevant information in the clinical medical text and then manually assign appropriate standard medical term codes to these concepts according to the coding guidelines. Since medical institutions generate massive amounts of text every day, manual coding requires a large number of professionals with medical knowledge; it is costly, limited in efficiency, and not very accurate.
To address these defects of the prior art, the invention provides a method for automatically coding medical concepts that combines sequence generation with a hierarchical word list. The method converts the coding of medical concepts in clinical medical texts into a sequence generation problem and introduces a hierarchical word list to strengthen the relationships among medical terms, so that, during sequence generation, the standard medical terms corresponding to a clinical medical text are accurately determined and automatically coded with the help of the hierarchical word list. This addresses the high cost, limited efficiency and low accuracy of the prior-art practice of manually mapping medical concepts in clinical medical texts to standard medical term codes.
As shown in fig. 1, the present embodiment provides a method for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies, the method comprising the steps of:
step S100, obtaining a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text.
To automatically code the medical concepts in a clinical medical text, this embodiment first acquires the clinical medical text to be coded. Because the main objective of this embodiment is to use computer technology to map medical concepts automatically to the encoded data of the corresponding standard medical terms, the clinical medical text must first be processed into a form on which a computer can perform calculations. Specifically, as shown in fig. 2, this embodiment uses a Seq2Seq model as the main generation framework. A Seq2Seq model is suited to tasks in which the output length is not known in advance. In machine translation, for example, the English translation of a Chinese sentence may be shorter or longer than the original, so the output length is uncertain, which is exactly the case a Seq2Seq model handles. The Seq2Seq model in this embodiment comprises an encoder and a decoder. The encoder, which may be a bidirectional LSTM, converts the input text data into vector form, and this vector can be regarded as a semantic vector of the input text. Specifically, after a clinical medical text is acquired, it is input into the encoder of the preset Seq2Seq model, which encodes it into vector form, yielding the initial vector data of the clinical medical text.
In one implementation, the step S100 specifically includes the following steps:
step S110, inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
and step S120, inputting the mapping data into an encoder, and acquiring initial vector data of the clinical medical text generated by the encoder based on the mapping data.
To obtain the initial vector data, this embodiment first inputs the clinical medical text into a word embedding layer. The word embedding layer maps the features of the words in the text to a lower-dimensional space and outputs the mapping data, so that the model has fewer parameters and trains faster. The mapping data is then input into the encoder, which encodes it and generates the initial vector data. For example, this embodiment uses a bidirectional LSTM as the encoder. At the encoder end, a clinical medical text is first obtained, and its words are mapped to vector representations by the word embedding layer. These vector representations are then encoded by the bidirectional LSTM, which yields a forward and a backward hidden layer representation.
Since the encoder used in this embodiment is a bidirectional LSTM, the two hidden layer representations output by the bidirectional LSTM need to be spliced (concatenated) to obtain the final hidden layer representation.
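For orientation, a standard way to write this encoding step is given here. The symbols are assumptions introduced for this sketch (an input of n words x_1, ..., x_n, embeddings e_i, and hidden states h_i) and are not taken from the patent's own formulas.

```latex
% Assumed notation: x_i = i-th word, e_i = its embedding, h_i = hidden state.
\begin{align}
  e_i &= \mathrm{Embed}(x_i), \qquad i = 1, \dots, n \\
  \overrightarrow{h}_i &= \overrightarrow{\mathrm{LSTM}}\bigl(e_i, \overrightarrow{h}_{i-1}\bigr) \\
  \overleftarrow{h}_i &= \overleftarrow{\mathrm{LSTM}}\bigl(e_i, \overleftarrow{h}_{i+1}\bigr) \\
  h_i &= \bigl[\,\overrightarrow{h}_i \,;\, \overleftarrow{h}_i\,\bigr]
\end{align}
```

The last line is the splicing step: the forward and backward states for each word are concatenated into a single vector.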
As shown in fig. 1, the method further comprises the steps of:
step S200, acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list.
To implement automatic coding of medical concepts, this embodiment also needs to acquire pre-constructed hierarchical vocabulary data, that is, vocabulary data containing the hierarchical relationships between the standard medical terms. Because this embodiment relies on a deep learning model, the standard medical terms in the hierarchical vocabulary must also be converted into a vector format that the model can process. The hierarchical vocabulary data is therefore input into a preset learning algorithm to obtain the standard medical term vector data of the hierarchical vocabulary.
In one implementation, the step S200 specifically includes the following steps:
step S210, acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information;
step S220, acquiring the parent nodes, the child nodes and the parent-child relationship information between them, and constructing hierarchical vocabulary data from the parent nodes, the child nodes and the parent-child relationship information;
step S230, inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent nodes, the child nodes and the parent-child relationship information;
and step S240, taking the vector data representing the parent nodes, the child nodes and the parent-child relationship information as the standard medical term vector data of the hierarchical vocabulary.
First, this embodiment pre-constructs a hierarchical vocabulary. To do so, it acquires the coding information of the standard medical term data in the term dictionary data and divides the standard medical term data into parent nodes and child nodes according to that coding information, which determines the inclusion relationships between the standard medical terms. In one implementation, the coding information may include letter field information and number field information; for example, a standard medical term may correspond to the code J001. Each standard medical term data is taken as a node. Nodes whose letter field information is the same and whose number field information has the same digits before a preset position are treated as nodes of the same type. Within each type, the node with the shortest number field information is taken as the parent node, and the remaining nodes are taken as child nodes.
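The grouping rule just described can be illustrated with a short sketch. It assumes codes consist of a run of letters followed by a run of digits, as in the J001 example, and it treats the "preset position" as a parameter named `prefix_len`; both the parameter name and the sample codes are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed code shape: letters followed by digits, as in "J001".
CODE_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)$")


def split_code(code: str) -> Tuple[str, str]:
    """Split a code into its letter field and its number field."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unexpected code format: {code!r}")
    return match.group(1), match.group(2)


def group_parent_children(codes: Sequence[str], prefix_len: int = 2) -> Dict[str, List[str]]:
    """Group codes by letter field and leading digits, then pick each group's parent."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for code in codes:
        letters, digits = split_code(code)
        groups[(letters, digits[:prefix_len])].append(code)
    result: Dict[str, List[str]] = {}
    for members in groups.values():
        # The member with the shortest number field becomes the parent node.
        parent = min(members, key=lambda c: len(split_code(c)[1]))
        result[parent] = [c for c in members if c != parent]
    return result


print(group_parent_children(["J00", "J001", "J002", "K12", "K121"]))
```

With `prefix_len=2`, the codes J00, J001 and J002 fall into one group with J00 as the parent, and K12 and K121 into another with K12 as the parent.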
In one implementation, this embodiment may design an algorithm to determine the parent nodes and child nodes; the structure they form is a tree-shaped hierarchy. The specific algorithm is as follows:
A. Define the data structure of a tree node. Each tree node comprises two parts: a code string and a child node list. Go to step B;
B. Initialize the root node of the tree; its code is the empty string and its child node list is empty. Go to step C;
C. If the standard medical term dictionary is empty, the algorithm ends. Otherwise, take a (code, term) pair from the standard medical term dictionary and go to step D;
D. Set the current node to the root node. If the code of the current node is a prefix (letter field information) of the taken code and the two codes are different, set the break-loop flag to false and go to step E; otherwise, go to step H;
E. If the child node list of the current node is empty, go to step G. Otherwise, take a child node and go to step F;
F. If the code of the child node is a prefix of the taken code and the two codes are different, set the current node to that child node, set the break-loop flag to true, and go to step F; otherwise, go to step E;
G. If the break-loop flag is true, go to step C. Otherwise, go to step D;
H. Initialize a new node whose code is the taken code and whose child node list is empty. Add the new node to the child node list of the current node and go to step C.
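The following sketch is one structured reading of steps A to H, not a literal transcription of their jump targets: each code is attached under the deepest existing node whose code is a proper prefix of it. Processing shorter codes first is an assumption added here so that a parent exists before its children arrive; the sample codes and terms are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TreeNode:
    code: str  # code string (empty for the root)
    children: List["TreeNode"] = field(default_factory=list)


def is_proper_prefix(prefix: str, code: str) -> bool:
    return code.startswith(prefix) and prefix != code


def build_tree(term_dict: Dict[str, str]) -> TreeNode:
    root = TreeNode(code="")
    # Assumption: shorter codes are inserted first (not stated in the steps).
    for code in sorted(term_dict, key=len):
        current = root
        descended = True
        while descended:
            descended = False
            for child in current.children:
                if is_proper_prefix(child.code, code):
                    current = child  # move one level down and search again
                    descended = True
                    break
        current.children.append(TreeNode(code=code))
    return root


def parent_child_pairs(node: TreeNode) -> List[Tuple[str, str]]:
    """Depth-first list of (parent code, child code) pairs."""
    pairs: List[Tuple[str, str]] = []
    for child in node.children:
        pairs.append((node.code, child.code))
        pairs.extend(parent_child_pairs(child))
    return pairs


terms = {"J001": "term A", "J00": "term B", "K12": "term C", "J0011": "term D"}
print(parent_child_pairs(build_tree(terms)))
```

The resulting (parent, child) pairs are the kind of parent-child relationship information that a later embedding step could consume.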
After the parent nodes and child nodes are determined, the parent nodes, the child nodes, and the parent-child relationship information between them are acquired, and the hierarchical vocabulary data is constructed from them. The hierarchical vocabulary data is then input into a preset learning algorithm, such as the TransE algorithm, to obtain vector data representing the parent nodes, the child nodes, and the parent-child relationship information. Finally, this vector data is taken as the standard medical term vector data of the hierarchical vocabulary. In short, this embodiment represents the standard medical terms as low-dimensional dense vectors that capture the inclusion relationships between them, so that standard medical terms with similar semantics lie close together in the vector space. In addition, by constructing a hierarchical vocabulary, this embodiment determines the subordinate or hierarchical relationship of each standard medical term. This makes the hierarchy among the standard medical terms clearer, allows the standard medical terms corresponding to the medical concepts in a clinical medical text to be selected more reliably, and ultimately achieves automatic coding of medical concepts.
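To make the TransE idea concrete, the sketch below shows its scoring function and the usual margin-based ranking loss. The L2 norm, the margin of 1.0, the direction of the relation (parent plus relation lands near child), and the two-dimensional vectors are assumptions for illustration; the text names TransE only as one possible learning algorithm.

```python
import math
from typing import Sequence


def transe_score(head: Sequence[float], relation: Sequence[float], tail: Sequence[float]) -> float:
    """Distance ||head + relation - tail||; a smaller value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(positive: float, negative: float, margin: float = 1.0) -> float:
    """Hinge loss that pushes true triples at least `margin` below corrupted ones."""
    return max(0.0, margin + positive - negative)


# Hypothetical embeddings: parent + relation lands exactly on the child.
parent = [0.0, 0.0]
has_child = [1.0, 0.0]
child = [1.0, 0.0]
unrelated = [0.0, 3.0]

pos = transe_score(parent, has_child, child)      # true (parent, relation, child) triple
neg = transe_score(parent, has_child, unrelated)  # corrupted triple
print(pos, neg, margin_loss(pos, neg))
```

Training would adjust the vectors to reduce this loss over many true and corrupted triples, which is what pulls related terms close together in the vector space.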
As shown in fig. 1, the method further comprises the steps of:
step S300, inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating coded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
The Seq2Seq model used in this embodiment further includes a decoder; in one implementation, a unidirectional LSTM may be used as the decoder. Specifically, after the initial vector data of the clinical medical text and the standard medical term vector data are acquired, both are input into the decoder. The decoder then sequentially generates encoded data corresponding to a plurality of standard medical terms, and the standard medical term sequence data corresponding to the clinical medical text is formed from the encoded data. In short, when coding the medical concepts in the clinical medical text, this embodiment refers not only to the hierarchical relationship between the standard medical terms but also to the sequence formed by the plurality of generated standard medical terms, and finally determines the correct codes corresponding to the medical concepts in the clinical medical text.
In one implementation, the decoder includes a classifier, and the classifier includes a plurality of labels of standard medical terms, and the step S300 specifically includes the following steps:
step S310, acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
step S320, determining the coded data output by the decoder at the current time step corresponding to the clinical medical text by the classifier based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
and S330, forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
To automatically code the medical concepts in the clinical medical text, this embodiment first acquires sequence data composed of all historical standard medical term vector data output by the decoder, that is, the standard medical term vector data corresponding to the codes output by the decoder before the current time step. Then, to determine the standard medical term vector data with the highest similarity to the clinical medical text at the current time step, this embodiment inputs the initial vector data and the sequence data into a classifier. The classification space of the classifier can be regarded as a label set in which each label corresponds to one standard medical term. Based on the initial vector data and the sequence data, the classifier determines the encoded data to be output by the decoder at the current time step for the clinical medical text. Specifically, a probability function is provided in the classifier. To determine the encoded data output by the decoder at the current time step, the initial vector data and the vector data corresponding to the sequence data are first fused to obtain fused vector data. The fused vector data is then input into the probability function, which generates probability values for a plurality of possible encoded data. These probability values indicate, to some extent, the degree of correlation between each candidate encoded data and the clinical medical text; in other words, this embodiment introduces an attention mechanism to determine the standard medical terms corresponding to the clinical medical text.
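For example, at the decoder side (the formulas are written here in the standard form of an attention-based Seq2Seq decoder, which is consistent with the definitions given), the hidden layer representation $s_t$ of the decoder at time $t$ is calculated as follows, where $g(y_{t-1})$ is the vector representation of the standard medical term corresponding to the code with the maximum probability in the output distribution $y_{t-1}$ at time $t-1$, namely the vector representation learned by the TransE algorithm:

$$s_t = \mathrm{LSTM}\big(s_{t-1},\; g(y_{t-1})\big)$$

When calculating the probability, an attention mechanism is needed to obtain the correlation coefficients between $s_t$ and the clinical medical text. Specifically, the attention mechanism is calculated as follows, where $c_t$ denotes the vector obtained by the attention mechanism at time $t$ and $h_i$ denotes the encoder representation of the $i$-th word of the clinical medical text:

$$e_{ti} = v^{\top}\tanh\left(W s_t + U h_i\right), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j}\exp(e_{tj})}, \qquad c_t = \sum_{i}\alpha_{ti}\,h_i$$

where $e_{ti}$ denotes the degree of correlation between $s_t$ and the $i$-th word of the clinical medical text, $\alpha_{ti}$ denotes the coefficient of the degree of correlation between $s_t$ and the $i$-th word of the clinical medical text, and $W$, $U$, and $v$ are all learning parameters.

Then the probability distribution is output through a classifier:

$$o_t = f\left(W_s s_t + W_c c_t\right), \qquad y_t = \mathrm{softmax}\left(W_o\, o_t\right)$$

where $o_t$ denotes the vector obtained at time $t$ by fusing $s_t$ and $c_t$ through a non-linear activation function $f$ (e.g., tanh, ReLU, etc.), $y_t$ denotes the output distribution at the decoder side at time $t$, and $W_o$, $W_s$, and $W_c$ are all learning parameters.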
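One decoder-side step of the kind described above (attention over the text, fusion with the hidden state, then a probability distribution over labels) can be sketched in NumPy as follows. This is a sketch under stated assumptions: additive attention with parameters `W`, `U`, `v`, a tanh fusion with parameters `W_s`, `W_c`, and a softmax output layer `W_o`; all names, shapes, and the choice of tanh are illustrative, and `H` stands for the encoder's per-word representations of the clinical medical text.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def decoder_output(s_t, H, W, U, v, W_s, W_c, W_o):
    """Return attention weights, context vector, and label distribution for one time step.

    s_t: (d_s,) decoder hidden state     H: (n, d_h) encoder states, one row per word
    W: (d_a, d_s)  U: (d_a, d_h)  v: (d_a,)          attention parameters
    W_s: (d_o, d_s)  W_c: (d_o, d_h)  W_o: (L, d_o)  fusion and output parameters
    """
    e = np.tanh(s_t @ W.T + H @ U.T) @ v   # (n,) relevance of each word to s_t
    alpha = softmax(e)                     # (n,) attention coefficients
    c_t = alpha @ H                        # (d_h,) context vector
    o_t = np.tanh(W_s @ s_t + W_c @ c_t)   # (d_o,) fused vector
    y_t = softmax(W_o @ o_t)               # (L,) distribution over term labels
    return alpha, c_t, y_t


# Random parameters, used only to exercise the shapes.
rng = np.random.default_rng(0)
n, d_h, d_s, d_a, d_o, n_labels = 5, 4, 3, 6, 7, 8
alpha, c_t, y_t = decoder_output(
    rng.normal(size=d_s), rng.normal(size=(n, d_h)),
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a),
    rng.normal(size=(d_o, d_s)), rng.normal(size=(d_o, d_h)),
    rng.normal(size=(n_labels, d_o)),
)
```

Taking `int(np.argmax(y_t))` would give the index of the label with the maximum probability at this time step.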
The obtained probability values are then sorted by value, and the encoded data with the maximum probability value is taken as the encoded data output by the decoder at the current time step. This process is repeated until no more encoded data can be generated. Finally, the standard medical term sequence data corresponding to the clinical medical text is formed from the encoded data, so that the clinical medical text is automatically coded.
In summary, although the prior art includes techniques that code medical concepts with machine learning methods, most of them generate codes with a greedy search strategy. Greedy search keeps only the standard medical term with the highest probability at each time step at the decoder side, so its search space is relatively limited. This embodiment instead adopts beam search: at each time step, the several sequences with the highest probabilities are kept as candidate sequences, and the sequence with the highest probability is finally selected as the target sequence.
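The difference between greedy search and beam search can be illustrated with the following minimal sketch. It assumes a hypothetical `step_fn(prefix)` that returns a mapping from each candidate label to its probability given the labels generated so far, and an explicit end-of-sequence label `<eos>` that marks the point where no further code is generated; neither name comes from the patent, and the toy probabilities are made up for the example.

```python
import math


def beam_search(step_fn, eos, beam_width=3, max_len=10):
    """Keep the `beam_width` most probable prefixes at each step; return the best sequence."""
    beams = [((), 0.0)]          # (label sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for label, p in step_fn(seq).items():
                if p > 0:
                    candidates.append((seq + (label,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            # Sequences that emitted the end label are complete; the rest stay in the beam.
            (finished if seq[-1] == eos else beams).append((seq, logp))
        if not beams:
            break
    finished.extend(beams)       # sequences cut off at max_len
    return max(finished, key=lambda c: c[1], default=((), 0.0))


# Toy next-label distributions (hypothetical), chosen so that greedy search is misled.
TOY = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"<eos>": 0.55, "C": 0.45},
    ("B",): {"<eos>": 0.9, "C": 0.1},
    ("A", "C"): {"<eos>": 1.0},
    ("B", "C"): {"<eos>": 1.0},
}


def toy_step(prefix):
    return TOY.get(prefix, {"<eos>": 1.0})


greedy_seq, _ = beam_search(toy_step, "<eos>", beam_width=1)
beam_seq, _ = beam_search(toy_step, "<eos>", beam_width=2)
```

With a beam width of 1 the search reduces to greedy search and commits to `A` (total probability 0.33), while a beam width of 2 also keeps `B` and finds the more probable sequence `B, <eos>` (0.36).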
Based on the above embodiment, the present invention further provides an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabulary, as shown in fig. 3, the apparatus comprising:
the acquisition module 01 is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder, and acquiring initial vector data generated by the encoder based on the clinical medical text;
the learning module 02 is used for acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list;
the encoding module 03 is configured to input the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generate encoded data corresponding to a plurality of standard medical terms, and form standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
Based on the above embodiments, the present invention further provides a terminal, a schematic block diagram of which may be as shown in fig. 4. The terminal comprises a processor, a memory, a network interface, and a display screen, which are connected through a system bus. The processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the terminal is used to connect to and communicate with an external terminal through a network. When executed by the processor, the computer program implements a method for automatically coding medical concepts in combination with sequence generation and hierarchical vocabularies. The display screen of the terminal may be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 4 shows only the part of the structure relevant to the solution of the invention and does not limit the terminals to which the solution may be applied; a particular terminal may include more or fewer components than those shown, may combine some components, or may have a different arrangement of components.
In one implementation, one or more programs are stored in the memory of the terminal and are configured to be executed by one or more processors; the one or more programs include instructions for performing the method for automatically coding medical concepts in combination with sequence generation and hierarchical vocabularies.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In summary, the invention discloses a method and a device for automatically coding medical concepts in combination with sequence generation and a hierarchical vocabulary. The method converts the task of coding medical concepts in clinical medical texts into a sequence generation problem, introduces a hierarchical vocabulary to strengthen the relationships among medical terms, and, during sequence generation, uses the hierarchical vocabulary to accurately determine the standard medical terms corresponding to the clinical medical text and to code them automatically. This solves the problems of the prior art, in which medical concepts in clinical medical texts are mapped to standard medical term codes manually, at high cost, with limited efficiency, and with low accuracy.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.
Claims (9)
1. A method for automatic coding of medical concepts in conjunction with sequence generation and hierarchical vocabularies, the method comprising:
acquiring a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text;
acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list;
inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
2. The method of claim 1, wherein the obtaining clinical medical texts, inputting the clinical medical texts into a preset encoder, and obtaining initial vector data of the clinical medical texts comprises:
inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
inputting the mapping data into an encoder, and acquiring initial vector data of the clinical medical text generated by the encoder based on the mapping data.
3. The method as claimed in claim 1, wherein the step of obtaining pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a predetermined learning algorithm, and obtaining standard medical term vector data of the hierarchical vocabulary comprises:
acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information;
acquiring the parent node, the child node, and the parent-child relationship information between the parent node and the child node, and constructing hierarchical vocabulary data according to the parent node, the child node, and the parent-child relationship information between the parent node and the child node;
inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent node, the child node, and the parent-child relationship information;
and taking vector data representing the parent node, the child nodes and the parent-child relationship information as standard medical term vector data of the hierarchical vocabulary.
4. The method as claimed in claim 3, wherein the coding information comprises letter field information and number field information.
5. The method as claimed in claim 4, wherein the obtaining of encoding information of standard medical term data in term dictionary data, and the dividing of the standard medical term data into parent nodes and child nodes according to the encoding information comprises:
taking each standard medical term data as a node;
taking, as nodes of the same type, a plurality of nodes whose letter field information is of the same type and whose number field information is identical before a preset position;
and among the nodes of the same type, taking the node with the shortest number field information as a parent node and taking the nodes other than the parent node as child nodes.
6. The method as claimed in claim 1, wherein the decoder comprises a classifier containing labels of a plurality of standard medical terms, the inputting the initial vector data of the clinical medical text and the generated vector data of the standard medical terms into a preset decoder sequentially generates encoded data corresponding to a plurality of standard medical terms, and the forming of the sequence data of the standard medical terms corresponding to the clinical medical text from the encoded data comprises:
acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
determining, by the classifier, encoded data output by the decoder at a current time step corresponding to the clinical medical text based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
7. The method of claim 6, wherein the classifier comprises a probability function, and the determining, by the classifier, of the encoded data output by the decoder at the current time step corresponding to the clinical medical text based on the initial vector data and the sequence data comprises:
fusing the initial vector data with vector data corresponding to the sequence data to obtain fused vector data;
inputting the fusion vector data into the probability function, and acquiring probability values of a plurality of possible coded data generated by the probability function based on the fusion vector data;
and sorting the probability values by value, and taking the encoded data with the maximum probability value as the encoded data output by the decoder at the current time step.
8. An apparatus for automatic coding of medical concepts in conjunction with sequence generation and hierarchical vocabularies, the apparatus comprising:
the acquisition module is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder and acquiring initial vector data generated by the encoder based on the clinical medical text;
the learning module is used for acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm and acquiring standard medical term vector data of the hierarchical word list;
and the encoding module is used for inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
9. A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to perform the steps of a method for automatic encoding of medical concepts in conjunction with sequence generation and hierarchical vocabularies of any of the preceding claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110597714.5A CN113033155B (en) | 2021-05-31 | 2021-05-31 | Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113033155A (en) | 2021-06-25 |
CN113033155B (en) | 2021-10-26 |
Family
ID=76455886
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202110597714.5A | Active | CN113033155B (en) | 2021-05-31 | 2021-05-31 | Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033155B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114091411A (en) * | 2021-11-30 | 2022-02-25 | 云知声智能科技股份有限公司 | Method, device and system for generating standardized medical text based on pointer network |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108182976A (en) * | 2017-12-28 | 2018-06-19 | 西安交通大学 | A kind of clinical medicine information extracting method based on neural network |
CN109299273A (en) * | 2018-11-02 | 2019-02-01 | 广州语义科技有限公司 | Based on the multi-source multi-tag file classification method and its system for improving seq2seq model |
CN109408820A (en) * | 2018-10-17 | 2019-03-01 | 长沙瀚云信息科技有限公司 | A kind of medical terminology mapped system and method, equipment and storage medium |
CN110705214A (en) * | 2019-08-27 | 2020-01-17 | 天津开心生活科技有限公司 | Automatic coding method and device |
CN110827929A (en) * | 2019-11-05 | 2020-02-21 | 中山大学 | Disease classification code recognition method and device, computer equipment and storage medium |
CN111063446A (en) * | 2019-12-17 | 2020-04-24 | 医渡云(北京)技术有限公司 | Method, apparatus, device and storage medium for standardizing medical text data |
CN112802568A (en) * | 2021-02-03 | 2021-05-14 | 紫东信息科技(苏州)有限公司 | Multi-label stomach disease classification method and device based on medical history text |
Also Published As
Publication number | Publication date |
---|---|
CN113033155B (en) | 2021-10-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11501182B2 (en) | Method and apparatus for generating model | |
CN110688854B (en) | Named entity recognition method, device and computer readable storage medium | |
CN114530223A (en) | NLP-based cardiovascular disease medical record structuring system | |
US20230244704A1 (en) | Sequenced data processing method and device, and text processing method and device | |
CN112100332A (en) | Word embedding expression learning method and device and text recall method and device | |
CN110991185A (en) | Method and device for extracting attributes of entities in article | |
CN116737879A (en) | Knowledge base query method and device, electronic equipment and storage medium | |
CN111563380A (en) | Named entity identification method and device | |
CN113657105A (en) | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement | |
CN111581972A (en) | Method, device, equipment and medium for identifying corresponding relation between symptom and part in text | |
CN113158676A (en) | Professional entity and relationship combined extraction method and system and electronic equipment | |
CN113421657A (en) | Construction method and device of knowledge representation model of clinical practice guideline | |
CN113033155B (en) | Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists | |
CN115935914A (en) | Admission record missing text supplementing method | |
CN116386895B (en) | Epidemic public opinion entity identification method and device based on heterogeneous graph neural network | |
CN117131873A (en) | Double-encoder pre-training small sample relation extraction method based on contrast learning | |
CN116702777A (en) | Chinese named entity recognition method, device, electronic equipment and storage medium | |
CN115270792A (en) | Medical entity identification method and device | |
CN116069946A (en) | Biomedical knowledge graph construction method based on deep learning | |
CN115659989A (en) | Web table abnormal data discovery method based on text semantic mapping relation | |
CN114358021A (en) | Task type dialogue statement reply generation method based on deep learning and storage medium | |
CN114372467A (en) | Named entity extraction method and device, electronic equipment and storage medium | |
CN112487811B (en) | Cascading information extraction system and method based on reinforcement learning | |
CN114911940A (en) | Text emotion recognition method and device, electronic equipment and storage medium | |
CN114444492A (en) | Non-standard word class distinguishing method and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||