CN113033155B - Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists - Google Patents

Publication number
CN113033155B
Authority
CN
China
Prior art keywords
data
medical
node
vector data
hierarchical
Prior art date
Legal status
Active
Application number
CN202110597714.5A
Other languages
Chinese (zh)
Other versions
CN113033155A (en)
Inventor
汤步洲
黄源航
熊英
陈清财
Current Assignee
Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Original Assignee
Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and a device for automatically coding medical concepts by combining sequence generation with a hierarchical word list. The method addresses the high cost, limited efficiency and low accuracy of the prior art, in which medical concepts in clinical medical texts are mapped to standard medical term codes by manual coding.

Description

Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists
Technical Field
The invention relates to the field of medical concept coding, in particular to a medical concept automatic coding method combining sequence generation and a hierarchical word list.
Background
Automatic coding of medical concepts is an important research direction in the field of medical information processing. In a medical information system, the same standard medical term can have a plurality of different medical concept expression modes, and the non-uniformity and inaccuracy of the expression modes seriously hinder the integration, sharing and utilization of medical big data, thereby bringing great inconvenience to clinic, teaching and scientific research in the medical field. A medical code is a numeric and alphabetic tagging system that provides a unique and uniform coded representation for each diagnosis, symptom, or combination of symptoms, etc. At present, medical institutions need to manually map medical concepts in clinical medical texts into standard medical term codes by adopting a manual coding mode, and the manual coding needs a large number of professionals with medical knowledge to operate, so that the cost is high, the efficiency is limited, and the accuracy is not high.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies, aiming at solving the problems of high cost, limited efficiency and low accuracy in the prior art that a manual encoding method is adopted to manually map medical concepts in clinical medical texts to standard medical term codes.
The technical scheme adopted by the invention for solving the problems is as follows:
In a first aspect, an embodiment of the present invention provides a method for automatically encoding medical concepts in combination with sequence generation and a hierarchical vocabulary, where the method includes:
acquiring a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text;
acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list;
inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming the encoded data corresponding to the clinical medical text according to the encoded data.
In one embodiment, the obtaining of the clinical medical text and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text includes:
inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
inputting the mapping data into an encoder, and acquiring initial vector data generated by the encoder based on the mapping data.
In one embodiment, the obtaining of pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm, and obtaining standard medical term vector data of the hierarchical vocabulary includes:
acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information;
acquiring the parent node, the child node and the parent-child relationship information between the parent node and the child node, and constructing hierarchical vocabulary data according to the parent node, the child node and the parent-child relationship information between the parent node and the child node;
inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent node, the child nodes and the parent-child relationship information;
and taking vector data representing the parent node, the child nodes and the parent-child relationship information as standard medical term vector data of the hierarchical vocabulary.
In one embodiment, the encoded information includes both alphabetic field information and numeric field information.
In one embodiment, the obtaining encoded information of standard medical term data in the term dictionary data, the dividing the standard medical term data into parent nodes and child nodes according to the encoded information includes:
taking each standard medical term data as a node;
taking a plurality of nodes whose alphabetic field information is identical and whose numeric field information has the same digits before a preset position as nodes of the same type;
and, among nodes of the same type, taking the node with the shortest numeric field information as the parent node and taking the nodes other than the parent node as child nodes.
In one embodiment, the decoder includes a classifier, the classifier includes tags of a plurality of standard medical terms, the inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder sequentially generates encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data includes:
acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
determining, by the classifier, encoded data output by the decoder at a current time step corresponding to the clinical medical text based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
In one embodiment, the classifier includes a probability function, and the determining, by the classifier, of the encoded data output by the decoder at the current time step corresponding to the clinical medical text based on the initial vector data and the sequence data includes:
fusing the initial vector data with vector data corresponding to the sequence data to obtain fused vector data;
inputting the fusion vector data into the probability function, and acquiring probability values of a plurality of possible coded data generated by the probability function based on the fusion vector data;
and sorting the probability values by magnitude, and taking the encoded data with the largest probability value as the encoded data output by the decoder at the current time step.
In a second aspect, an embodiment of the present invention further provides an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies, wherein the apparatus includes:
the acquisition module is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder and obtaining initial vector data of the clinical medical text;
the learning module is used for acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm and acquiring standard medical term vector data of the hierarchical word list;
and the encoding module is used for inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
The invention has the following beneficial effects: in the embodiments of the invention, the task of coding medical concepts in clinical medical texts is converted into a sequence generation problem, a hierarchical word list is introduced to strengthen the relationships among medical terms, and during sequence generation the standard medical terms corresponding to the clinical medical text are accurately determined and automatically coded according to the hierarchical word list. This addresses the high cost, limited efficiency and low accuracy of the prior art, in which medical concepts in clinical medical texts are manually mapped to standard medical term codes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flow chart of a method for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the framework of the Seq2Seq model provided in the embodiment of the present invention.
Fig. 3 is a block connection diagram of an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies according to an embodiment of the present invention.
Fig. 4 is a schematic block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that, if directional indications (such as up, down, left, right, front, and back … …) are involved in the embodiment of the present invention, the directional indications are only used to explain the relative positional relationship between the components, the movement situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications are changed accordingly.
With the rapid development and widespread application of information technologies such as the internet, big data, cloud computing and artificial intelligence, human production and life have been influenced to an unprecedented degree. In recent years, information technology has been increasingly applied to all aspects of social life. Across industries, information technology has changed the way people manage, analyze and use data, and fields such as the economy and culture all rely on it. Among these application fields, medical treatment is an important one with great potential. The medical field involves a large amount of information processing, and this medical information has the following characteristics:
1) the data volume is large and the growth speed is high;
2) the demand for sharing is high.
The greatest advantage of information technology lies in the efficiency of data processing, so information technology is now widely applied in the medical field, which has given rise to medical information processing as a direction of computer application. Medical information processing means combining computer technology with the needs of the medical and health industry, so as to meet the needs of medical institutions and related departments for collecting, organizing, storing and analyzing medical and health information, improve the efficiency of the health industry, and meet the functional requirements of customers. Medical information processing technology improves both the efficiency and the accuracy of medical information processing, and has brought the development of medical informatization to a new level. For a long time, how to use medical information processing technology to effectively raise the level of medical care and promote medical development has been one of the active research problems for scholars in related fields.
Automatic coding of medical concepts is an important research direction in the field of medical information processing. Clinical medical text generally refers to textual data, written by medical staff during medical activities, that describes the clinical presentation of a patient and may contain several medically related concepts. In a medical information system, the same standard medical term may have a plurality of different medical concept expressions. First, medical workers differ in their recording styles and sometimes, in pursuit of work efficiency, record medical texts that contain many synonyms, acronyms, foreign-language terms or colloquial expressions. Therefore, in clinical medical texts, it is common for one term to correspond to multiple expressions. For example, in Chinese clinical medical text, the term "congenital scoliosis" can be written in several near-synonymous forms, such as "congenital scoliosis deformity"; in English clinical medical text, "heart attack", "MI" and "myocardial attack" may all represent the meaning of "myocardial infarction". Second, in some cases, multiple diagnosis-related or symptom-related medical concepts are closely related and easily confused, and the same medical concept in a clinical medical text may correspond to different standard medical terms depending on the context. For example, in Chinese clinical medical text, the diagnosis-related medical concept "nasopharyngeal fistula" may correspond to the standard medical term "sinus fistula" or to the standard medical term "pharyngeal fistula" depending on the context information. This non-uniformity and inaccuracy of expression seriously hinders the integration, sharing and utilization of medical big data, and brings great inconvenience to clinical practice, teaching and scientific research in the medical field.
A medical code is a numeric and alphabetic tagging system that provides a unique coded representation for each diagnosis, symptom, or combination of symptoms, etc. Uniformly normalizing the medical concepts in clinical medical texts to the codes of the corresponding standard medical terms in a medical coding system is therefore particularly urgent in advancing medical informatization. Some medical institutions currently map medical concepts in clinical medical texts to standard medical term codes by manual coding. In this process, the coding personnel need to review the medical concepts or other relevant information in the clinical medical text and then manually assign the appropriate standard medical term codes to these medical concepts according to the coding guidelines. Since medical institutions generate massive amounts of text every day, manual coding requires a large number of professionals with medical knowledge, and is high in cost, limited in efficiency and low in accuracy.
In view of these defects of the prior art, the invention provides a method for automatically coding medical concepts by combining sequence generation with a hierarchical word list. The task of coding medical concepts in clinical medical texts is converted into a sequence generation problem, a hierarchical word list is introduced to strengthen the relationships among medical terms, and during sequence generation the standard medical terms corresponding to the clinical medical text are accurately determined and automatically coded according to the hierarchical word list. This addresses the high cost, limited efficiency and low accuracy of the prior art, in which medical concepts in clinical medical texts are manually mapped to standard medical term codes.
As shown in fig. 1, the present embodiment provides a method for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabularies, the method comprising the steps of:
s100, obtaining a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text.
In order to automatically encode the medical concepts in a clinical medical text, this embodiment first needs to acquire the clinical medical text to be encoded. Since the main objective of this embodiment is to use computer technology to automatically map medical concepts to the encoded data corresponding to standard medical terms, the clinical medical text must first be processed into a form on which the computer can perform calculations. Specifically, as shown in fig. 2, this embodiment employs a Seq2Seq model as the main generation framework; a Seq2Seq model is suited to tasks in which the length of the output is not known in advance. For example, in machine translation, when a Chinese sentence is translated into English, the English translation may be shorter or longer than the Chinese sentence, so the output length is uncertain, and the Seq2Seq model fits exactly this case. The Seq2Seq model in this embodiment comprises an encoder and a decoder. The encoder, which may be a bidirectional LSTM, is responsible for converting the input text data into vector form, and this vector can be regarded as a semantic vector of the input text. Specifically, after a clinical medical text is acquired, it is input into the encoder of the preset Seq2Seq model, and the encoder encodes the text into vector form, thereby producing the initial vector data of the clinical medical text.
In one implementation, the step S100 specifically includes the following steps:
step S110, inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
and step S120, inputting the mapping data into an encoder, and acquiring initial vector data of the clinical medical text generated by the encoder based on the mapping data.
To obtain the initial vector data, this embodiment first inputs the clinical medical text into a word embedding layer. The word embedding layer maps the features of the words in the text to a lower-dimensional space and outputs the mapping data, so that the model has fewer parameters and trains faster. The mapping data is then input into the encoder, which encodes it and generates the initial vector data. For example, in this embodiment a bidirectional LSTM is used as the encoder. At the encoder end, a clinical medical text
$$x = (x_1, x_2, \dots, x_n)$$
is first obtained, in which the words are mapped to vector representations by the word embedding layer:
$$e = (e_1, e_2, \dots, e_n)$$
Then $e$ is encoded using the bidirectional LSTM, resulting in the hidden layer representations:
$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}\left(e_i, \overrightarrow{h_{i-1}}\right)$$
$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}\left(e_i, \overleftarrow{h_{i+1}}\right)$$
Since the encoder used in this embodiment is a bidirectional LSTM, the two hidden layer representations output by the bidirectional LSTM need to be spliced (concatenated) to obtain the final hidden layer representation:
$$h_i = \left[\overrightarrow{h_i}; \overleftarrow{h_i}\right]$$
where
$$H = (h_1, h_2, \dots, h_n)$$
is the initial vector data of the clinical medical text.
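The encoder flow described above (word embedding, a left-to-right pass, a right-to-left pass, and concatenation of the two hidden representations per position) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the embodiment's implementation: a plain tanh recurrent cell stands in for the LSTM cell to keep the example short, the weights are random rather than trained, and the vocabulary, dimensions and all function names are hypothetical.

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (illustrative initialisation)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_step(x, h, w_in, w_rec):
    """One step of a plain tanh recurrent cell (a stand-in for an LSTM cell)."""
    return [
        math.tanh(sum(w_in[j][k] * x[k] for k in range(len(x)))
                  + sum(w_rec[j][k] * h[k] for k in range(len(h))))
        for j in range(len(h))
    ]

def encode(tokens, vocab, emb, fwd, bwd, hidden):
    """Embed tokens, run both directions, and concatenate per position."""
    embedded = [emb[vocab[t]] for t in tokens]         # word embedding layer
    h, forward = [0.0] * hidden, []
    for e in embedded:                                  # left-to-right pass
        h = rnn_step(e, h, *fwd)
        forward.append(h)
    h, backward = [0.0] * hidden, []
    for e in reversed(embedded):                        # right-to-left pass
        h = rnn_step(e, h, *bwd)
        backward.append(h)
    backward.reverse()                                  # realign with positions
    return [f + b for f, b in zip(forward, backward)]   # concatenation

rng = random.Random(0)
vocab = {"congenital": 0, "scoliosis": 1, "deformity": 2}   # hypothetical vocabulary
emb_dim, hidden = 4, 3
emb = make_matrix(len(vocab), emb_dim, rng)
fwd = (make_matrix(hidden, emb_dim, rng), make_matrix(hidden, hidden, rng))
bwd = (make_matrix(hidden, emb_dim, rng), make_matrix(hidden, hidden, rng))
states = encode(["congenital", "scoliosis", "deformity"], vocab, emb, fwd, bwd, hidden)
```

Each entry of `states` has twice the hidden size because it joins the forward and backward representations of the same position, which corresponds to the splicing step in the text.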
As shown in fig. 1, the method further comprises the steps of:
s200, acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list.
In order to implement the automatic encoding of medical concepts, this embodiment also needs to acquire pre-constructed hierarchical vocabulary data, that is, vocabulary data containing the hierarchical relationships among the standard medical terms. Since this embodiment relies on a deep learning model, the standard medical terms contained in the hierarchical vocabulary must also be converted into a vector format that the deep learning model can process. The hierarchical vocabulary data is therefore input into a preset learning algorithm to obtain the standard medical term vector data of the hierarchical vocabulary.
In one implementation, the step S200 specifically includes the following steps:
step S210, acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information;
step S220, acquiring the parent node, the child node and the parent-child relationship information between the parent node and the child node, and constructing hierarchical vocabulary data according to the parent node, the child node and the parent-child relationship information between the parent node and the child node;
step S230, inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent node, the child nodes and the parent-child relationship information;
and step S240, taking vector data representing the parent node, the child nodes and the parent-child relationship information as standard medical term vector data of the hierarchical vocabulary.
First, this embodiment pre-constructs a hierarchical vocabulary. Specifically, in order to construct the hierarchical vocabulary, this embodiment acquires the encoding information of the standard medical term data in the term dictionary data, and divides the standard medical term data into parent nodes and child nodes according to the encoding information, so as to determine the inclusion relationships among the standard medical terms. In one implementation, the encoding information may include alphabetic field information and numeric field information; for example, a standard medical term may correspond to the code J001. Each standard medical term data is taken as a node. Nodes whose alphabetic field information is identical and whose numeric field information has the same digits before a preset position are treated as nodes of the same type. Then, among nodes of the same type, the node with the shortest numeric field information is taken as the parent node, and the remaining nodes are taken as child nodes.
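The grouping rule above can be illustrated with a short Python sketch. It assumes, purely for illustration, that every code consists of letters followed by digits and that the "preset position" is the second digit (`prefix_len=2`); the function names, the parameter and the sample codes are hypothetical and not taken from any particular coding system.

```python
import re
from collections import defaultdict

def split_code(code):
    """Split a code such as 'J001' into its alphabetic field and numeric field."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", code)
    if match is None:
        raise ValueError(f"unexpected code format: {code!r}")
    return match.group(1), match.group(2)

def group_parent_children(codes, prefix_len=2):
    """Map each parent code to its child codes.

    Codes with the same alphabetic field and the same first `prefix_len`
    digits form one type; the code with the shortest numeric field in a
    type is the parent, and the rest are its children.
    """
    groups = defaultdict(list)
    for code in codes:
        letters, digits = split_code(code)
        groups[(letters, digits[:prefix_len])].append(code)
    result = {}
    for members in groups.values():
        parent = min(members, key=lambda c: len(split_code(c)[1]))
        result[parent] = sorted(c for c in members if c != parent)
    return result

codes = ["J00", "J001", "J002", "K35", "K358"]   # hypothetical codes
hierarchy = group_parent_children(codes)
# hierarchy == {"J00": ["J001", "J002"], "K35": ["K358"]}
```

The rule only yields a two-level split per type; deeper hierarchies need the tree construction described next in the text.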
In one implementation, this embodiment may design an algorithm to determine the parent node and the child node, and the structure formed by the parent node and the child node is a tree-shaped hierarchical structure, where the specific algorithm is as follows:
A. Define the data structure of a node in the tree; each tree node comprises two parts: a code string and a list of child nodes. Go to step B.
B. Initialize the root node of the tree, whose code is an empty string and whose child node list is empty. Go to step C.
C. If the standard medical term dictionary is empty, the algorithm ends. Otherwise, take a (code, term) pair from the standard medical term dictionary and go to step D.
D. Set the current node to the root node. If the code of the current node is a prefix (alphabetic field information) of the taken code and the two codes are different, set the loop-exit flag to false and go to step E; otherwise, go to step H.
E. If the child node list of the current node is empty, go to step G. Otherwise, take a child node and go to step F.
F. If the code of the child node is a prefix of the taken code and the two codes are different, set the current node to that child node, set the loop-exit flag to true, and go to step F; otherwise, go to step E.
G. If the loop-exit flag is true, go to step C. Otherwise, go to step D.
H. Initialize a new node whose code is the taken code and whose child node list is empty. Add the new node to the child node list of the current node, and go to step C.
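One way to read steps A to H is as a prefix-tree insertion: for each code, descend from the root to the deepest node whose code is a proper prefix of the new code, then attach the new node there. The Python sketch below follows that reading rather than the literal jump structure of the steps. It also assumes that codes are processed from shortest to longest so that a parent is inserted before its children, an ordering the text does not state; `TreeNode`, `build_tree`, `paths` and the sample dictionary are hypothetical names and data.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    code: str                                      # code string ("" for the root)
    children: list = field(default_factory=list)   # list of child TreeNode objects

def build_tree(term_dict):
    """Build a prefix hierarchy from a {code: term} dictionary."""
    root = TreeNode(code="")
    # Assumption: shorter codes first, so a parent exists before its children.
    for code in sorted(term_dict, key=len):
        current = root
        descended = True
        while descended:
            descended = False
            for child in current.children:
                # Descend when the child's code is a proper prefix of the new code.
                if code != child.code and code.startswith(child.code):
                    current = child
                    descended = True
                    break
        current.children.append(TreeNode(code=code))
    return root

def paths(node, prefix=()):
    """List every root-to-node code path, for inspection."""
    out = []
    for child in node.children:
        here = prefix + (child.code,)
        out.append(here)
        out.extend(paths(child, here))
    return out

terms = {"J00": "term A", "J001": "term B", "J0011": "term C", "K35": "term D"}
tree = build_tree(terms)
# paths(tree) lists J00, then J00 > J001, then J00 > J001 > J0011, then K35
```

The parent-child edges of the resulting tree are exactly the relationship information that the next step feeds to the learning algorithm.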
After the parent nodes and child nodes are determined, the parent nodes, the child nodes and the parent-child relationship information between them are acquired, and hierarchical vocabulary data is constructed from them. The hierarchical vocabulary data is then input into a preset learning algorithm, such as the TransE algorithm, to obtain vector data representing the parent nodes, the child nodes and the parent-child relationship information. Finally, this vector data is taken as the standard medical term vector data of the hierarchical vocabulary. In short, this embodiment represents the standard medical terms as low-dimensional dense vectors and captures the inclusion relationships among the standard medical term vector data, so that standard medical terms with similar semantics are placed close together in the vector space. In addition, by constructing a hierarchical vocabulary, this embodiment determines the dependency or hierarchical relationship of each standard medical term, so that the hierarchical relationships among the standard medical terms are clearer, the standard medical terms corresponding to the medical concepts in the clinical medical text can be screened out more reliably, and the goal of automatically coding medical concepts is ultimately achieved.
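To make the role of TransE concrete, the following is a minimal pure-Python sketch of its core idea: for a true triple (head, relation, tail), the vector head + relation should lie close to tail, and training pushes true triples to score better than corrupted ones by a margin. The relation name `parent_of`, the hyperparameters, the sample triples and the function names are assumptions for illustration, and the norm constraint on entity vectors used in the original TransE formulation is omitted for brevity.

```python
import random

def l2_sq(h, r, t):
    """Squared L2 distance ||h + r - t||^2, used as the TransE dissimilarity."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))

def train_transe(triples, entities, dim=8, margin=1.0, lr=0.05, epochs=200, seed=0):
    """Learn entity and relation vectors with a margin-based ranking loss."""
    rng = random.Random(seed)
    known = set(triples)
    ent = {e: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for e in entities}
    rel = {r: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for _, r, _ in triples}
    for _ in range(epochs):
        for h, r, t in triples:
            # Corrupt the tail with an entity that does not form a known triple.
            candidates = [e for e in entities if e != h and (h, r, e) not in known]
            t_neg = rng.choice(candidates)
            pos = l2_sq(ent[h], rel[r], ent[t])
            neg = l2_sq(ent[h], rel[r], ent[t_neg])
            if margin + pos - neg <= 0:
                continue                      # margin already satisfied
            for k in range(dim):
                g_pos = 2 * (ent[h][k] + rel[r][k] - ent[t][k])
                g_neg = 2 * (ent[h][k] + rel[r][k] - ent[t_neg][k])
                ent[h][k] -= lr * (g_pos - g_neg)
                rel[r][k] -= lr * (g_pos - g_neg)
                ent[t][k] += lr * g_pos       # pull the true tail closer
                ent[t_neg][k] -= lr * g_neg   # push the corrupted tail away
    return ent, rel

entities = ["J00", "J001", "J002", "K35", "K358"]        # hypothetical codes
triples = [("J00", "parent_of", "J001"),
           ("J00", "parent_of", "J002"),
           ("K35", "parent_of", "K358")]
ent_vecs, rel_vecs = train_transe(triples, entities)
```

After training, `ent_vecs` plays the role of the standard medical term vector data: each code has a dense vector whose position reflects its parent-child relationships.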
As shown in fig. 1, the method further comprises the steps of:
step S300, inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating coded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
The Seq2Seq model used in this embodiment further includes a decoder, and in one implementation, a unidirectional LSTM may be used as the decoder. Specifically, after the initial vector data and the standard medical term vector data of the clinical medical text are acquired, the two data need to be input into a decoder, after the decoder acquires the two data, encoded data corresponding to a plurality of standard medical terms are sequentially generated, and the standard medical term sequence data corresponding to the clinical medical text can be formed according to the encoded data. In short, when the medical concept in the clinical medical text is coded, the present embodiment refers to not only the hierarchical relationship between the standard medical terms, but also the sequence generated by the plurality of standard medical terms, and finally determines the correct code corresponding to the medical concept in the clinical medical text.
In one implementation, the decoder includes a classifier, and the classifier includes a plurality of labels of standard medical terms, and the step S300 specifically includes the following steps:
step S310, acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
step S320, determining the coded data output by the decoder at the current time step corresponding to the clinical medical text by the classifier based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
step S330, forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
In order to realize automatic coding of the medical concepts in the clinical medical text, this embodiment first obtains sequence data composed of all historical standard medical term vector data output by the decoder; that is, the sequence data is the standard medical term vector data corresponding to the codes output by the decoder before the current time step. Then, in order to determine the standard medical term vector data with the highest similarity to the clinical medical text at the current time step, this embodiment inputs the initial vector data and the sequence data into a classifier. The classification space of the classifier can be regarded as a label set in which each label corresponds to one standard medical term. The classifier determines, based on the initial vector data and the sequence data, the coded data that the decoder outputs at the current time step for the clinical medical text. Specifically, a probability function is provided in the classifier. To determine the coded data output by the decoder at the current time step, the initial vector data is first fused with the vector data corresponding to the sequence data to obtain fused vector data. The fused vector data is then input into the probability function, which generates probability values for a plurality of possible coded data. These probability values indicate, to some extent, the degree of correlation between each candidate coded data and the clinical medical text; that is, this embodiment introduces an attention mechanism to determine the standard medical terms corresponding to the clinical medical text.
For example, on the decoder side, the hidden-layer representation $s_t$ of the decoder at time $t$ is calculated as follows, where $g(y_{t-1})$ is the vector representation of the standard medical term corresponding to the code with the maximum probability in the output distribution $y_{t-1}$ at time $t-1$, namely the vector representation learned by the TransE algorithm:

$$s_t = \mathrm{LSTM}\left(s_{t-1},\, g(y_{t-1})\right)$$

When calculating the probability, an attention mechanism is used to obtain the correlation coefficients between $s_t$ and the clinical medical text. Specifically, the attention mechanism is calculated as follows, where $c_t$ denotes the vector obtained by the attention mechanism at time $t$:

$$e_{t,i} = v_a^{\top} \tanh\left(W_a s_t + U_a h_i\right)$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}$$

$$c_t = \sum_{i} \alpha_{t,i}\, h_i$$

where $e_{t,i}$ denotes the degree of correlation between $s_t$ and the $i$-th word in the clinical medical text, $\alpha_{t,i}$ denotes the coefficient of the degree of correlation between $s_t$ and the $i$-th word in the clinical medical text, $h_i$ is the encoder representation of the $i$-th word, and $v_a$, $W_a$, $U_a$ are all learning parameters.

The probability distribution is then output through the classifier:

$$o_t = W_o\, f\left(W_d s_t + V_d c_t\right)$$

$$y_t = \mathrm{softmax}(o_t)$$

where $o_t$ denotes the result of fusing $s_t$ and $c_t$ at time $t$ through a nonlinear activation function $f$ (such as tanh, ReLU, etc.), $y_t$ denotes the output distribution on the decoder side at time $t$, and $W_o$, $W_d$, $V_d$ are all learning parameters.
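The attention and classifier steps described above can be sketched as follows. This is a minimal illustration that assumes additive attention over the encoder representations of the words and a softmax classifier over the label set; the function names, the parameter names (`W_a`, `U_a`, `v_a`, `W_d`, `V_d`, `W_o`), and the use of plain Python lists are assumptions for illustration and are not taken from the patent.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(s_t, H, W_a, U_a, v_a):
    """Return (weights, context) for decoder state s_t over encoder states H."""
    Ws = matvec(W_a, s_t)
    scores = []
    for h_i in H:
        Uh = matvec(U_a, h_i)
        # Correlation score between s_t and the i-th word:
        # v_a . tanh(W_a s_t + U_a h_i)
        scores.append(sum(v * math.tanh(a + b) for v, a, b in zip(v_a, Ws, Uh)))
    alpha = softmax(scores)  # normalized correlation coefficients
    # Context vector: weighted sum of the encoder states.
    context = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(H[0]))]
    return alpha, context


def output_distribution(s_t, c_t, W_d, V_d, W_o):
    """Fuse s_t and c_t with tanh, then map to a distribution over labels."""
    fused = [math.tanh(a + b) for a, b in zip(matvec(W_d, s_t), matvec(V_d, c_t))]
    return softmax(matvec(W_o, fused))
```

Each row of `W_o` corresponds to one label, that is, one standard medical term, so the returned list has one probability per candidate code.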
The obtained probability values are then sorted by value, and the coded data with the largest probability value is taken as the coded data output by the decoder at the current time step. This process is repeated until no more coded data can be generated. Finally, the standard medical term sequence data corresponding to the clinical medical text is formed from the coded data, so that the clinical medical text is coded automatically.
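The selection loop described above can be sketched as follows. This is a hedged illustration, not the patented implementation: `step_probs` is a hypothetical callback standing in for the decoder plus classifier, and stopping on a hypothetical end-of-sequence label `EOS` is an assumption about how "no more coded data can be generated" is detected.

```python
EOS = "<eos>"  # hypothetical end-of-sequence label (an assumption)


def greedy_decode(step_probs, max_steps=10):
    """Repeatedly pick the most probable code until EOS or max_steps.

    step_probs(history) must return a dict mapping each candidate code to
    its probability, given the list of codes output so far.
    """
    history = []
    for _ in range(max_steps):
        probs = step_probs(history)
        # Sort the candidates by probability, largest first.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        best_code = ranked[0][0]
        if best_code == EOS:
            break
        history.append(best_code)
    return history


def toy_step(history):
    """Toy stand-in for the decoder; codes and probabilities are made up."""
    table = {
        (): {"C1": 0.6, "C2": 0.3, EOS: 0.1},
        ("C1",): {"C1": 0.1, "C2": 0.7, EOS: 0.2},
    }
    return table.get(tuple(history), {EOS: 1.0})
```

With the toy callback, `greedy_decode(toy_step)` selects `C1` at the first step and `C2` at the second, then stops when the end-of-sequence label becomes the most probable output.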
In summary, although the prior art includes techniques that code medical concepts with machine learning methods, most of them generate codes with a greedy search strategy, which at each time step on the decoder side keeps only the standard medical term vector with the highest probability, so the search space is relatively limited. This embodiment instead adopts beam search: at each time step, the several sequences with the highest probability are kept as candidate sequences, and the sequence with the highest probability is finally selected as the target sequence.
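The difference between the two strategies can be illustrated with a small beam search sketch. The code below is an assumption-laden illustration rather than the patented implementation: `step_probs` is a hypothetical callback returning a probability for each candidate code given the codes chosen so far, `EOS` is a hypothetical end-of-sequence label, and the toy probabilities are made up.

```python
import math

EOS = "<eos>"  # hypothetical end-of-sequence label (an assumption)


def beam_search(step_probs, beam_width=2, max_steps=10):
    """Keep the beam_width most probable partial sequences at each step."""
    beams = [([], 0.0)]  # (sequence, accumulated log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            for code, p in step_probs(seq).items():
                if p <= 0.0:
                    continue
                if code == EOS:
                    finished.append((seq, logp + math.log(p)))
                else:
                    candidates.append((seq + [code], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_steps
    if not finished:
        return []
    return max(finished, key=lambda c: c[1])[0]


def toy_step(history):
    """Toy stand-in for the decoder; codes and probabilities are made up."""
    table = {
        (): {"C1": 0.6, "C2": 0.4},
        ("C1",): {"C3": 0.55, "C4": 0.45},
        ("C2",): {"C5": 0.9, EOS: 0.1},
    }
    return table.get(tuple(history), {EOS: 1.0})
```

With `beam_width=1` the search degenerates to greedy search and returns `["C1", "C3"]` (probability 0.33), whereas `beam_width=2` also keeps the initially less probable `C2` and finds `["C2", "C5"]` (probability 0.36), which shows why the wider search space can yield a better target sequence.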
Based on the above embodiment, the present invention further provides an apparatus for automatically encoding medical concepts in combination with sequence generation and hierarchical vocabulary, as shown in fig. 3, the apparatus comprising:
the acquisition module 01 is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder, and acquiring initial vector data generated by the encoder based on the clinical medical text;
the learning module 02 is used for acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list;
the encoding module 03 is configured to input the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generate encoded data corresponding to a plurality of standard medical terms, and form standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
Based on the above embodiments, the present invention further provides a terminal, and a schematic block diagram thereof may be as shown in fig. 4. The terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein the processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a method of automatically encoding medical concepts in conjunction with sequence generation and hierarchical vocabularies. The display screen of the terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 4 shows only the part of the structure related to the solution of the invention and does not limit the terminals to which the solution may be applied; a particular terminal may include more or fewer components than those shown, may combine some components, or may have a different arrangement of components.
In one implementation, one or more programs are stored in the memory of the terminal and configured to be executed by one or more processors; the one or more programs include instructions for performing the method for automatic coding of medical concepts in combination with sequence generation and a hierarchical vocabulary.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In summary, the invention discloses a method and a device for automatic coding of medical concepts in combination with sequence generation and a hierarchical vocabulary. The method converts the task of coding medical concepts in clinical medical texts into a sequence generation problem, introduces a hierarchical vocabulary to strengthen the relationships among medical terms, and, during sequence generation, uses the hierarchical vocabulary to accurately determine the standard medical terms corresponding to the clinical medical text and code them automatically. This solves the problems of the prior art, in which medical concepts in clinical medical texts are mapped to standard medical term codes by manual coding, at high cost, with limited efficiency and low accuracy.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (8)

1. A method for automatic coding of medical concepts in conjunction with sequence generation and hierarchical vocabularies, the method comprising:
acquiring a clinical medical text, and inputting the clinical medical text into a preset encoder to obtain initial vector data of the clinical medical text;
acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm, and acquiring standard medical term vector data of the hierarchical word list;
inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating coded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data;
the acquiring of the pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm, and acquiring the standard medical term vector data of the hierarchical vocabulary comprises:
acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information;
acquiring the parent node, the child node and the parent-child relationship information between the parent node and the child node, and constructing hierarchical vocabulary data according to the parent node, the child node and the parent-child relationship information between the parent node and the child node;
inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent node, the child nodes and the parent-child relationship information;
and taking vector data representing the parent node, the child nodes and the parent-child relationship information as standard medical term vector data of the hierarchical vocabulary.
2. The method of claim 1, wherein the obtaining clinical medical texts, inputting the clinical medical texts into a preset encoder, and obtaining initial vector data of the clinical medical texts comprises:
inputting a clinical medical text into a word embedding layer, and mapping the clinical medical text through the word embedding layer to obtain mapping data;
inputting the mapping data into an encoder, and acquiring initial vector data of the clinical medical text generated by the encoder based on the mapping data.
3. The method of claim 1, wherein the encoded information comprises both alphabetic field information and numeric field information.
4. The method as claimed in claim 3, wherein the obtaining of the encoding information of the standard medical term data in the term dictionary data, and the dividing of the standard medical term data into parent nodes and child nodes according to the encoding information comprises:
taking each standard medical term data as a node;
taking a plurality of nodes whose numeric field information has the same digits before a preset position as nodes of the same type, wherein the letter field information of all such nodes is of the same type;
and, among nodes of the same type, taking the node with the shortest numeric field information as the parent node and taking the nodes other than the parent node as child nodes.
5. The method as claimed in claim 1, wherein the decoder comprises a classifier containing labels of a plurality of standard medical terms, the inputting the initial vector data of the clinical medical text and the generated vector data of the standard medical terms into a preset decoder sequentially generates encoded data corresponding to a plurality of standard medical terms, and the forming of the sequence data of the standard medical terms corresponding to the clinical medical text from the encoded data comprises:
acquiring sequence data consisting of all historical standard medical term vector data output by the decoder; the sequence data is standard medical term vector data corresponding to codes output by the decoder before the current time step;
determining, by the classifier, encoded data output by the decoder at a current time step corresponding to the clinical medical text based on the initial vector data and the sequence data; this process is repeated until no encoded data can be generated;
and forming standard medical term sequence data corresponding to the clinical medical text according to the coded data.
6. The method of claim 5, wherein the classifier comprises a probability function, and the determining, by the classifier, the encoded data output by the decoder at the current time step corresponding to the clinical medical text based on the initial vector data and the sequence data comprises:
fusing the initial vector data with vector data corresponding to the sequence data to obtain fused vector data;
inputting the fusion vector data into the probability function, and acquiring probability values of a plurality of possible coded data generated by the probability function based on the fusion vector data;
and sequencing the probability values according to the numerical values, and taking the coded data with the maximum probability value as the coded data output by the decoder at the current moment.
7. An apparatus for automatic coding of medical concepts in conjunction with sequence generation and hierarchical vocabularies, the apparatus comprising:
the acquisition module is used for acquiring a clinical medical text, inputting the clinical medical text into a preset encoder and acquiring initial vector data generated by the encoder based on the clinical medical text;
the learning module is used for acquiring pre-constructed hierarchical word list data, inputting the hierarchical word list data into a preset learning algorithm and acquiring standard medical term vector data of the hierarchical word list; the acquiring of the pre-constructed hierarchical vocabulary data, inputting the hierarchical vocabulary data into a preset learning algorithm, and acquiring the standard medical term vector data of the hierarchical vocabulary comprises: acquiring coding information of standard medical term data in term dictionary data, and dividing the standard medical term data into a parent node and a child node according to the coding information; acquiring the parent node, the child node and the parent-child relationship information between the parent node and the child node, and constructing hierarchical vocabulary data according to the parent node, the child node and the parent-child relationship information between the parent node and the child node; inputting the hierarchical vocabulary data into a preset learning algorithm to obtain vector data representing the parent node, the child nodes and the parent-child relationship information; using vector data representing the parent node, the child nodes and the parent-child relationship information as standard medical term vector data of the hierarchical vocabulary;
and the encoding module is used for inputting the initial vector data of the clinical medical text and the generated standard medical term vector data into a preset decoder, sequentially generating encoded data corresponding to a plurality of standard medical terms, and forming standard medical term sequence data corresponding to the clinical medical text according to the encoded data.
8. A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to perform the steps of a method for automatic encoding of medical concepts in conjunction with sequence generation and hierarchical vocabularies of any of the preceding claims 1-6.
CN202110597714.5A 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists Active CN113033155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597714.5A CN113033155B (en) 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597714.5A CN113033155B (en) 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists

Publications (2)

Publication Number Publication Date
CN113033155A CN113033155A (en) 2021-06-25
CN113033155B true CN113033155B (en) 2021-10-26

Family

ID=76455886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597714.5A Active CN113033155B (en) 2021-05-31 2021-05-31 Automatic coding method for medical concepts by combining sequence generation and hierarchical word lists

Country Status (1)

Country Link
CN (1) CN113033155B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408820A (en) * 2018-10-17 2019-03-01 长沙瀚云信息科技有限公司 A kind of medical terminology mapped system and method, equipment and storage medium
CN110705214A (en) * 2019-08-27 2020-01-17 天津开心生活科技有限公司 Automatic coding method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182976A (en) * 2017-12-28 2018-06-19 西安交通大学 A kind of clinical medicine information extracting method based on neural network
CN109299273B (en) * 2018-11-02 2020-06-23 广州语义科技有限公司 Multi-source multi-label text classification method and system based on improved seq2seq model
CN110827929B (en) * 2019-11-05 2022-06-07 中山大学 Disease classification code recognition method and device, computer equipment and storage medium
CN111063446B (en) * 2019-12-17 2023-06-16 医渡云(北京)技术有限公司 Method, apparatus, device and storage medium for standardizing medical text data
CN112802568A (en) * 2021-02-03 2021-05-14 紫东信息科技(苏州)有限公司 Multi-label stomach disease classification method and device based on medical history text

Also Published As

Publication number Publication date
CN113033155A (en) 2021-06-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant