CN116757207A - ICD automatic coding method based on artificial intelligence and related equipment

Info

Publication number
CN116757207A
Authority
CN
China
Prior art keywords
icd
entity
term
coding
diagnosis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310463927.8A
Other languages
Chinese (zh)
Inventor
苏国辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310463927.8A priority Critical patent/CN116757207A/en
Publication of CN116757207A publication Critical patent/CN116757207A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The application provides an ICD automatic coding method, device, electronic equipment and storage medium based on artificial intelligence. The method comprises the following steps: constructing an atlas embedding layer based on different entity categories in the medical knowledge graph; constructing an ICD coding initial model based on the atlas embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises the atlas embedding layer; collecting a plurality of sample pairs of diagnosis terms and ICD terms as a training set; training the ICD coding initial model on the training set to obtain an ICD coding target model, whose input is a diagnosis term and an ICD term and whose output is the similarity between the two; and inputting the diagnosis term to be coded into the ICD coding target model, calculating its similarity to each ICD term, and taking the ICD term with the maximum similarity as the coding result of the diagnosis term to be coded. The application can improve the accuracy of ICD coding in the digital medical field.

Description

ICD automatic coding method based on artificial intelligence and related equipment
Technical Field
The application relates to the technical field of artificial intelligence and digital medical treatment, in particular to an ICD automatic coding method and related equipment based on artificial intelligence.
Background
The International Classification of Diseases (ICD) is an international standard for classifying diseases and related health problems and an important component of the health information standard system. Disease classification groups diseases according to certain rules based on certain characteristics of the diseases; in practice, a group sometimes contains several diseases of the same or similar nature and sometimes only a single disease. The ICD groups diseases by means of coding, with different disease categories corresponding to different ICD codes. ICD codes not only unify and standardize disease names but also reflect national health conditions. Disease classification is also an important basis for medical insurance auditing and payment.
At present, text information is typically obtained by performing image recognition on diagnostic information written by a clinician, and a natural language processing model directly matches that text against all ICD terms to obtain an ICD coding result. However, diagnostic information written by clinicians often contains misspelled words, colloquial expressions, common (non-standard) names and the like, and the resulting text is short. Lacking context, the natural language processing model has difficulty extracting high-quality text features, which reduces the accuracy of ICD coding.
Disclosure of Invention
In view of the foregoing, it is necessary to propose an ICD automatic coding method based on artificial intelligence and a related device, so as to solve the technical problem of how to improve the accuracy of ICD coding, where the related device includes an ICD automatic coding device based on artificial intelligence, an electronic device and a storage medium.
The application provides an ICD automatic coding method based on artificial intelligence, which comprises the following steps:
constructing an atlas embedding layer based on at least one entity category in the medical knowledge atlas;
constructing an ICD coding initial model based on the map embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises a text embedding layer and the map embedding layer;
collecting a plurality of sets of sample pairs of diagnostic terms and ICD terms as a training set;
training the ICD coding initial model based on the training set to obtain the ICD coding target model, wherein the input of the ICD coding target model is a diagnosis term and an ICD term, and the output of the ICD coding target model is the similarity between the diagnosis term and the ICD term;
inputting diagnostic terms to be encoded into the ICD encoding target model, calculating the similarity between the diagnostic terms to be encoded and each ICD term, and taking the ICD term corresponding to the maximum value of the similarity as the encoding result of the diagnostic terms to be encoded.
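By way of illustration only, the final coding step can be sketched as an argmax over candidate ICD terms. The function names `encode_diagnosis` and `toy_similarity` are hypothetical, and `toy_similarity` (a character-overlap score) merely stands in for the trained ICD coding target model so that the sketch can run on its own.

```python
from typing import Callable, Sequence


def encode_diagnosis(
    diagnosis: str,
    icd_terms: Sequence[str],
    similarity: Callable[[str, str], float],
) -> str:
    """Return the ICD term whose similarity to `diagnosis` is highest."""
    if not icd_terms:
        raise ValueError("icd_terms must not be empty")
    return max(icd_terms, key=lambda term: similarity(diagnosis, term))


def toy_similarity(a: str, b: str) -> float:
    """Stand-in scorer (character Jaccard overlap), NOT the trained model."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


best = encode_diagnosis(
    "acute gastritis", ["chronic hepatitis", "acute gastritis"], toy_similarity
)
```

In a real deployment the `similarity` argument would be the forward pass of the trained model rather than a string heuristic.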
In some embodiments, collecting a plurality of sets of pairs of samples of diagnostic terms and ICD terms as a training set includes:
collecting an image with case information or diagnosis results, and extracting text information in the image by utilizing an optical character recognition technology to obtain diagnosis terms;
obtaining a coding result of any diagnosis term, wherein the coding result is an ICD term corresponding to the diagnosis term;
taking the diagnosis term and its coding result as a positive sample pair, and taking the diagnosis term and any ICD term other than the coding result as a negative sample pair;
multiple positive sample pairs and multiple negative sample pairs are collected and stored as training sets.
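A minimal sketch of the pair construction described above, assuming the diagnosis terms have already been extracted by OCR and manually coded. The name `build_training_set`, the integer labels, and the `negatives_per_positive` parameter are illustrative assumptions; the text does not state how many negative pairs are drawn per diagnosis term.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str, int]  # (diagnosis term, ICD term, label: 1 positive / 0 negative)


def build_training_set(
    coded: Dict[str, str],
    icd_terms: Sequence[str],
    negatives_per_positive: int = 1,
    seed: int = 0,
) -> List[Pair]:
    """Pair each diagnosis term with its coding result and with wrong ICD terms."""
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for diagnosis, correct in coded.items():
        pairs.append((diagnosis, correct, 1))  # positive sample pair
        wrong = [term for term in icd_terms if term != correct]
        count = min(negatives_per_positive, len(wrong))
        for term in rng.sample(wrong, count):  # negative sample pairs
            pairs.append((diagnosis, term, 0))
    return pairs


pairs = build_training_set({"stomach inflamed": "gastritis"},
                           ["gastritis", "hepatitis", "pancreatitis"],
                           negatives_per_positive=2)
```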
In some embodiments, the medical knowledge graph includes all entities and the association relationships between any of the entities, and building the graph embedding layer based on at least one entity class in the medical knowledge graph includes:
extracting all entities of the same entity category from the medical knowledge graph and the association relation between the entities to obtain a sub-knowledge graph of each entity category;
the input of the map embedding layer is the sub-knowledge graph of each entity category, and its output is a map embedding vector representing the medical knowledge in all the sub-knowledge graphs; the map embedding layer comprises a plurality of feature extraction layers and a feature fusion layer, and the feature extraction layers are in one-to-one correspondence with the entity categories;
The feature extraction layer is used for extracting features of sub-knowledge maps of corresponding entity categories to obtain coding vectors of each entity in the entity categories, and adding the coding vectors of all the entities to obtain medical feature vectors of the entity categories;
the feature fusion layer is used for fusing the medical feature vectors of all entity categories to obtain a map embedding vector, and the map embedding vector meets the relation:
$$H = \sum_{n=1}^{N} \alpha_n h_n + \beta$$
wherein $N$ is the number of all entity categories, $h_n$ is the medical feature vector of entity class $n$, $\alpha_n$ is the weight coefficient of entity class $n$, $\beta$ is a bias parameter, and $H$ is the map embedding vector.
In some embodiments, the building an ICD encoding initial model based on the atlas embedding layer includes:
the input layer is used for receiving any diagnosis term and any ICD term;
inputting the diagnostic term into the embedded layer to perform preprocessing operation to obtain first input data and second input data;
inputting the first input data into the atlas embedding layer to obtain an atlas embedding vector, inputting the second input data into the text embedding layer to obtain a text embedding vector, and splicing the atlas embedding vector and the text embedding vector to obtain a diagnosis vector of the diagnosis term;
Inputting the ICD term into the embedding layer to execute the preprocessing operation, and obtaining an ICD vector of the ICD term based on the text embedding layer and the map embedding layer;
and inputting the diagnosis vector and the ICD vector into the output layer, and outputting the similarity of the diagnosis term and the ICD term.
In some embodiments, the inputting the diagnostic term into the embedded layer to perform a preprocessing operation to obtain first input data and second input data includes:
extracting entities of different entity categories from the diagnostic terms based on the medical knowledge graph to construct an entity extraction set of each entity category;
judging whether each entity in the corresponding sub-knowledge graph is in the entity extraction set or not according to each entity category, if the entity is in the entity extraction set, reserving the entity in the sub-knowledge graph, and if the entity is not in the entity extraction set, replacing the entity in the sub-knowledge graph with an empty text;
after traversing all entities in the sub-knowledge maps corresponding to each entity category, obtaining a diagnosis sub-knowledge map of each entity category;
Taking the diagnosis sub-knowledge maps of all entity categories as first input data;
and replacing all entities in the diagnosis term with preset characters, and filling the preset characters at the tail of the diagnosis term to reach a preset length to obtain second input data.
In some embodiments, the extracting entities of different entity classes from the diagnostic term based on the medical knowledge-graph to construct an entity extraction set for each entity class comprises:
obtaining confusion entities of each entity in the medical knowledge graph;
constructing a dictionary tree based on the entities in the medical knowledge graph and the confusing entities;
querying the dictionary tree to obtain entities or confusing entities contained in the diagnosis terms, and classifying and storing the entities or confusing entities contained in the diagnosis terms according to entity categories to obtain an initial entity extraction set of each entity category;
and replacing all the confused entities in the initial entity extraction set of each entity category with the entities corresponding to the confused entities to obtain the entity extraction set of each entity category.
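The dictionary-tree lookup above can be sketched as follows. This is an illustrative simplification under stated assumptions: the names `TrieNode`, `build_trie` and `extract_entities` are hypothetical, matching is greedy longest-match, and each confusing entity is mapped to its standard entity at lookup time instead of in a separate replacement pass.

```python
from typing import Dict, Optional, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # (standard entity, entity category) if a surface form ends here
        self.entry: Optional[Tuple[str, str]] = None


def build_trie(entities: Dict[str, str], confusions: Dict[str, str]) -> TrieNode:
    """entities: standard entity -> category; confusions: confusing form -> standard entity."""
    root = TrieNode()
    surface_forms = {name: name for name in entities}
    surface_forms.update(confusions)
    for surface, standard in surface_forms.items():
        node = root
        for ch in surface:
            node = node.children.setdefault(ch, TrieNode())
        node.entry = (standard, entities[standard])
    return root


def extract_entities(text: str, root: TrieNode) -> Dict[str, Set[str]]:
    """Scan `text` and return the entity extraction set of each category."""
    found: Dict[str, Set[str]] = {}
    i = 0
    while i < len(text):
        node, best, j = root, None, i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.entry is not None:
                best = (j, node.entry)  # remember the longest match so far
        if best is None:
            i += 1
            continue
        end, (standard, category) = best
        found.setdefault(category, set()).add(standard)
        i = end
    return found


trie = build_trie(
    {"pancreatic duct": "body structure", "cyst": "disease core word"},
    {"pancreas duct": "pancreatic duct"},  # hypothetical confusing form
)
result = extract_entities("pancreas duct cyst", trie)
```

Here the misspelled form "pancreas duct" is recognized and stored under its standard entity "pancreatic duct".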
In some embodiments, the training the ICD encoding initial model based on the training set to obtain the ICD encoding target model includes:
Randomly selecting a preset number of sample pairs from the training set, wherein the sample pairs comprise the positive sample pairs and the negative sample pairs;
inputting each sample pair into the ICD coding initial model to obtain the prediction similarity of each sample pair;
calculating a value of a loss function based on the predicted similarity of each sample pair, the loss function satisfying the relationship:
wherein $M^{+}$ is the number of positive sample pairs, $sim_u$ is the predicted similarity of positive sample pair $u$, $M^{-}$ is the number of negative sample pairs, $sim_v$ is the predicted similarity of negative sample pair $v$, $\lVert sim_u - 1 \rVert_2$ denotes the distance between $sim_u$ and 1, and $Loss$ is the value of the loss function;
updating the ICD coding initial model according to a back propagation algorithm to reduce the value of the loss function;
and continuously randomly selecting sample pairs from the training set to update the ICD coding initial model, and stopping until the value of the loss function is smaller than a preset loss value to obtain the ICD coding target model.
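The training procedure above can be sketched as a sampling loop with a loss threshold. The exact loss relation is not reproduced in this text, so `pair_loss` is only one plausible form (an assumption): positives are pulled toward similarity 1 and negatives toward 0, each averaged over its own count. The `step` callback is a hypothetical placeholder for the forward pass, back propagation, and parameter update of the real model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str, int]  # (diagnosis term, ICD term, label: 1 positive / 0 negative)


def pair_loss(pos_sims: Sequence[float], neg_sims: Sequence[float]) -> float:
    """ASSUMED loss: mean squared gap to 1 for positives plus mean squared gap to 0 for negatives."""
    pos = sum((s - 1.0) ** 2 for s in pos_sims) / len(pos_sims) if pos_sims else 0.0
    neg = sum(s ** 2 for s in neg_sims) / len(neg_sims) if neg_sims else 0.0
    return pos + neg


def train(
    step: Callable[[List[Pair]], float],
    training_set: Sequence[Pair],
    batch_size: int,
    loss_threshold: float,
    max_steps: int,
    seed: int = 0,
) -> int:
    """Sample batches until `step` reports a loss below the threshold; return steps used."""
    rng = random.Random(seed)
    for n in range(1, max_steps + 1):
        batch = rng.sample(list(training_set), min(batch_size, len(training_set)))
        if step(batch) < loss_threshold:  # step = forward + backprop + update
            return n
    return max_steps


# Toy run: a fake `step` that simply replays a decreasing loss curve.
fake_losses = iter([0.9, 0.5, 0.05])
steps_used = train(lambda batch: next(fake_losses),
                   [("d", "A", 1), ("d", "B", 0)], 2, 0.1, 10)
```

The `max_steps` cap is an added safeguard, not part of the described method, so that the loop ends even if the threshold is never reached.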
The embodiment of the application also provides an ICD automatic coding device based on artificial intelligence, which comprises:
the first building unit is used for building an atlas embedding layer based on at least one entity category in the medical knowledge atlas;
The second building unit is used for building an ICD coding initial model based on the map embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises a text embedding layer and the map embedding layer;
an acquisition unit for acquiring a plurality of sets of pairs of samples of diagnostic terms and ICD terms as a training set;
the training unit is used for training the ICD coding initial model based on the training set to obtain the ICD coding target model, wherein the input of the ICD coding target model is a diagnosis term and an ICD term, and the output of the ICD coding target model is the similarity between the diagnosis term and the ICD term;
and the coding unit is used for inputting the diagnostic terms to be coded into the ICD coding target model, calculating the similarity between the diagnostic terms to be coded and each ICD term, and taking the ICD term corresponding to the maximum value of the similarity as a coding result of the diagnostic terms to be coded.
The embodiment of the application also provides electronic equipment, which comprises:
a memory storing at least one instruction;
and a processor executing the instructions stored in the memory to implement the artificial intelligence-based ICD automatic coding method.
Embodiments of the present application also provide a computer readable storage medium having stored therein at least one instruction for execution by a processor in an electronic device to implement the artificial intelligence based ICD auto-encoding method.
In summary, the application extracts entities of different entity categories from the diagnosis term and the ICD term by means of the medical knowledge graph to construct the sub-knowledge graph of each entity category, and performs feature extraction for each entity category by means of the graph embedding layer, so that the graph embedding vector can be obtained rapidly; the text embedding layer extracts features of the text in the diagnosis term and the ICD term that does not belong to any entity to obtain text embedding vectors; the graph embedding vectors and the text embedding vectors are fused to obtain accurate diagnosis vectors and ICD vectors, from which the similarity between the diagnosis term and the ICD term is obtained; ICD automatic coding is realized by calculating the similarity between diagnosis terms and ICD terms, which improves the accuracy of ICD automatic coding.
Drawings
Fig. 1 is a flow chart of a preferred embodiment of an artificial intelligence based ICD auto-coding method in accordance with the present application.
Fig. 2 is a schematic diagram of the structure of the atlas-embedding layer according to the present application.
Fig. 3 is a schematic structural diagram of an ICD encoding initial model according to the present application.
Fig. 4 is a functional block diagram of a preferred embodiment of an artificial intelligence based ICD automatic coding device in accordance with the present application.
Fig. 5 is a schematic structural diagram of an electronic device according to a preferred embodiment of the ICD automatic coding method based on artificial intelligence according to the present application.
Detailed Description
The application will be described in detail below with reference to the drawings and specific embodiments so that its objects, features and advantages can be understood more clearly. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application; the described embodiments are merely some, rather than all, of the embodiments of the present application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more of the described features. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The embodiment of the application provides an ICD automatic coding method based on artificial intelligence, which can be applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device and the like.
The electronic device may be any electronic product that can interact with a user in a human-machine manner, such as a personal computer, tablet, smart phone, personal digital assistant (PDA), gaming machine, interactive Internet Protocol television (IPTV), smart wearable device, etc.
The electronic device may also include a network device and/or a client device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud, based on cloud computing, composed of a large number of hosts or network servers.
The network in which the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
As shown in fig. 1, a flowchart of a preferred embodiment of the ICD automatic coding method based on artificial intelligence of the present application is shown. The order of the steps in the flowchart may be changed and some steps may be omitted according to various needs. The ICD automatic coding method based on artificial intelligence provided by the embodiment of the application can be applied to any scene needing ICD automatic coding, and can be applied to products of the scenes, such as disease classification in the field of digital medical treatment and the like.
S10, building an atlas embedding layer based on at least one entity category in the medical knowledge atlas.
In an alternative embodiment, the medical knowledge graph is a semantic network representing the association relationships between different entities. The entities include text expressions such as the concepts, names, and aliases of entities of at least one entity category. The association relationships between entities include belonging to, including, causing, manifesting as, modifying, limiting, deteriorating, alleviating, and the like. The association relationships are bidirectional: for example, if A belongs to B, then B includes A; if A manifests as C, then C is a manifestation of A. The medical knowledge graph is composed of a large number of triples, each comprising two entities and the association relationship between them; for example, the triple (A, belongs to, B) indicates that entity A belongs to entity B.
The entity categories include, but are not limited to, body structure, disease core word, disease type, and disease nature. Each entity category includes a plurality of entities: the entity category "body structure" includes entities such as vertebrae, pancreatic duct, thoracoabdominal, and pulmonary tissue; the entity category "disease core word" includes entities such as inflammation, contusion, polyp, cyst, and tumor; the entity category "disease type" includes entities such as type 1, type 2, type a, and type b; the entity category "disease nature" includes entities such as secondary, degenerative, wheezing, and osseous.
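As a small illustration of the triple structure and the bidirectional relationships, the sketch below stores triples as (head, relation, tail) tuples and adds the inverse triple where an inverse relation is known. The `INVERSE` table and the function name `with_inverses` are assumptions for illustration; only the "belongs to" / "includes" pair is taken from the example in the text.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Only this inverse pair is given in the text; other relations are left out.
INVERSE: Dict[str, str] = {"belongs to": "includes", "includes": "belongs to"}


def with_inverses(triples: List[Triple]) -> List[Triple]:
    """Return the triples plus the inverse of each triple whose relation has a known inverse."""
    result = list(triples)
    for head, relation, tail in triples:
        inverse = INVERSE.get(relation)
        if inverse is not None and (tail, inverse, head) not in result:
            result.append((tail, inverse, head))
    return result


graph = with_inverses([("A", "belongs to", "B")])
```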
Referring to fig. 2, fig. 2 is a schematic structural diagram of the map embedding layer according to the present application. In an optional embodiment, the medical knowledge graph includes all entities and the association relationships between any of the entities, and building the graph embedding layer based on at least one entity category in the medical knowledge graph includes:
extracting all entities of the same entity category from the medical knowledge graph and the association relation between the entities to obtain a sub-knowledge graph of each entity category;
the input of the map embedding layer is the sub-knowledge graph of each entity category, and its output is a map embedding vector representing the medical knowledge in all the sub-knowledge graphs; the map embedding layer comprises a plurality of feature extraction layers and a feature fusion layer, and the feature extraction layers are in one-to-one correspondence with the entity categories;
The feature extraction layer is used for extracting features of sub-knowledge maps of corresponding entity categories to obtain coding vectors of each entity in the entity categories, and adding the coding vectors of all the entities to obtain medical feature vectors of the entity categories;
the feature fusion layer is used for fusing the medical feature vectors of all entity categories to obtain a map embedding vector, and the map embedding vector meets the relation:
$$H = \sum_{n=1}^{N} \alpha_n h_n + \beta$$
wherein $N$ is the number of all entity categories, $h_n$ is the medical feature vector of entity class $n$, $\alpha_n$ is the weight coefficient of entity class $n$, $\beta$ is a bias parameter, and $H$ is the map embedding vector.
The sub-knowledge graph of an entity category contains only entities of that category and the association relationships among them. Taking the entity category "disease core word" as an example, the association relationship between the entities "polyp" and "tumor" is "deteriorates", forming the triple (polyp, deteriorates, tumor); the association relationship between the entities "contusion" and "inflammation" is "causes", forming the triple (contusion, causes, inflammation).
There may be an association relationship between entities of different entity classes in the medical knowledge graph, but if the association relationship between entities of different entity classes is considered in the graph embedding layer, the calculation amount is greatly increased, so that the training time of the graph embedding layer is prolonged; therefore, the sub-knowledge graph is constructed for each entity category, only the association relation among the entities of the same entity category is considered, and then the medical feature vectors of different entity categories are fused through the feature fusion layer, so that the calculated amount in the graph embedding layer can be reduced.
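The per-category split can be sketched as a filter over triples that keeps only those whose two entities share a category. The function name `split_by_category` and the `category_of` lookup are hypothetical, and the cross-category triple in the example is made up to show what gets dropped.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def split_by_category(
    triples: List[Triple], category_of: Dict[str, str]
) -> Dict[str, List[Triple]]:
    """Keep only intra-category triples, grouped by entity category."""
    sub_graphs: Dict[str, List[Triple]] = defaultdict(list)
    for head, relation, tail in triples:
        head_cat, tail_cat = category_of.get(head), category_of.get(tail)
        if head_cat is not None and head_cat == tail_cat:
            sub_graphs[head_cat].append((head, relation, tail))
    return dict(sub_graphs)


category_of = {
    "polyp": "disease core word",
    "tumor": "disease core word",
    "contusion": "disease core word",
    "inflammation": "disease core word",
    "pancreatic duct": "body structure",
}
triples = [
    ("polyp", "deteriorates", "tumor"),
    ("contusion", "causes", "inflammation"),
    ("inflammation", "located in", "pancreatic duct"),  # cross-category, dropped
]
sub_graphs = split_by_category(triples, category_of)
```

Note that in this simplified form an entity with no intra-category relationship does not appear in any sub-graph; a fuller version would also carry isolated entities.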
The feature extraction layer is structured as a graph neural network; its input is the sub-knowledge graph of the corresponding entity category and its output is the medical feature vector of that entity category. Each entity is a node in the graph neural network, and its encoding vector is continuously updated by the graph neural network so that it represents the association relationships between that entity and the other entities of the same entity category in the medical knowledge graph; the encoding vectors of all entities of the same entity category are then added to obtain the medical feature vector of the entity category.
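To make the node-update-then-sum idea concrete, the following is a deliberately simple sketch in plain Python. It assumes a single round of mean aggregation over undirected edges with no learned weights, which is far simpler than a trained graph neural network; the names `message_pass` and `category_feature` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def message_pass(
    vectors: Dict[str, Vector], edges: List[Tuple[str, str]]
) -> Dict[str, Vector]:
    """One round: each node becomes the mean of itself and its neighbours."""
    neighbours: Dict[str, List[str]] = {name: [name] for name in vectors}
    for a, b in edges:  # relations are treated as undirected in this sketch
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated: Dict[str, Vector] = {}
    for name, group in neighbours.items():
        dim = len(vectors[name])
        updated[name] = [
            sum(vectors[g][k] for g in group) / len(group) for k in range(dim)
        ]
    return updated


def category_feature(vectors: Dict[str, Vector]) -> Vector:
    """Sum the entity encoding vectors of one category into its medical feature vector."""
    dim = len(next(iter(vectors.values())))
    return [sum(v[k] for v in vectors.values()) for k in range(dim)]


initial = {"polyp": [1.0, 0.0], "tumor": [0.0, 1.0], "cyst": [2.0, 2.0]}
updated = message_pass(initial, [("polyp", "tumor")])
feature = category_feature(updated)
```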
The weight coefficient $\alpha_n$ of entity class $n$ and the bias parameter $\beta$ are trainable parameters whose specific values are determined by the training of the map embedding layer. The map embedding vector output by the map embedding layer contains the medical knowledge in all sub-knowledge graphs and provides a medical theoretical basis for improving the accuracy of ICD automatic coding.
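A minimal numeric sketch of the feature fusion layer, assuming the fusion is a weighted sum of the category feature vectors plus a bias, as the weight coefficients and bias parameter suggest. Treating the bias as a scalar added to every component is an assumption, and the name `fuse` is hypothetical; in the real layer the weights and bias would be learned.

```python
from typing import List, Sequence


def fuse(
    features: Sequence[Sequence[float]], alphas: Sequence[float], beta: float
) -> List[float]:
    """Weighted sum of per-category feature vectors plus a scalar bias."""
    if len(features) != len(alphas):
        raise ValueError("one weight coefficient is needed per entity category")
    dim = len(features[0])
    return [
        sum(a * h[k] for a, h in zip(alphas, features)) + beta for k in range(dim)
    ]


# Two entity categories with 2-dimensional medical feature vectors.
H = fuse([[1.0, 2.0], [3.0, 4.0]], alphas=[0.5, 0.25], beta=1.0)
```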
Therefore, sub-knowledge maps of different entity categories are constructed based on the medical knowledge maps, the map embedding layer performs feature extraction on each sub-knowledge map and then performs feature fusion to obtain map embedding vectors of the overall features of all sub-knowledge maps, the calculated amount of obtaining the map embedding vectors is reduced, and meanwhile feature extraction of association relations among all entities is realized.
S11, an ICD coding initial model is built based on the map embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises a text embedding layer and the map embedding layer.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an ICD coding initial model according to the present application. In an optional embodiment, the building an ICD coding initial model based on the atlas embedding layer includes:
the input layer is used for receiving any diagnosis term and any ICD term;
inputting the diagnostic term into the embedded layer to perform preprocessing operation to obtain first input data and second input data;
inputting the first input data into the atlas embedding layer to obtain an atlas embedding vector, inputting the second input data into the text embedding layer to obtain a text embedding vector, and splicing the atlas embedding vector and the text embedding vector to obtain a diagnosis vector of the diagnosis term;
inputting the ICD term into the embedding layer to execute the preprocessing operation, and obtaining an ICD vector of the ICD term based on the text embedding layer and the map embedding layer;
and inputting the diagnosis vector and the ICD vector into the output layer, and outputting the similarity of the diagnosis term and the ICD term.
The text embedding layer is a model for processing text data, such as BERT, LSTM, or Transformer from the field of natural language processing. The embedding layer performs the same operations on the diagnosis term and the ICD term to obtain the diagnosis vector and the ICD vector.
The output layer is formed by connecting an N-dimensional fully connected layer, a normalization layer and a 1-dimensional fully connected layer in series. The N-dimensional fully connected layer receives the diagnosis vector and the ICD vector, where N is the sum of the dimensions of the diagnosis vector and the ICD vector, and the 1-dimensional fully connected layer outputs the similarity between the diagnosis term and the ICD term. The similarity takes values in the range [0, 1], with larger values indicating greater similarity between the diagnosis term and the ICD term. For example, if the diagnosis vector and the ICD vector are both 10-dimensional vectors, that is, each contains 10 numerical values, then N is 20.
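The output layer can be sketched in plain Python as below. Several details are assumptions not stated in the text: the normalization layer is taken to be layer normalization, the final score is squashed with a sigmoid so that it stays between 0 and 1, and all weights are passed in explicitly rather than learned. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def linear(x: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> Vector:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (assumed normalization type)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def similarity_head(diag: Sequence[float], icd: Sequence[float],
                    w1: Sequence[Sequence[float]], b1: Sequence[float],
                    w2: Sequence[Sequence[float]], b2: Sequence[float]) -> float:
    joined = list(diag) + list(icd)        # N = len(diag) + len(icd)
    hidden = layer_norm(linear(joined, w1, b1))
    (logit,) = linear(hidden, w2, b2)      # 1-dimensional fully connected layer
    return 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid, keeps score in (0, 1)


n = 4  # two 2-dimensional vectors, so N = 4
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
score = similarity_head([1.0, 0.0], [0.0, 1.0],
                        identity, [0.0] * n, [[0.0] * n], [0.0])
```

With all-zero final weights the logit is 0, so the untrained sketch returns a neutral similarity of 0.5.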
In an alternative embodiment, said inputting said diagnostic term into said embedded layer to perform a preprocessing operation resulting in first input data and second input data comprises:
extracting entities of different entity categories from the diagnostic terms based on the medical knowledge graph to construct an entity extraction set of each entity category;
judging, for each entity category, whether each entity in the corresponding sub-knowledge graph is in the entity extraction set; if the entity is in the entity extraction set, retaining the entity in the sub-knowledge graph, and if the entity is not in the entity extraction set, replacing the entity in the sub-knowledge graph with empty text;
after traversing all entities in the sub-knowledge graph corresponding to each entity category, obtaining a diagnosis sub-knowledge graph of each entity category;
taking the diagnosis sub-knowledge graphs of all entity categories as the first input data;
and replacing all entities in the diagnostic term with a preset character, and padding the tail of the diagnostic term with the preset character until a preset length is reached, to obtain the second input data.
The preset character is the character "0"; the preset length is 20, that is, the second input data is always text data of length 20. The first input data includes the diagnosis sub-knowledge graphs of all entity categories, and each diagnosis sub-knowledge graph includes only entities that appear in the diagnostic term.
It should be noted that the first input data carries the features of all entities in the diagnostic term that are covered by the medical knowledge graph. The second input data carries the characters of the diagnostic term that are not covered by the medical knowledge graph; it reflects text that falls outside the medical knowledge graph because of misspellings, colloquial wording, or a medical knowledge graph that has not been updated in time, and the features of this text are learned by the text embedding layer. Together, the first input data and the second input data reflect all features of the diagnostic term.
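As a non-limiting illustration, the preprocessing operation can be sketched as follows. The function names, the category names, and the English example strings are hypothetical. Each sub-knowledge graph is reduced to a list of entity names (association relations are omitted), each extracted entity is replaced by a single preset character, and over-long terms are truncated; these three points are assumptions of the sketch rather than statements from the description.

```python
def build_first_input(sub_graphs, extraction_sets):
    """Keep entities found in the diagnostic term; replace the others with empty text."""
    first_input = {}
    for category, entities in sub_graphs.items():
        found = extraction_sets.get(category, set())
        first_input[category] = [e if e in found else "" for e in entities]
    return first_input


def build_second_input(term, extraction_sets, fill_char="0", length=20):
    """Replace extracted entities with the preset character, then pad the tail."""
    entities = sorted(
        (e for found in extraction_sets.values() for e in found),
        key=len,
        reverse=True,  # replace longer entities first so substrings do not interfere
    )
    for entity in entities:
        term = term.replace(entity, fill_char)
    # Truncation of over-long terms is an assumption; the source only describes padding.
    return term[:length].ljust(length, fill_char)


# Hypothetical sub-knowledge graphs, one list of entities per entity category.
sub_graphs = {
    "disease": ["pneumonia", "fracture"],
    "body_part": ["lung", "maxillary sinus"],
}
term = "acute pneumonia of lung"
extraction_sets = {"disease": {"pneumonia"}, "body_part": {"lung"}}

first_input = build_first_input(sub_graphs, extraction_sets)
second_input = build_second_input(term, extraction_sets)
```

Here `first_input` keeps only "pneumonia" and "lung", while `second_input` retains the non-entity text and always has length 20.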
In an alternative embodiment, the extracting entities of different entity classes from the diagnostic term based on the medical knowledge-graph to construct an entity extraction set of each entity class includes:
obtaining confusion entities of each entity in the medical knowledge graph;
constructing a dictionary tree based on the entities in the medical knowledge graph and the confusing entities;
querying the dictionary tree to obtain entities or confusing entities contained in the diagnosis terms, and classifying and storing the entities or confusing entities contained in the diagnosis terms according to entity categories to obtain an initial entity extraction set of each entity category;
and replacing all the confused entities in the initial entity extraction set of each entity category with the entities corresponding to the confused entities to obtain the entity extraction set of each entity category.
A confusing entity is a word that looks similar to the entity in written form, or a near-synonym of the entity. For example, a confusing entity of the entity "cataract" is "self-internal disorder", and a confusing entity of the entity "rash" is "skin diagnosis" (both examples are literal translations of look-alike miswritten forms of the entity). The number of confusing entities corresponding to one entity may be 0, 1, or more, and the confusing entities are obtained through statistics. The dictionary tree is a multi-way tree structure used for fast retrieval, and it stores all entities together with the confusing entities of each entity.
It should be noted that, because the confusing entities participate in the construction of the dictionary tree, entity extraction failures caused by misspellings, colloquial wording, common names, inaccurate OCR recognition, and similar factors in the diagnostic term are avoided.
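As a non-limiting illustration, the dictionary tree lookup can be sketched as a character trie. The class `EntityTrie`, the greedy longest-match scan, and the English example strings (including the misspelling "cateract") are assumptions of this sketch. For brevity, each confusing entity is stored together with its corresponding entity, so the extraction step and the replacement step described above are merged into one pass.

```python
_END = "<end>"  # multi-character key, so it cannot collide with a single character


class EntityTrie:
    """Dictionary tree storing entities and their confusing entities."""

    def __init__(self):
        self.root = {}

    def add(self, surface, canonical, category):
        """Store a surface form that maps to a canonical entity and its category."""
        node = self.root
        for ch in surface:
            node = node.setdefault(ch, {})
        node[_END] = (canonical, category)

    def extract(self, text):
        """Greedy longest-match scan; returns {category: {canonical entities}}."""
        result = {}
        i = 0
        while i < len(text):
            node, match, end = self.root, None, i
            for j in range(i, len(text)):
                node = node.get(text[j])
                if node is None:
                    break
                if _END in node:
                    match, end = node[_END], j + 1
            if match:
                canonical, category = match
                result.setdefault(category, set()).add(canonical)
                i = end
            else:
                i += 1
        return result


trie = EntityTrie()
trie.add("cataract", "cataract", "disease")
trie.add("cateract", "cataract", "disease")  # hypothetical confusing entity
trie.add("rash", "rash", "symptom")

extraction_sets = trie.extract("left eye cateract with rash")
```

The misspelled surface form is mapped back to the entity "cataract", so the resulting entity extraction set contains only entities of the medical knowledge graph.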
In this way, the ICD coding initial model is built. The ICD coding initial model can compute and output the similarity between any diagnostic term and any ICD term. By performing entity extraction on the diagnostic term or the ICD term and splitting it into first input data and second input data, the embedding layer of the ICD coding initial model avoids the influence of misspellings, colloquial wording, common names, and similar factors in the diagnostic term on the accuracy of ICD automatic coding.
S12, collecting a plurality of groups of sample pairs of diagnosis terms and ICD terms to serve as a training set.
In an alternative embodiment, the diagnostic term is case information or diagnostic results written by a physician during the actual diagnosis. The ICD terms are ICD codes of different disease classifications stored in a disease diagnosis code library that includes ICD codes of all disease classifications.
In an alternative embodiment, collecting a plurality of sets of pairs of samples of diagnostic terms and ICD terms as a training set includes:
collecting an image with case information or diagnosis results, and extracting text information in the image by utilizing an optical character recognition technology to obtain diagnosis terms;
obtaining a coding result of any diagnosis term, wherein the coding result is an ICD term corresponding to the diagnosis term;
taking the diagnostic term and the coding result as a positive sample pair, and taking the diagnostic term and any ICD term other than the coding result as a negative sample pair;
multiple positive sample pairs and multiple negative sample pairs are collected and stored as training sets.
The method for acquiring the coding result of any diagnosis term is expert annotation. The optical character recognition (Optical Character Recognition, OCR) refers to a process of analyzing and recognizing an image file of a text material to obtain text and layout information, that is, recognizing the text in the image and returning the text in the form of text.
Illustratively, the ICD term corresponding to the diagnostic term "acute bronchopneumonia lobular pneumonia" is the ICD code J98.414 corresponding to the disease classification "pulmonary infection"; the ICD term corresponding to the diagnosis term "right maxillary sinus lateral wall fracture" is ICD code S02.4 corresponding to the disease classification "cheekbone and maxillary fracture".
In this way, the corresponding relations between a plurality of groups of diagnosis terms and ICD terms are collected, each group of corresponding relations is used as a group of training pairs, a training set is obtained, and a data base is provided for realizing ICD automatic coding.
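As a non-limiting illustration, the construction of positive and negative sample pairs can be sketched as follows. The function name `build_training_set`, the `negatives_per_term` parameter, and the 0/1 labels are assumptions of this sketch; the two annotated examples are taken from the description above, and expert annotation is represented by a plain dictionary.

```python
import random


def build_training_set(annotations, icd_terms, negatives_per_term=1, seed=0):
    """Return (diagnostic_term, icd_term, label) triples; 1 = positive, 0 = negative."""
    rng = random.Random(seed)
    samples = []
    for diag_term, coded_icd in annotations.items():
        # Positive pair: the diagnostic term with its annotated coding result.
        samples.append((diag_term, coded_icd, 1))
        # Negative pairs: the diagnostic term with any other ICD term.
        candidates = [t for t in icd_terms if t != coded_icd]
        count = min(negatives_per_term, len(candidates))
        for icd in rng.sample(candidates, count):
            samples.append((diag_term, icd, 0))
    return samples


annotations = {
    "acute bronchopneumonia lobular pneumonia": "J98.414",
    "right maxillary sinus lateral wall fracture": "S02.4",
}
icd_terms = ["J98.414", "S02.4"]
training_set = build_training_set(annotations, icd_terms)
```

Sampling only a few negatives per diagnostic term keeps the training set balanced; the description itself does not fix the ratio of negative to positive pairs.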
S13, training the ICD coding initial model based on the training set to obtain an ICD coding target model, wherein the input of the ICD coding target model is a diagnostic term and an ICD term, and the output is the similarity between the diagnostic term and the ICD term.
In an alternative embodiment, the ICD coding initial model is a parameterized model, and training of the ICD coding initial model is required in order to constrain the ICD coding initial model to accurately output the similarity between any diagnostic term and any ICD term.
In an alternative embodiment, said training said ICD encoding initial model based on said training set to obtain said ICD encoding target model comprises:
randomly selecting a preset number of sample pairs from the training set, wherein the sample pairs comprise the positive sample pairs and the negative sample pairs;
inputting each sample pair into the ICD coding initial model to obtain the prediction similarity of each sample pair;
calculating a value of a loss function based on the predicted similarity of each sample pair, the loss function satisfying the relationship:
where $M^{+}$ is the number of positive sample pairs, $Sim_u$ is the predicted similarity of positive sample pair $u$, $M^{-}$ is the number of negative sample pairs, $Sim_v$ is the predicted similarity of negative sample pair $v$, $\lVert Sim_u - 1 \rVert_2$ denotes the L2 norm of the difference between $Sim_u$ and 1, and $Loss$ is the value of the loss function;
updating the ICD coding initial model according to a back propagation algorithm to reduce the value of the loss function;
and continuing to randomly select sample pairs from the training set to update the ICD coding initial model until the value of the loss function is smaller than a preset loss value, thereby obtaining the ICD coding target model.
Wherein the preset number is 32, and the preset loss value is 0.001.
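As a non-limiting illustration, one loss that is consistent with the symbol definitions above can be sketched as follows. The exact relation is not reproduced in this text, so the form used here (the mean distance of positive similarities to 1 plus the mean distance of negative similarities to 0) is an assumption, as are the function names. The back propagation update itself is omitted because it depends on the chosen model framework.

```python
import random

BATCH_SIZE = 32          # preset number of sample pairs
LOSS_THRESHOLD = 0.001   # preset loss value


def pair_loss(positive_sims, negative_sims):
    """Assumed loss: push positive similarities toward 1 and negative ones toward 0."""
    # For scalar similarities the L2 norm of a difference is its absolute value.
    pos = sum(abs(s - 1.0) for s in positive_sims) / max(len(positive_sims), 1)
    neg = sum(abs(s - 0.0) for s in negative_sims) / max(len(negative_sims), 1)
    return pos + neg


def sample_batch(training_set, batch_size=BATCH_SIZE, seed=None):
    """Randomly select a preset number of sample pairs from the training set."""
    rng = random.Random(seed)
    return rng.sample(training_set, min(batch_size, len(training_set)))


def should_stop(loss_value, threshold=LOSS_THRESHOLD):
    """Training stops once the loss falls below the preset loss value."""
    return loss_value < threshold


perfect = pair_loss([1.0, 1.0], [0.0])
imperfect = pair_loss([0.8], [0.1, 0.3])
```

A training loop would repeatedly call `sample_batch`, run the model on each pair, compute `pair_loss`, apply a back propagation update, and exit when `should_stop` returns true.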
Thus, training of the ICD coding initial model is completed and the ICD coding target model is obtained. The input of the ICD coding target model is a diagnostic term and an ICD term, and its output is an accurate similarity between the diagnostic term and the ICD term.
S14, inputting the diagnostic terms to be encoded into the ICD encoding target model, calculating the similarity between the diagnostic terms to be encoded and each ICD term, and taking the ICD term corresponding to the maximum value of the similarity as an encoding result of the diagnostic terms to be encoded.
In an alternative embodiment, the number of ICD terms is finite, so the similarity between the diagnostic term to be encoded and each ICD term is calculated based on the ICD coding target model, and the ICD term corresponding to the maximum similarity is taken as the coding result of the diagnostic term to be encoded.
Thus, the ICD term corresponding to any diagnostic term to be encoded can be obtained, realizing ICD automatic coding of the diagnostic term to be encoded.
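As a non-limiting illustration, step S14 amounts to an argmax over the finite set of ICD terms. In the sketch below, `encode` is a hypothetical helper, and `toy_similarity` with its hard-coded scores is a stand-in for the trained ICD coding target model, not an output of any real model.

```python
def encode(diag_term, icd_terms, similarity_fn):
    """Return the ICD term with the maximum similarity to the diagnostic term."""
    return max(icd_terms, key=lambda icd: similarity_fn(diag_term, icd))


# Hypothetical scores standing in for the trained ICD coding target model.
toy_scores = {
    ("right maxillary sinus lateral wall fracture", "S02.4"): 0.93,
    ("right maxillary sinus lateral wall fracture", "J98.414"): 0.08,
}


def toy_similarity(diag_term, icd_term):
    """Look up a made-up similarity; unknown pairs score 0."""
    return toy_scores.get((diag_term, icd_term), 0.0)


result = encode(
    "right maxillary sinus lateral wall fracture",
    ["J98.414", "S02.4"],
    toy_similarity,
)
```

Because every ICD term is scored once per diagnostic term, the cost of encoding grows linearly with the size of the disease diagnosis code library.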
According to the above technical scheme, the present application extracts entities of different entity categories from the diagnostic term and the ICD term by means of the medical knowledge graph to construct a sub-knowledge graph for each entity category, and performs feature extraction for each entity category by means of the atlas embedding layer, so that the atlas embedding vector can be obtained quickly; features of the text that does not belong to any entity in the diagnostic term and the ICD term are extracted by means of the text embedding layer to obtain text embedding vectors, and the atlas embedding vectors and the text embedding vectors are fused to obtain accurate diagnosis vectors and ICD vectors, from which the similarity between the diagnostic term and the ICD term is obtained; ICD automatic coding is realized by calculating the similarity between the diagnostic term and the ICD terms, and the accuracy of ICD automatic coding is improved.
Referring to fig. 4, fig. 4 is a functional block diagram of a preferred embodiment of the ICD automatic coding device based on artificial intelligence of the present application. The ICD automatic coding device 11 based on artificial intelligence includes a first building unit 110, a second building unit 111, an acquisition unit 112, a training unit 113, and a coding unit 114. The module/unit referred to herein is a series of computer readable instructions capable of being executed by the processor 13 and of performing a fixed function, stored in the memory 12. In the present embodiment, the functions of the respective modules/units will be described in detail in the following embodiments.
In an alternative embodiment, the first building unit 110 is configured to build the atlas embedding layer based on at least one entity class in the medical knowledge atlas.
In an optional embodiment, the medical knowledge graph includes all entities and the association relations between any entities, and building the atlas embedding layer based on at least one entity category in the medical knowledge graph includes:
extracting all entities of the same entity category from the medical knowledge graph and the association relation between the entities to obtain a sub-knowledge graph of each entity category;
the input of the atlas embedding layer is the sub-knowledge graph of each entity category, and its output is an atlas embedding vector that represents the medical knowledge in all sub-knowledge graphs; the atlas embedding layer comprises a plurality of feature extraction layers and a feature fusion layer, and the feature extraction layers are in one-to-one correspondence with the entity categories;
the feature extraction layer is used for extracting features of the sub-knowledge graph of the corresponding entity category to obtain a coding vector of each entity in the entity category, and adding the coding vectors of all the entities to obtain the medical feature vector of the entity category;
the feature fusion layer is used for fusing the medical feature vectors of all entity categories to obtain the atlas embedding vector, and the atlas embedding vector satisfies the relation:
where $N$ is the number of all entity categories, $h_n$ is the medical feature vector of entity category $n$, $\alpha_n$ is the weight coefficient of entity category $n$, $\beta$ is a bias parameter, and $H$ is the atlas embedding vector.
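As a non-limiting illustration, the feature extraction and feature fusion steps can be sketched as follows. The fusion relation is not reproduced in this text, so the weighted sum with a scalar bias used here, $H = \sum_{n=1}^{N} \alpha_n h_n + \beta$, is an assumption inferred from the symbol definitions; the function names and the numeric values are likewise hypothetical.

```python
def category_feature(entity_vectors):
    """Add the coding vectors of all entities in one entity category."""
    return [sum(values) for values in zip(*entity_vectors)]


def fuse(category_features, alphas, beta):
    """Assumed fusion: weighted sum of the category vectors plus a scalar bias."""
    dim = len(category_features[0])
    return [
        sum(a * h[d] for a, h in zip(alphas, category_features)) + beta
        for d in range(dim)
    ]


# Two hypothetical entity categories with 2-dimensional coding vectors.
h1 = category_feature([[1.0, 2.0], [3.0, 4.0]])  # category with two entities
h2 = category_feature([[0.5, 0.5]])              # category with one entity
H = fuse([h1, h2], alphas=[0.5, 2.0], beta=0.1)
```

In a trained model the weight coefficients and the bias parameter would be learned together with the rest of the ICD coding initial model.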
In an alternative embodiment, the second building unit 111 is configured to build an ICD coding initial model based on the atlas embedding layer, the ICD coding initial model comprising an input layer, an embedding layer, and an output layer, the embedding layer comprising a text embedding layer and the atlas embedding layer.
In an optional embodiment, the building an ICD coding initial model based on the atlas embedding layer includes:
the input layer is used for receiving any diagnosis term and any ICD term;
inputting the diagnostic term into the embedded layer to perform preprocessing operation to obtain first input data and second input data;
inputting the first input data into the atlas embedding layer to obtain an atlas embedding vector, inputting the second input data into the text embedding layer to obtain a text embedding vector, and concatenating the atlas embedding vector and the text embedding vector to obtain a diagnosis vector of the diagnostic term;
inputting the ICD term into the embedding layer to perform the same preprocessing operation, and obtaining an ICD vector of the ICD term based on the text embedding layer and the atlas embedding layer;
and inputting the diagnosis vector and the ICD vector into the output layer, and outputting the similarity of the diagnosis term and the ICD term.
In an alternative embodiment, said inputting said diagnostic term into said embedded layer to perform a preprocessing operation resulting in first input data and second input data comprises:
extracting entities of different entity categories from the diagnostic terms based on the medical knowledge graph to construct an entity extraction set of each entity category;
judging whether each entity in the corresponding sub-knowledge graph is in the entity extraction set or not according to each entity category, if the entity is in the entity extraction set, reserving the entity in the sub-knowledge graph, and if the entity is not in the entity extraction set, replacing the entity in the sub-knowledge graph with an empty text;
after traversing all entities in the sub-knowledge graph corresponding to each entity category, obtaining a diagnosis sub-knowledge graph of each entity category;
taking the diagnosis sub-knowledge graphs of all entity categories as the first input data;
and replacing all entities in the diagnostic term with a preset character, and padding the tail of the diagnostic term with the preset character until a preset length is reached, to obtain the second input data.
In an alternative embodiment, the extracting entities of different entity classes from the diagnostic term based on the medical knowledge-graph to construct an entity extraction set of each entity class includes:
obtaining confusion entities of each entity in the medical knowledge graph;
constructing a dictionary tree based on the entities in the medical knowledge graph and the confusing entities;
querying the dictionary tree to obtain entities or confusing entities contained in the diagnosis terms, and classifying and storing the entities or confusing entities contained in the diagnosis terms according to entity categories to obtain an initial entity extraction set of each entity category;
and replacing all the confused entities in the initial entity extraction set of each entity category with the entities corresponding to the confused entities to obtain the entity extraction set of each entity category.
In an alternative embodiment, the acquisition unit 112 is configured to collect a plurality of sets of sample pairs of diagnostic terms and ICD terms as a training set.
In an alternative embodiment, collecting a plurality of sets of pairs of samples of diagnostic terms and ICD terms as a training set includes:
collecting an image with case information or diagnosis results, and extracting text information in the image by utilizing an optical character recognition technology to obtain diagnosis terms;
obtaining a coding result of any diagnosis term, wherein the coding result is an ICD term corresponding to the diagnosis term;
taking the diagnostic term and the coding result as a positive sample pair, and taking the diagnostic term and any ICD term other than the coding result as a negative sample pair;
multiple positive sample pairs and multiple negative sample pairs are collected and stored as training sets.
In an alternative embodiment, training unit 113 is configured to train the ICD coding initial model based on the training set to obtain the ICD coding target model, where the ICD coding target model is input as a diagnostic term and an ICD term, and output as a similarity of the diagnostic term and the ICD term.
In an alternative embodiment, said training said ICD encoding initial model based on said training set to obtain said ICD encoding target model comprises:
randomly selecting a preset number of sample pairs from the training set, wherein the sample pairs comprise the positive sample pairs and the negative sample pairs;
inputting each sample pair into the ICD coding initial model to obtain the prediction similarity of each sample pair;
calculating a value of a loss function based on the predicted similarity of each sample pair, the loss function satisfying the relationship:
where $M^{+}$ is the number of positive sample pairs, $Sim_u$ is the predicted similarity of positive sample pair $u$, $M^{-}$ is the number of negative sample pairs, $Sim_v$ is the predicted similarity of negative sample pair $v$, $\lVert Sim_u - 1 \rVert_2$ denotes the L2 norm of the difference between $Sim_u$ and 1, and $Loss$ is the value of the loss function;
updating the ICD coding initial model according to a back propagation algorithm to reduce the value of the loss function;
and continuing to randomly select sample pairs from the training set to update the ICD coding initial model until the value of the loss function is smaller than a preset loss value, thereby obtaining the ICD coding target model.
In an alternative embodiment, the encoding unit 114 is configured to input a diagnostic term to be encoded into the ICD encoding target model, calculate a similarity between the diagnostic term to be encoded and each ICD term, and use an ICD term corresponding to a maximum value of the similarity as an encoding result of the diagnostic term to be encoded.
According to the above technical scheme, the present application extracts entities of different entity categories from the diagnostic term and the ICD term by means of the medical knowledge graph to construct a sub-knowledge graph for each entity category, and performs feature extraction for each entity category by means of the atlas embedding layer, so that the atlas embedding vector can be obtained quickly; features of the text that does not belong to any entity in the diagnostic term and the ICD term are extracted by means of the text embedding layer to obtain text embedding vectors, and the atlas embedding vectors and the text embedding vectors are fused to obtain accurate diagnosis vectors and ICD vectors, from which the similarity between the diagnostic term and the ICD term is obtained; ICD automatic coding is realized by calculating the similarity between the diagnostic term and the ICD terms, and the accuracy of ICD automatic coding is improved.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1 comprises a memory 12 and a processor 13. Memory 12 is used to store computer readable instructions that processor 13 uses to execute stored computer readable instructions in the memory to implement the artificial intelligence based ICD auto-coding method described in any of the embodiments above.
In an alternative embodiment, the electronic device 1 further comprises a bus, a computer program stored in said memory 12 and executable on said processor 13, for example an ICD auto-coding program based on artificial intelligence.
Fig. 5 shows only an electronic device 1 with a memory 12 and a processor 13, it being understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or a different arrangement of components.
With reference to fig. 1, the memory 12 in the electronic device 1 stores a plurality of computer readable instructions to implement an artificial intelligence based ICD automatic coding method, and the processor 13 can execute the plurality of instructions to implement:
constructing an atlas embedding layer based on at least one entity category in the medical knowledge atlas;
constructing an ICD coding initial model based on the atlas embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises a text embedding layer and the atlas embedding layer;
collecting a plurality of sets of sample pairs of diagnostic terms and ICD terms as a training set;
training the ICD coding initial model based on the training set to obtain an ICD coding target model, wherein the input of the ICD coding target model is a diagnostic term and an ICD term, and the output is the similarity between the diagnostic term and the ICD term;
inputting the diagnostic term to be encoded into the ICD coding target model, calculating the similarity between the diagnostic term to be encoded and each ICD term, and taking the ICD term corresponding to the maximum similarity as the coding result of the diagnostic term to be encoded.
Specifically, the specific implementation method of the above instructions by the processor 13 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
It will be appreciated by those skilled in the art that the schematic diagram is merely an example of the electronic device 1 and does not constitute a limitation of the electronic device 1. The electronic device 1 may have a bus-type structure or a star-type structure, and may further comprise more or fewer hardware or software components than illustrated, or a different arrangement of components; for example, the electronic device 1 may further comprise an input-output device, a network access device, and the like.
It should be noted that the electronic device 1 is only an example; other existing or future electronic products that are applicable to the present application are also included in the scope of protection of the present application and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium, which may be non-volatile or volatile. The readable storage medium includes flash memory, a removable hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 12 may in some embodiments be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. The memory 12 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. The memory 12 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of ICD auto-code programs based on artificial intelligence, etc., but also for temporarily storing data that has been output or is to be output.
The processor 13 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 13 is the control unit (Control Unit) of the electronic device 1; it connects the various components of the entire electronic device 1 using various interfaces and lines, runs or executes programs or modules stored in the memory 12 (e.g., executes an ICD automatic coding program based on artificial intelligence), and invokes data stored in the memory 12 to perform various functions of the electronic device 1 and process data.
The processor 13 executes the operating system of the electronic device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps described above in various embodiments of the artificial intelligence based ICD auto-coding method, such as the steps shown in fig. 1.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 12 and executed by the processor 13 to complete the present application. The one or more modules/units may be a series of computer readable instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the electronic device 1. For example, the computer program may be split into a first building unit 110, a second building unit 111, an acquisition unit 112, a training unit 113, an encoding unit 114.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional module is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a computer device, or a network device, etc.) or a Processor (Processor) to perform portions of the ICD auto-encoding method based on artificial intelligence according to various embodiments of the present application.
The integrated modules/units of the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as a stand alone product. Based on this understanding, the present application may also be implemented by a computer program for instructing a relevant hardware device to implement all or part of the procedures of the above-mentioned embodiment method, where the computer program may be stored in a computer readable storage medium and the computer program may be executed by a processor to implement the steps of each of the above-mentioned method embodiments.
Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory, other memories, and the like.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one arrow is shown in FIG. 5, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable connection and communication between the memory 12 and at least one processor 13 or the like.
The embodiment of the application also provides a computer readable storage medium (not shown), wherein computer readable instructions are stored in the computer readable storage medium, and the computer readable instructions are executed by a processor in an electronic device to implement the automatic ICD coding method based on artificial intelligence according to any embodiment.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware combined with software functional modules.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. Several of the elements or devices described in the specification may be embodied by one and the same item of software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present application without departing from the spirit and scope of the technical solution of the present application.

Claims (10)

1. An automatic ICD encoding method based on artificial intelligence, the method comprising:
constructing a graph embedding layer based on at least one entity category in the medical knowledge graph;
constructing an ICD coding initial model based on the graph embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises a text embedding layer and the graph embedding layer;
collecting a plurality of sets of sample pairs of diagnostic terms and ICD terms as a training set;
training the ICD coding initial model based on the training set to obtain the ICD coding target model, wherein the input of the ICD coding target model is a diagnostic term and an ICD term, and the output of the ICD coding target model is the similarity between the diagnostic term and the ICD term;
inputting diagnostic terms to be encoded into the ICD encoding target model, calculating the similarity between the diagnostic terms to be encoded and each ICD term, and taking the ICD term corresponding to the maximum value of the similarity as the encoding result of the diagnostic terms to be encoded.
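The final step of claim 1 amounts to an argmax over candidate ICD terms. The following is a minimal sketch of that selection only; `encode_diagnosis` and `toy_similarity` are hypothetical names, and the word-overlap score is merely a stand-in for the trained ICD coding target model, which the claim does not specify in code.

```python
from typing import Callable, Sequence


def encode_diagnosis(
    diagnosis: str,
    icd_terms: Sequence[str],
    similarity: Callable[[str, str], float],
) -> str:
    """Return the ICD term whose similarity to the diagnosis is largest."""
    if not icd_terms:
        raise ValueError("icd_terms must not be empty")
    return max(icd_terms, key=lambda term: similarity(diagnosis, term))


def toy_similarity(a: str, b: str) -> float:
    """Assumed stand-in for the trained model: Jaccard overlap of words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


result = encode_diagnosis(
    "acute viral hepatitis",
    ["viral hepatitis", "type 2 diabetes mellitus"],
    toy_similarity,
)
```

In practice the `similarity` argument would wrap a forward pass of the trained model over the pair (diagnostic term, ICD term).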
2. The artificial intelligence based ICD automatic coding method of claim 1, wherein collecting a plurality of sets of pairs of diagnostic terms and ICD terms as a training set comprises:
collecting an image with case information or diagnosis results, and extracting text information in the image by utilizing an optical character recognition technology to obtain diagnosis terms;
obtaining a coding result of any diagnosis term, wherein the coding result is an ICD term corresponding to the diagnosis term;
taking the diagnostic term and the coding result as a positive sample pair, and taking the diagnostic term and any ICD term other than the coding result as a negative sample pair;
collecting multiple positive sample pairs and multiple negative sample pairs and storing them as the training set.
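The pairing logic of claim 2 can be sketched as follows, assuming the diagnostic terms have already been extracted (the optical character recognition step is omitted). The function name `build_sample_pairs`, the label values 1 and 0, and the `negatives_per_positive` parameter are illustrative assumptions, not part of the claim.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_sample_pairs(
    coded: Dict[str, str],
    icd_terms: Sequence[str],
    negatives_per_positive: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Build (diagnostic term, ICD term, label) triples.

    `coded` maps each diagnostic term to its known ICD term (coding result).
    Label 1 marks a positive pair, label 0 a negative pair.
    """
    rng = random.Random(seed)
    pairs: List[Tuple[str, str, int]] = []
    for diagnosis, icd in coded.items():
        pairs.append((diagnosis, icd, 1))
        # Any ICD term other than the coding result can form a negative pair.
        candidates = [term for term in icd_terms if term != icd]
        k = min(negatives_per_positive, len(candidates))
        for negative in rng.sample(candidates, k):
            pairs.append((diagnosis, negative, 0))
    return pairs


pairs = build_sample_pairs(
    {"hep B": "viral hepatitis B"},
    ["viral hepatitis B", "type 2 diabetes mellitus", "essential hypertension"],
    negatives_per_positive=2,
)
```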
3. The ICD automatic coding method based on artificial intelligence according to claim 1, wherein the medical knowledge graph includes all entities and the association relations between any entities, and constructing the graph embedding layer based on at least one entity category in the medical knowledge graph includes:
extracting all entities of the same entity category from the medical knowledge graph, together with the association relations between those entities, to obtain a sub-knowledge graph of each entity category;
the input of the graph embedding layer is the sub-knowledge graph of each entity category, the output is a graph embedding vector used for representing the medical knowledge in all sub-knowledge graphs, the graph embedding layer comprises a plurality of feature extraction layers and a feature fusion layer, and the feature extraction layers are in one-to-one correspondence with the entity categories;
the feature extraction layer is used for extracting features of the sub-knowledge graph of the corresponding entity category to obtain a coding vector of each entity in the entity category, and adding the coding vectors of all the entities to obtain a medical feature vector of the entity category;
the feature fusion layer is used for fusing the medical feature vectors of all entity categories to obtain a graph embedding vector, and the graph embedding vector satisfies the relation:
$$H = \sum_{n=1}^{N} \alpha_n h_n + \beta$$
wherein $N$ is the number of all entity categories, $h_n$ is the medical feature vector of entity category $n$, $\alpha_n$ is the weight coefficient of entity category $n$, $\beta$ is the bias parameter, and $H$ is the graph embedding vector.
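The two operations in claim 3, summing entity coding vectors per category and then fusing the category vectors, can be illustrated with plain lists. This sketch assumes the fusion is a weighted sum of the category vectors plus a scalar bias, which is what the symbol definitions suggest; `category_feature` and `fuse` are hypothetical names, and in the claimed model the weights and bias would be learned rather than fixed.

```python
from typing import List, Sequence


def category_feature(entity_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Add the coding vectors of all entities in one entity category."""
    return [sum(column) for column in zip(*entity_vectors)]


def fuse(
    features: Sequence[Sequence[float]],
    alphas: Sequence[float],
    beta: float,
) -> List[float]:
    """Assumed fusion: H = sum_n alpha_n * h_n + beta, applied element-wise."""
    dim = len(features[0])
    return [
        sum(alpha * h[i] for alpha, h in zip(alphas, features)) + beta
        for i in range(dim)
    ]


h_disease = category_feature([[1.0, 2.0], [3.0, 4.0]])  # two entities
h_body_part = category_feature([[1.0, 1.0]])            # one entity
H = fuse([h_disease, h_body_part], alphas=[0.5, 2.0], beta=1.0)
```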
4. The artificial intelligence based ICD automatic coding method according to claim 3, wherein constructing an ICD coding initial model based on the graph embedding layer comprises:
the input layer is used for receiving any diagnostic term and any ICD term;
inputting the diagnostic term into the embedding layer to perform a preprocessing operation to obtain first input data and second input data;
inputting the first input data into the graph embedding layer to obtain a graph embedding vector, inputting the second input data into the text embedding layer to obtain a text embedding vector, and concatenating the graph embedding vector and the text embedding vector to obtain a diagnosis vector of the diagnostic term;
inputting the ICD term into the embedding layer to perform the preprocessing operation, and obtaining an ICD vector of the ICD term based on the text embedding layer and the graph embedding layer;
and inputting the diagnosis vector and the ICD vector into the output layer, and outputting the similarity of the diagnosis term and the ICD term.
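The data flow of claim 4 can be sketched as a concatenation followed by a similarity score. The claim does not say how the output layer computes similarity; cosine similarity is used here purely as an assumption, and `concat` and `cosine_similarity` are illustrative names.

```python
import math
from typing import List, Sequence


def concat(graph_vec: Sequence[float], text_vec: Sequence[float]) -> List[float]:
    """Concatenate the graph embedding vector and the text embedding vector."""
    return list(graph_vec) + list(text_vec)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed output layer: cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


diagnosis_vec = concat([1.0, 0.0], [0.5, 0.5])
icd_vec = concat([1.0, 0.0], [0.5, 0.5])
score = cosine_similarity(diagnosis_vec, icd_vec)
```

A learned output layer (for example a small feed-forward network over the two vectors) would fit the claim equally well; the sketch only fixes the shapes involved.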
5. The automatic ICD encoding method based on artificial intelligence of claim 4, wherein the inputting the diagnostic term into the embedding layer to perform a preprocessing operation to obtain first input data and second input data comprises:
extracting entities of different entity categories from the diagnostic terms based on the medical knowledge graph to construct an entity extraction set of each entity category;
for each entity category, determining whether each entity in the corresponding sub-knowledge graph is in the entity extraction set; if the entity is in the entity extraction set, retaining the entity in the sub-knowledge graph, and if the entity is not in the entity extraction set, replacing the entity in the sub-knowledge graph with an empty text;
after traversing all entities in the sub-knowledge graph corresponding to each entity category, obtaining a diagnosis sub-knowledge graph of each entity category;
taking the diagnosis sub-knowledge graphs of all entity categories as the first input data;
and replacing all entities in the diagnosis term with preset characters, and filling the preset characters at the tail of the diagnosis term to reach a preset length to obtain second input data.
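The preprocessing in claim 5 produces two inputs: a masked sub-knowledge graph and a masked, padded text. A minimal sketch follows, with several assumptions: the sub-knowledge graph is reduced to a list of entity names (edges are ignored), each entity is replaced by a single placeholder character, and the placeholder `#`, the padding character `_`, and the function names are all hypothetical.

```python
from typing import Iterable, List, Set


def mask_subgraph(subgraph_entities: Iterable[str], extracted: Set[str]) -> List[str]:
    """Keep entities found in the extraction set; replace the rest with empty text."""
    return [entity if entity in extracted else "" for entity in subgraph_entities]


def mask_and_pad(
    term: str,
    entities: Iterable[str],
    mask_char: str = "#",
    pad_char: str = "_",
    length: int = 20,
) -> str:
    """Replace each entity with a preset character, then pad the tail to `length`."""
    # Replace longer entities first so that substrings do not split them.
    for entity in sorted(entities, key=len, reverse=True):
        term = term.replace(entity, mask_char)
    # Terms already longer than `length` are returned unchanged by ljust.
    return term.ljust(length, pad_char)


first_input = mask_subgraph(["liver", "kidney", "lung"], {"liver"})
second_input = mask_and_pad("acute liver failure", ["liver"])
```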
6. The automatic ICD encoding method based on artificial intelligence of claim 5, wherein the extracting entities of different entity categories from the diagnostic terms based on the medical knowledge-graph to construct an entity extraction set of each entity category comprises:
obtaining the confusing entities of each entity in the medical knowledge graph;
constructing a dictionary tree based on the entities in the medical knowledge graph and the confusing entities;
querying the dictionary tree to obtain the entities or confusing entities contained in the diagnostic term, and classifying and storing them according to entity category to obtain an initial entity extraction set of each entity category;
and replacing every confusing entity in the initial entity extraction set of each entity category with the entity to which it corresponds, to obtain the entity extraction set of each entity category.
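The dictionary-tree lookup of claim 6 can be sketched with a small trie. The longest-match scanning strategy, the `"$end"` marker key, and the names `Trie` and `extract_entities` are assumptions made for illustration; the claim only requires that a dictionary tree be built and queried.

```python
from typing import Dict, List, Set, Tuple


class Trie:
    """Minimal dictionary tree storing a payload at the end of each word."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, word: str, payload: Tuple[str, str]) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = payload  # multi-character key, never clashes with a char

    def find_all(self, text: str) -> List[Tuple[str, str]]:
        """Scan left to right, taking the longest match at each position."""
        found, i = [], 0
        while i < len(text):
            node, j, last = self.root, i, None
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "$end" in node:
                    last = (j, node["$end"])
            if last:
                found.append(last[1])
                i = last[0]
            else:
                i += 1
        return found


def extract_entities(
    term: str,
    entity_category: Dict[str, str],
    confusion_to_entity: Dict[str, str],
) -> Dict[str, Set[str]]:
    trie = Trie()
    for entity, category in entity_category.items():
        trie.insert(entity, (entity, category))
    for confusing, entity in confusion_to_entity.items():
        trie.insert(confusing, (confusing, entity_category[entity]))
    # Initial extraction set: surface forms grouped by entity category.
    initial: Dict[str, Set[str]] = {}
    for surface, category in trie.find_all(term):
        initial.setdefault(category, set()).add(surface)
    # Replace each confusing entity with the entity it corresponds to.
    return {
        category: {confusion_to_entity.get(s, s) for s in surfaces}
        for category, surfaces in initial.items()
    }


extracted = extract_entities(
    "chronic hepatitus of the liver",
    {"hepatitis": "disease", "liver": "body part"},
    {"hepatitus": "hepatitis"},
)
```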
7. The automatic ICD coding method based on artificial intelligence of claim 2, wherein training the ICD coding initial model based on the training set to obtain the ICD coding target model comprises:
randomly selecting a preset number of sample pairs from the training set, wherein the sample pairs comprise the positive sample pairs and the negative sample pairs;
inputting each sample pair into the ICD coding initial model to obtain the predicted similarity of each sample pair;
calculating a value of a loss function based on the predicted similarity of each sample pair, the loss function satisfying the relationship:
wherein $M^+$ is the number of positive sample pairs, $Sim_u$ is the predicted similarity of positive sample pair $u$, $M^-$ is the number of negative sample pairs, $Sim_v$ is the predicted similarity of negative sample pair $v$, $\|Sim_u - 1\|_2$ denotes the L2 norm of the difference between $Sim_u$ and 1, and $Loss$ is the value of the loss function;
updating the ICD coding initial model according to a back propagation algorithm to reduce the value of the loss function;
and continuing to randomly select sample pairs from the training set to update the ICD coding initial model until the value of the loss function is smaller than a preset loss value, thereby obtaining the ICD coding target model.
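The loss computation in claim 7 can be illustrated on scalar similarities. The exact formula is not reproduced in the text, so this sketch assumes one plausible form: the mean distance of positive-pair similarities from 1 plus the mean distance of negative-pair similarities from an assumed target of 0. The function name `pair_loss` and the `neg_target` parameter are hypothetical.

```python
from typing import Sequence


def pair_loss(
    pos_sims: Sequence[float],
    neg_sims: Sequence[float],
    neg_target: float = 0.0,
) -> float:
    """Assumed loss: mean |Sim_u - 1| over positives + mean |Sim_v - target| over negatives.

    For a scalar similarity the L2 norm of (Sim - target) equals the absolute value.
    """
    pos_term = sum(abs(s - 1.0) for s in pos_sims) / len(pos_sims) if pos_sims else 0.0
    neg_term = (
        sum(abs(s - neg_target) for s in neg_sims) / len(neg_sims) if neg_sims else 0.0
    )
    return pos_term + neg_term


loss = pair_loss(pos_sims=[1.0, 0.5], neg_sims=[0.0, 0.5])
preset_loss_value = 0.1          # hypothetical stopping threshold
keep_training = loss >= preset_loss_value
```

In a real training loop this value would be computed on each randomly selected batch and minimized by back propagation in a deep learning framework until it falls below the preset loss value.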
8. An artificial intelligence based ICD automatic coding device, the device comprising:
the first building unit is used for constructing a graph embedding layer based on at least one entity category in the medical knowledge graph;
the second building unit is used for constructing an ICD coding initial model based on the graph embedding layer, wherein the ICD coding initial model comprises an input layer, an embedding layer and an output layer, and the embedding layer comprises a text embedding layer and the graph embedding layer;
an acquisition unit for acquiring a plurality of sets of sample pairs of diagnostic terms and ICD terms as a training set;
the training unit is used for training the ICD coding initial model based on the training set to obtain the ICD coding target model, wherein the input of the ICD coding target model is a diagnostic term and an ICD term, and the output of the ICD coding target model is the similarity between the diagnostic term and the ICD term;
and the coding unit is used for inputting the diagnostic terms to be coded into the ICD coding target model, calculating the similarity between the diagnostic terms to be coded and each ICD term, and taking the ICD term corresponding to the maximum value of the similarity as a coding result of the diagnostic terms to be coded.
9. An electronic device, the electronic device comprising:
a memory storing computer readable instructions; and
a processor executing the computer readable instructions stored in the memory to implement the artificial intelligence based ICD automatic coding method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the artificial intelligence based ICD automatic coding method of any one of claims 1 to 7.
CN202310463927.8A 2023-04-20 2023-04-20 ICD automatic coding method based on artificial intelligence and related equipment Pending CN116757207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310463927.8A CN116757207A (en) 2023-04-20 2023-04-20 ICD automatic coding method based on artificial intelligence and related equipment

Publications (1)

Publication Number Publication Date
CN116757207A true CN116757207A (en) 2023-09-15

Family

ID=87952152

Country Status (1)

Country Link
CN (1) CN116757207A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117010494A (en) * 2023-09-27 2023-11-07 之江实验室 Medical data generation method and system based on causal expression learning
CN117010494B (en) * 2023-09-27 2024-01-05 之江实验室 Medical data generation method and system based on causal expression learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination