CN112530534A - Method and system for distinguishing subject cancer stages based on electronic medical record - Google Patents

Method and system for distinguishing subject cancer stages based on electronic medical record

Info

Publication number
CN112530534A
CN112530534A (application number CN202011416351.2A)
Authority
CN
China
Prior art keywords
electronic medical
medical record
cancer
cancer stage
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011416351.2A
Other languages
Chinese (zh)
Other versions
CN112530534B (en)
Inventor
顾大中
付桂振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011416351.2A
Publication of CN112530534A
Application granted
Publication of CN112530534B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/903 - Querying
    • G06F16/90335 - Query processing
    • G06F16/90344 - Query processing by using string matching techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G06F18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and a system for distinguishing the subject cancer stage based on an electronic medical record, wherein the method comprises the following steps: extracting the cancer staging information in the electronic medical record to be processed; segmenting the electronic medical record to be processed; constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record; and inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into a deep learning model to obtain the probability that each cancer stage is the subject stage. The invention can determine which cancer staging information in an electronic medical record is the subject staging information, and provides reliable technical support for doctors when matching electronic medical records.

Description

Method and system for distinguishing subject cancer stages based on electronic medical record
Technical Field
The invention belongs to the technical field of intelligent medical treatment, and particularly relates to a method and a system for determining the subject cancer stage based on an electronic medical record.
Background
An Electronic Medical Record (EMR), also known as a computerized medical record system or Computer-Based Patient Record (CPR), is a digitized medical record that is stored, managed, transmitted and reproduced by electronic equipment in place of hand-written paper records.
Massive collections of electronic medical records can provide complete and accurate data, warnings, reminders and clinical decision support for physicians. Cancer staging information is a feature that must be weighted heavily when matching electronic medical records. Cancers at different stages have completely different characteristics. For example, the clinical features of early-stage cancer differ greatly from those of late-stage cancer: diagnosing early-stage cancer is difficult, while diagnosing late-stage cancer is comparatively easy. Early-stage cancer can often be removed endoscopically, whereas advanced cancer requires a combination of chemotherapy, radiotherapy, surgery and other treatments. Cases with different cancer stages therefore differ so much that they have little reference value for each other. Thus, when a physician enters a cancer case, the cases returned by the system must have a similar cancer stage, and accurately extracting the cancer staging information of a record is essential for electronic medical record matching.
In an electronic medical record for cancer there is always subject cancer staging information, i.e. the stage of the cancer that the patient actually has. Sometimes, however, there is also less important cancer staging information, such as in the patient's family history (e.g. a family member's cancer stage) or in the patient's past history (e.g. the stage of a different cancer the patient had 10 years ago). Although such information has some significance, it is not important for electronic medical record matching and instead introduces noise. For example, a doctor searching for stage-IIA gastric cancer cases wants patients who are themselves stage IIA, not patients whose family members are stage IIA. A method for determining which cancer stages are the subject stages of a case is therefore of great significance for the electronic medical record matching task.
Faced with massive amounts of electronic medical record data, there is currently no mature method that can accurately determine, in an electronic medical record containing multiple pieces of cancer staging information, which of them is the subject staging information.
Disclosure of Invention
The invention aims to provide a method and a system for determining the subject cancer stage based on an electronic medical record, which can determine which cancer staging information in the electronic medical record is the subject staging information, thereby solving the above technical problem and providing reliable technical support for doctors when matching electronic medical records.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, the invention provides a method for distinguishing the subject cancer stage based on an electronic medical record, which comprises the following steps:
extracting cancer stage information in the electronic medical record to be processed;
segmenting the electronic medical record to be processed;
constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
and inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into a deep learning model to obtain the probability that each cancer stage is the subject stage.
In a further improvement of the invention, the cancer staging information comprises at least the four standard stages stage1, stage2, stage3 and stage4, together with the position of the cancer staging information in the medical record text; segmenting the electronic medical record to be processed yields a segment label for each sentence in the record; the segment labels comprise at least the four segments B, P, R and C, where B is the background segment, P is the patient condition segment, R is the patient outcome segment and C is the summary segment.
In a further improvement of the invention, the step of extracting the cancer staging information in the electronic medical record to be processed specifically comprises:
extracting candidate cancer stage character strings;
filtering the wrong cancer staging string;
and standardizing the cancer stages after filtering the wrong cancer stage character strings to obtain final cancer stage information.
In a further improvement of the invention, standardizing the cancer stages after filtering out erroneous cancer staging character strings specifically comprises:
inputting the character sequence of the cancer stage character string into a first character-level convolutional neural network layer, and extracting the shallow semantic features of the character string;
inputting the shallow semantic features of the character string into a self-attention layer, which weights the shallow semantic features of the cancer staging character string to obtain weighted primary semantic features;
inputting the weighted primary semantic features into a second character-level convolutional neural network layer, and performing feature extraction again to obtain high-level semantic features of the cancer stage character strings;
and inputting the high-level semantic features of the cancer stage character strings into a full-connection layer, and calculating the probability that the cancer stage character strings belong to each cancer stage.
In a further improvement of the invention, the step of segmenting the electronic medical record to be processed specifically comprises:
splitting the electronic medical record into sentences, splitting each sentence into words, and feeding the words into a word embedding layer to convert them into word vectors e1, e2, …, en;
inputting the word vectors of the sentence into a bidirectional recurrent neural network to obtain hidden vectors h1, h2, …, hn;
applying an attention mechanism to the hidden vectors to obtain the representation vector s of the current sentence;
obtaining the sequence of representation vectors s1, s2, …, sm of all sentences in the electronic medical record;
inputting the sentence representation vector sequence s1, s2, …, sm into a subsequent bidirectional recurrent neural network to obtain the corresponding hidden vectors h1, h2, …, hm;
and passing the hidden vectors h1, h2, …, hm through a conditional random field layer to output the predicted label sequence y1, y2, …, ym of all sentences in the electronic medical record.
In a further improvement of the invention, constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record comprises:
establishing a cancer staging feature matrix for the electronic medical record, wherein the rows of the matrix correspond to stages and the columns correspond to segments; there are M stage types and N segment types; the value in row m and column n of the matrix is the number of times the stage corresponding to row m occurs in the segment corresponding to column n;
where m = 1, 2, …, M and n = 1, 2, …, N.
In a further improvement of the invention, inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into a deep learning model to obtain the probability that each cancer stage is the subject stage comprises:
converting each word in each sentence of the electronic medical record into a word vector; inputting all words of a sentence into the corresponding sentence-level LSTM network, which outputs a semantic vector for the sentence; inputting the semantic vector of each sentence into a document-level LSTM network, which outputs a semantic vector for the whole document;
performing a dimension transformation on the cancer staging features through a first fully-connected layer, turning the matrix into a cancer staging feature vector;
and concatenating the semantic vector of the whole document with the cancer staging feature vector to obtain a total feature vector, inputting the total feature vector into a second fully-connected layer, and outputting the probability that each stage is the subject stage.
In a second aspect, the present invention provides a system for determining a subject cancer stage based on an electronic medical record, comprising:
the staging module is used for extracting the cancer staging information in the electronic medical record to be processed;
the segmentation module is used for segmenting the electronic medical record to be processed;
the cancer staging feature matrix construction module is used for constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
and the cancer stage determination module is used for inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into the deep learning model and obtaining the probability that each cancer stage is the subject stage.
In a third aspect, the present invention provides a computer program product, wherein instructions of the computer program product, when executed by a processor, implement the method for discriminating a subject cancer stage based on an electronic medical record.
In a fourth aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for determining a subject cancer stage based on an electronic medical record.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a method and a system for distinguishing subject cancer stages based on an electronic medical record, wherein the method comprises the following steps: extracting cancer stage information in the electronic medical record to be processed; segmenting the electronic medical record to be processed; constructing a cancer staging characteristic matrix by using the cancer staging information and the electronic medical record staging information; and inputting the text information of the electronic medical record to be processed and the cancer stage characteristic matrix into a deep learning model, and acquiring the probability of each cancer stage as a theme. According to the method, the specific information of the electronic medical records is deepened, the segmentation and stage information extraction is carried out, the extracted information is judged by using the deep learning model, the subject cancer stage probability in one electronic medical record can be accurately identified, and reliable technical support is provided for a doctor to carry out electronic medical record matching; the invention utilizes the electronic medical record segmentation information and improves the accuracy of judging the cancer subject stage.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic flow chart of a method for determining the stage of a subject cancer based on electronic medical records according to the present invention;
FIG. 2 is a schematic diagram of the structure of the deep learning model of the present invention;
FIG. 3 is a block diagram of the system for determining the stage of cancer based on electronic medical records according to the present invention;
FIG. 4 is a network architecture diagram of a staged verification model;
FIG. 5 is a network architecture diagram of a standardized model;
fig. 6 is a network architecture diagram of a deep learning model for electronic medical record segmentation.
Detailed Description
The present invention will be described in detail below with reference to the embodiments and the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
Some basic concepts of the invention:
(1) cancer stage information:
The stage of a cancer indicates its severity; generally, the higher the stage value, the greater the severity. Cancer stages are expressed in many ways, e.g. "Stage II cancer", "early cancer", "T2N2M0 cancer", etc., but these can all be normalized to standard stages. In the present invention, for convenience of description, it is agreed that all cancer stages are normalized to the four standard stages stage1, stage2, stage3 and stage4 (in practice, standard staging also includes refinements such as stage2A, but the number of standard stages does not affect the understanding of the invention).
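For illustration only (not part of the original disclosure), the following Python sketch shows how a small lookup of surface patterns might map varied staging expressions onto the four agreed standard stages. The pattern list and the mapping (e.g. treating T-categories and "early" as rough stand-ins for standard stages) are hypothetical simplifications, not clinically accurate rules; the learned standardization model actually used is described in step S13 below.

```python
import re
from typing import Optional

# Hypothetical lookup of surface forms; a real system would use a far richer
# lexicon or the learned standardization model described later (S13).
STAGE_PATTERNS = [
    (re.compile(r"stage\s*(?:IV|4)|T4", re.I), "stage4"),
    (re.compile(r"stage\s*(?:III|3)|T3", re.I), "stage3"),
    (re.compile(r"stage\s*(?:II|2)|T2", re.I), "stage2"),
    (re.compile(r"stage\s*(?:I|1)|T1|early", re.I), "stage1"),
]

def normalize_stage(expression: str) -> Optional[str]:
    """Return the standard stage for a raw staging expression, if any."""
    for pattern, standard in STAGE_PATTERNS:  # checked from stage4 down to stage1
        if pattern.search(expression):
            return standard
    return None

print(normalize_stage("Stage II cancer"))  # -> stage2
print(normalize_stage("T2N2M0 cancer"))    # -> stage2 (illustrative mapping only)
print(normalize_stage("early cancer"))     # -> stage1
```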
(2) Electronic medical record segmentation information:
A typical electronic medical record can be divided into paragraphs such as Background (abbreviated B), patient condition (People, abbreviated P), patient outcome (Result, abbreviated R) and summary (Conclusion, abbreviated C). These segments are implicit: they are not marked in the medical record and must be inferred by the reader. The paragraph information has important reference value for determining the subject cancer stage. If a cancer stage occurs in the R segment, it is likely to be the patient's final cancer stage after treatment and therefore part of the subject staging. If a cancer stage occurs in the B segment, it is likely to be the stage of another, similar patient rather than of the subject patient, and therefore not part of the subject staging. (Medical records often contain descriptions such as: "Endoscopic surgery is generally available only to patients with early stage gastric cancer, but a case in which endoscopic surgery successfully cured advanced gastric cancer is described herein"; here "early stage gastric cancer" is a non-subject stage.) In the invention, for convenience of description, it is agreed that an electronic medical record has only the four segments B, P, R and C (in practice there may also be segments such as Intervention or Treatment Method, but the number of segments does not affect the understanding of the invention).
Example 1
Referring to fig. 1, the present invention provides a method for determining a subject cancer stage based on an electronic medical record, comprising the following steps:
S1, extracting the cancer staging information in the electronic medical record to be processed;
S2, segmenting the electronic medical record to be processed;
S3, constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
and S4, inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into the deep learning model, and obtaining the probability that each cancer stage is the subject stage.
Example 2
More specifically, the invention provides a method for distinguishing a subject cancer stage based on an electronic medical record, which comprises the following steps:
S1, extracting the cancer staging information in the electronic medical record to be processed;
acquiring the text of the electronic medical record to be processed and extracting the cancer staging information in it; the extracted cancer staging information comprises the standardized stage (i.e. one of the four standard stages stage1, stage2, stage3 and stage4) and the position of the staging mention in the medical record text (e.g. characters 105 to 115 of an electronic medical record are staging information).
S2, segmenting the electronic medical record to be processed;
segmenting the electronic medical record to be processed and obtaining the segment label of each sentence in the record (i.e. to which of the four segments B, P, R and C each sentence belongs).
S3, constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
This step corresponds to the "staging information encoding" and "cancer staging feature matrix" parts in the bottom right corner of fig. 2. Through steps S1 and S2, the cancer staging information and the segment information of the electronic medical record have been acquired; because the staging information includes the position of each stage in the text, the segment of the medical record in which each stage occurs can be determined from this position information. Based on this information, the invention constructs a cancer staging feature matrix.
The electronic medical record has only the four segments B, P, R and C and only the standard stages stage1, stage2, stage3 and stage4. The invention therefore constructs a 4 x 4 matrix for each medical record, with the rows of the matrix corresponding to stages and the columns to segments. Assume that rows 1, 2, 3, 4 of the matrix correspond to stages 1, 2, 3, 4 respectively, and columns 1, 2, 3, 4 correspond to the B, P, R, C segments respectively. The value in row m and column n of the matrix is the number of times the stage corresponding to row m occurs in the segment corresponding to column n. For example, the value in row 2, column 3 is the number of times stage2 occurs in the R segment. Take the following medical record as an example:
"B: the removal of cancer at the gastrointestinal junction by esophageal endoscopic resection has been a difficult problem, is usually difficult to cure radically, and requires subsequent combined radiotherapy, and a case is introduced here.P: the patient was 72 years old, male. Patients suffering from self-reported diseasesStage IIStomach cancer later developing intoStage IIIGastric cancer. Confirmed diagnosis after examinationStage IIIGastric cancer. The hospital performs a transesophageal endoscopic resection on the patient as follows … …. R: because of poor physical condition of patients, only the cancerated part can be partially removed. After treatment, the patient is assessed to have a reduced stage of cancerStage IAnd then continuing chemoradiotherapy treatment. C: in this case, it can be seen that the effect of transesophageal endoscopic resection is greatly influenced by the physical condition of the patient, and how to eradicate cancer by the surgery will be an important issue in the future. "
B, P, R and C are labels added manually here for ease of understanding; no such labels appear in a normal medical record. In addition, the cancer staging mentions are the "Stage …" expressions in the text (underlined in the original). This medical record is a manually annotated example; the result that an actual computer obtains by extracting the cancer staging information and segmenting the electronic medical record is equivalent to the above, or may be represented in another text or data form.
Based on this medical record, the cancer staging feature matrix established after staging information extraction and segmentation is:
(0, 0, 0, 0)
(0, 1, 0, 0)
(0, 2, 1, 0)
(0, 0, 0, 0)
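For illustration only, the following Python sketch builds such a cancer staging feature matrix from the two outputs of steps S1 and S2, i.e. a list of (standard stage, character span) mentions and a per-sentence segment label. Function and variable names, as well as the character offsets in the usage example, are assumptions made for this sketch; the counts are chosen so that the result reproduces the 4 x 4 matrix shown above.

```python
STAGES = ["stage1", "stage2", "stage3", "stage4"]   # matrix rows
SEGMENTS = ["B", "P", "R", "C"]                     # matrix columns

def build_stage_feature_matrix(stagings, sentence_spans, sentence_labels):
    """stagings: list of (standard_stage, char_start, char_end) mentions.
    sentence_spans: (char_start, char_end) of each sentence in the record.
    sentence_labels: segment label ("B"/"P"/"R"/"C") of each sentence."""
    matrix = [[0] * len(SEGMENTS) for _ in STAGES]
    for stage, start, _end in stagings:
        # locate the sentence (hence the segment) containing this staging mention
        for (s_start, s_end), label in zip(sentence_spans, sentence_labels):
            if s_start <= start < s_end:
                matrix[STAGES.index(stage)][SEGMENTS.index(label)] += 1
                break
    return matrix

# Illustrative offsets only: stage2 once in P, stage3 twice in P and once in R.
matrix = build_stage_feature_matrix(
    stagings=[("stage2", 120, 128), ("stage3", 150, 159),
              ("stage3", 180, 189), ("stage3", 260, 269)],
    sentence_spans=[(0, 100), (100, 250), (250, 320), (320, 400)],
    sentence_labels=["B", "P", "R", "C"],
)
print(matrix)  # [[0, 0, 0, 0], [0, 1, 0, 0], [0, 2, 1, 0], [0, 0, 0, 0]]
```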
S4, inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into a deep learning model, and obtaining the probability that each cancer stage is the subject stage;
this step corresponds to the deep learning network of fig. 2, and is mainly divided into a left part and a right part.
S4.1, the left half of the network is a two-level Long Short-Term Memory (LSTM) network. The first level is a sentence-level LSTM network: for each sentence of the electronic medical record, each word in the sentence is converted into a word vector, and all the words of the sentence are then input into the corresponding sentence-level LSTM network, which outputs a semantic vector for the sentence. The second level is a document-level LSTM network: the outputs of all sentence-level LSTM networks, i.e. the semantic vectors of the sentences, are input into the document-level LSTM network, which outputs a semantic vector for the entire document.
S4.2, the right half of the network is a pre-trained first fully-connected layer that converts the cancer staging feature matrix into a cancer staging feature vector; through prior training it learns how to perform this conversion. The first fully-connected layer serves two purposes: first, it performs a dimension transformation on the cancer staging features, turning the matrix into a vector, which facilitates later computation; second, during training it performs a primary feature extraction on the cancer staging information, adding weight to the important features it discovers.
S4.3, the left half of the network yields the document semantic vector and the right half yields the cancer staging feature vector. In this step, the document semantic vector and the cancer staging feature vector are concatenated to obtain a total feature vector, which is then input into a pre-trained second fully-connected layer whose output is the probability that each stage is the subject stage.
Through its prior training, the second fully-connected layer learns how to determine from the total feature vector which stages are the subject stages.
Since only the four standard stages stage1, stage2, stage3 and stage4 were agreed upon earlier, the output of the fully-connected layer is a 4-dimensional vector; each dimension is a real number between 0 and 1 representing the probability that the corresponding standard stage is the subject stage. For example, a final output of (0.8, 0.7, 0.1, 0.2) indicates that stage1 has an 80% probability of being a subject stage, stage2 a 70% probability, stage3 a 10% probability and stage4 a 20% probability. A threshold can be set according to the specific business requirement, e.g. stages with a probability above 70% are regarded as subject stages.
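For illustration only, the following PyTorch sketch shows one possible reading of the network in fig. 2: a sentence-level LSTM followed by a document-level LSTM on the left, a first fully-connected layer that turns the 4 x 4 staging feature matrix into a staging feature vector on the right, and a second fully-connected layer over the concatenated total feature vector. All dimensions, class and variable names, and the use of a sigmoid output are assumptions of this sketch, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class SubjectStageModel(nn.Module):
    """Sketch of the fig. 2 network; sizes are illustrative assumptions."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, n_stages=4, n_segments=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.sent_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # sentence-level LSTM
        self.doc_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)    # document-level LSTM
        self.stage_fc = nn.Linear(n_stages * n_segments, 32)           # first fully-connected layer
        self.out_fc = nn.Linear(hid_dim + 32, n_stages)                # second fully-connected layer

    def forward(self, sentences, stage_matrix):
        # sentences: list of LongTensor word-id sequences, one tensor per sentence
        # stage_matrix: FloatTensor of shape (n_stages, n_segments)
        sent_vecs = []
        for word_ids in sentences:
            emb = self.embed(word_ids).unsqueeze(0)          # (1, n_words, emb_dim)
            _, (h, _) = self.sent_lstm(emb)
            sent_vecs.append(h[-1])                          # semantic vector of this sentence
        doc_in = torch.stack(sent_vecs, dim=1)               # (1, n_sentences, hid_dim)
        _, (h_doc, _) = self.doc_lstm(doc_in)
        doc_vec = h_doc[-1]                                  # semantic vector of the whole document
        stage_vec = torch.relu(self.stage_fc(stage_matrix.flatten().unsqueeze(0)))
        total = torch.cat([doc_vec, stage_vec], dim=-1)      # total feature vector
        return torch.sigmoid(self.out_fc(total))             # probability per standard stage

# Stages whose probability exceeds a business-defined threshold (e.g. 0.7)
# would then be treated as subject stages.
```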
Example 3
Referring to fig. 3, the present invention provides a system for determining a subject cancer stage based on an electronic medical record, comprising:
the staging module is used for extracting the cancer staging information in the electronic medical record to be processed;
the segmentation module is used for segmenting the electronic medical record to be processed;
the cancer staging feature matrix construction module is used for constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
and the cancer stage determination module is used for inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into the deep learning model and obtaining the probability that each cancer stage is the subject stage.
It should be noted that the specific processing performed on the electronic medical record by each functional module of the system for determining the subject cancer stage based on the electronic medical record is the same as in steps S1-S4 of embodiment 2 and is not described again here.
Example 4
In order to implement the above embodiments, the present invention further provides a computer program product, wherein instructions of the computer program product, when executed by a processor, implement the method for discriminating subject cancer stage based on electronic medical record according to embodiment 2.
Example 5
In order to achieve the above embodiments, the present invention further provides a non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for discriminating a subject cancer stage based on an electronic medical record according to embodiment 2.
Example 6
In the method for determining the subject cancer stage based on the electronic medical record, step S1 may specifically be implemented as follows:
s11, extracting candidate cancer stage character strings
In this step, a dictionary or regular expressions can be used to extract candidate cancer staging character strings, such as "Stage I" or "T2N2M0", from the text of the electronic medical record to be processed. This step is pure exact string matching without any analysis of semantic information, so some of the extracted candidate staging strings do not actually express cancer staging information; these are screened out in the following steps.
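For illustration only, the following Python sketch shows the kind of regular-expression matching this step describes. The two patterns cover only a couple of surface forms and are an assumption of this sketch; an actual implementation would combine a staging dictionary with a much larger pattern set.

```python
import re

# Illustrative patterns only; real systems combine a staging dictionary
# with far more surface forms.
CANDIDATE_PATTERN = re.compile(
    r"stage\s*(?:IV|III|II|I|[1-4])"     # e.g. "Stage I", "stage 3"
    r"|T[0-4]\s*N[0-3]\s*M[01]",         # e.g. "T2N2M0"
    re.IGNORECASE,
)

def extract_candidates(text):
    """Return (candidate string, start offset, end offset) for each match."""
    return [(m.group(0), m.start(), m.end()) for m in CANDIDATE_PATTERN.finditer(text)]

print(extract_candidates("A T2N2M0 patient, later confirmed as Stage III."))
# [('T2N2M0', 2, 8), ('Stage III', 37, 46)]
```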
S12, filtering erroneous cancer staging character strings
For each candidate cancer staging character string extracted in step S11, the staging verification model of fig. 4 is used to determine whether it really expresses cancer staging information.
The inputs to the staging verification model are the sentence in which the candidate cancer staging string occurs and the position of the candidate string within that sentence. Take the staging mention "T2 stage" in "A T2 stage patient with lung cancer" as an example: "T2 stage" covers the 1st and 2nd words of the sentence (word numbering starts from 0, i.e. "A" is the 0th word), so the position is (1, 2).
The first layer of the staged verification model is a feature extraction layer, which is divided into two parts.
(1) On the left is an LSTM network that takes the sentence "A T2 stage patient with lung cancer" as input, analyzes its semantic information, and outputs that information as a real-valued vector.
(2) On the right is a position-encoding layer, a simple mapping that turns two positive integers into a vector. If the input is the pair of natural numbers (a, b), the output of this layer is a 100-dimensional vector whose a-th to b-th dimensions are 1 and whose remaining dimensions are 0. For example, for (1, 2) the output vector is (0, 1, 1, 0, …, 0) with 97 trailing zeros; for (2, 4) the output vector is (0, 0, 1, 1, 1, 0, …, 0) with 95 trailing zeros. If an input sentence is longer than 100 words, it is cut into several sentences each shorter than 100 words.
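For illustration only, a minimal Python version of the position-encoding mapping described above (the function name is an assumption of this sketch):

```python
def position_encoding(a: int, b: int, length: int = 100) -> list:
    """Map word positions (a, b) onto a 0/1 indicator vector of fixed length.
    Sentences longer than `length` words are split beforehand, as described above."""
    vec = [0] * length
    for i in range(a, b + 1):
        vec[i] = 1
    return vec

assert position_encoding(1, 2)[:5] == [0, 1, 1, 0, 0]      # (1, 2) -> 0110...0
assert position_encoding(2, 4)[:6] == [0, 0, 1, 1, 1, 0]   # (2, 4) -> 001110...0
```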
The second layer of the staging verification model is a splicing layer, which simply concatenates the semantic vector of the sentence with the position vector of the staging information.
The third layer of the staging verification model is a scoring layer, structured as a fully-connected network. It takes the concatenated semantic vector and staging position vector as input and scores the candidate staging information. The output of this layer is a real number between 0 and 1 representing the probability that "T2 stage" is cancer staging information; if the probability is above a preset threshold, "T2 stage" is considered valid staging information.
The LSTM layer and the fully connected layer are basic modules for deep learning, and are not described herein.
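For illustration only, the following PyTorch sketch shows one way the staging verification model of fig. 4 could be assembled from these building blocks: an LSTM over the sentence, the position-encoding vector, a splicing (concatenation) step and a fully-connected scoring layer. Class names and dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class StagingVerifier(nn.Module):
    """Sketch of the fig. 4 staging verification model; sizes are assumed."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, pos_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)   # sentence semantics
        self.scorer = nn.Linear(hid_dim + pos_dim, 1)             # scoring layer

    def forward(self, word_ids, pos_vec):
        # word_ids: (1, n_words) word ids of the sentence
        # pos_vec:  (1, 100) 0/1 indicator of the candidate staging span
        _, (h, _) = self.lstm(self.embed(word_ids))
        features = torch.cat([h[-1], pos_vec], dim=-1)            # splicing layer
        return torch.sigmoid(self.scorer(features))               # P(candidate is real staging info)
```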
S13, cancer staging standardization
The valid cancer staging character strings selected in step S12 are then standardized. Standardization is a multi-label classification task. Assume there are only the four classifications T1, T2, T3 and T4 (in reality there are around 50 classifications, and expressions such as stage1, stage2, stage3 and stage4 may also be used). Standardization simply determines which classifications a cancer staging string corresponds to. For example, "T2 stage" corresponds to the T2 classification, while "T1-3 cancer" corresponds to the T1, T2 and T3 classifications. The cancer staging strings are classified using the standardization model shown in fig. 5.
The input to the standardization model is a cancer staging string, and the output is the probability of each possible classification (under the above assumption, the probabilities of T1, T2, T3 and T4, i.e. a 4-dimensional vector, each dimension of which is a real number between 0 and 1). The character string "T2-4 stage" is used as an example below.
The first layer of the standardized model is a feature extraction layer, whose purpose is to extract semantic features of cancer stage strings. This layer can be divided into three sublayers:
(1) The first character-level convolutional neural network layer (Char CNN layer) takes the character sequence of the cancer staging string, i.e. "T, 2, -, 4, s, t, a, g, e", as input and extracts the shallow semantic features of the string.
(2) The self-attention layer weights the shallow semantic features of the cancer staging string to determine which features are important and which are not.
(3) And a second character-level convolutional neural network layer (Char CNN layer) which performs feature extraction again on the weighted primary semantic features to obtain high-level semantic features of the cancer stage character strings.
The second layer of the standardization model is a fully-connected layer (classification layer). It takes the high-level semantic features of the cancer staging string as input and, based on these features, computes the probability that the string belongs to each classification. For the example "T2-4 stage", a well-trained model outputs high probabilities for T2, T3 and T4 and a low probability for T1.
The Char CNN layer, the self-attention layer and the fully-connected layer are basic building blocks of deep learning and are not described further here.
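For illustration only, the following PyTorch sketch shows one way the fig. 5 standardization model could be composed from the three sublayers and the classification layer described above. Character-vocabulary size, embedding width, kernel size and the mean-pooling step before the classifier are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class StageNormalizer(nn.Module):
    """Sketch of the fig. 5 standardization model; sizes are assumed."""
    def __init__(self, n_chars=128, emb_dim=64, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.cnn1 = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)  # shallow features
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        self.cnn2 = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)  # high-level features
        self.fc = nn.Linear(emb_dim, n_classes)                            # classification layer

    def forward(self, char_ids):
        # char_ids: (1, n_string_chars), e.g. the characters of "T2-4stage"
        x = self.embed(char_ids)                                       # (1, chars, emb)
        x = torch.relu(self.cnn1(x.transpose(1, 2))).transpose(1, 2)   # first Char CNN layer
        x, _ = self.attn(x, x, x)                                      # self-attention re-weighting
        x = torch.relu(self.cnn2(x.transpose(1, 2))).transpose(1, 2)   # second Char CNN layer
        pooled = x.mean(dim=1)                                         # (1, emb)
        return torch.sigmoid(self.fc(pooled))                          # multi-label P(T1..T4)
```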
Example 7
The label data is structured label data adapted to the electronic medical records to be processed and contains the four label types B, P, R and C.
There are dependencies between the labels of an electronic medical record: the sequence of sentence labels follows a certain probability distribution rather than being random. For example, the labels of the first few sentences of an electronic medical record are generally background B, and the labels of the last few sentences are generally summary C. This probability distribution over sentence labels can be used to help filter noisy label data. Background (B), patient condition (People, P), patient outcome (Result, R) and summary (Conclusion, C) are recorded in the order B → P → R → C.
If the electronic medical record contains N sentences after sentence splitting, each sentence has a corresponding label, giving the label sequence (Label_1, …, Label_N), where Label_i is the label of the i-th sentence. It is checked whether (Label_1, …, Label_N) conforms to the maximum-probability label order; if so, the data sample is retained, otherwise the sample is treated as noisy, erroneous data and deleted. The maximum-probability order of the four classes is B → P → R → C.
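For illustration only, a minimal Python reading of the filtering rule just described: a sample is kept only if its sentence labels never move backwards in the B → P → R → C order. Function and variable names are assumptions of this sketch.

```python
SEGMENT_ORDER = {"B": 0, "P": 1, "R": 2, "C": 3}

def follows_bprc_order(labels):
    """True if the sentence labels are non-decreasing in the B -> P -> R -> C order."""
    ranks = [SEGMENT_ORDER[label] for label in labels]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))

print(follows_bprc_order(["B", "B", "P", "P", "R", "C"]))  # True  -> keep the sample
print(follows_bprc_order(["B", "R", "P", "C"]))            # False -> treat as noise and drop
```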
In the method for determining the subject cancer stage based on an electronic medical record, step S2 segments the electronic medical record to be processed on the basis of a constructed hierarchical LSTM-CRF + ATT network model:
the main structure of the deep learning model for segmentation in the electronic medical record related to the invention is shown in fig. 6. The data is a sentence sequence obtained by separating sentences of the electronic medical record. Firstly, the Representation (Representation) of the sentence is learned through Bi-LSTM (bidirectional LSTM) + Attention mechanism for each sentence, then the sentence sequence is input into the subsequent BiLSTM to obtain the sentence sequence Representation, and then the labels of the sentence sequence are obtained through a CRF layer.
(1) Each sentence is first split into words, which are fed into a word embedding layer (Token Embedding layer) and converted into word vectors e1, e2, …, en (the word vectors carry word-level semantic information).
(2) The word vectors of the sentence are input into a bidirectional recurrent neural network (Bi-LSTM layer) to obtain hidden vectors h1, h2, …, hn (each hidden vector carries part of the sentence information).
(3) An attention mechanism is applied to the hidden vectors to obtain the representation vector s of the current sentence.
(4) Operations 1-3 are applied in turn to all sentences of the electronic medical record, yielding the sequence of representation vectors s1, s2, …, sm of all sentences in the record.
(5) The sentence representation vector sequence s1, s2, …, sm is input into a subsequent bidirectional recurrent neural network (Bi-LSTM layer) to obtain the corresponding hidden vectors h1, h2, …, hm.
(6) Finally, a CRF (conditional random field) layer outputs the predicted label sequence y1, y2, …, ym for all sentences in the electronic medical record.
The main innovation of this model is that, whereas most existing models use only a simple RNN, the invention uses a hierarchical BiLSTM + ATT model: the sentence representation part uses an attention mechanism, and the resulting sentence sequence is passed through a BiLSTM again. The input of the later BiLSTM layer is the whole sequence of sentences in an electronic medical record, while the input of the earlier BiLSTM layer in the sentence representation part is the sequence of words within a sentence. The model can therefore mine deep information within sentences as well as semantic information between sentences, improving the model's effectiveness.
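For illustration only, the following PyTorch sketch shows the hierarchical structure just described: a word-level BiLSTM with an attention pooling step producing one representation per sentence, followed by a sentence-level BiLSTM. The CRF output layer is deliberately replaced by a plain linear emission layer to keep the sketch self-contained; the model of fig. 6 decodes the final label sequence with a CRF on top of these emissions. All names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HierSegmenter(nn.Module):
    """Sketch of the fig. 6 hierarchical BiLSTM + attention segmenter.
    A CRF layer (omitted here) would decode the final B/P/R/C label sequence."""
    def __init__(self, vocab_size, emb_dim=100, hid_dim=100, n_labels=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)                     # attention scores over words
        self.sent_lstm = nn.LSTM(2 * hid_dim, hid_dim, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hid_dim, n_labels)         # per-sentence label scores

    def forward(self, sentences):
        # sentences: list of LongTensor word-id sequences, one tensor per sentence
        sent_vecs = []
        for word_ids in sentences:
            h, _ = self.word_lstm(self.embed(word_ids).unsqueeze(0))   # (1, n_words, 2*hid)
            weights = torch.softmax(self.attn(h), dim=1)               # attention over words
            sent_vecs.append((weights * h).sum(dim=1))                 # sentence representation s_i
        seq = torch.stack(sent_vecs, dim=1)                            # (1, n_sentences, 2*hid)
        h_seq, _ = self.sent_lstm(seq)
        return self.emissions(h_seq)                                   # scores fed to the CRF layer
```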
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A method for distinguishing a subject cancer stage based on an electronic medical record is characterized by comprising the following steps:
extracting cancer stage information in the electronic medical record to be processed;
segmenting the electronic medical record to be processed;
constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
and inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into a deep learning model to obtain the probability that each cancer stage is the subject stage.
2. The method for distinguishing the subject cancer stage based on an electronic medical record as claimed in claim 1, wherein the cancer staging information comprises at least the four standard stages stage1, stage2, stage3 and stage4, and the position of the cancer staging information in the text of the electronic medical record; segmenting the electronic medical record to be processed yields a segment label for each sentence in the electronic medical record; the segment labels comprise at least the four segments B, P, R and C, wherein B is the background segment, P is the patient condition segment, R is the patient outcome segment, and C is the summary segment.
3. The method for determining the subject cancer stage based on the electronic medical record as claimed in claim 1, wherein the step of extracting the cancer stage information in the electronic medical record to be processed specifically comprises:
extracting candidate cancer stage character strings;
filtering the wrong cancer staging string;
and standardizing the cancer stages after filtering the wrong cancer stage character strings to obtain final cancer stage information.
4. The method as claimed in claim 3, wherein the step of standardizing the cancer stages after filtering the wrong cancer stage character strings comprises:
inputting the character sequence of the cancer stage character string into a first character-level convolutional neural network layer, and extracting the shallow semantic features of the character string;
inputting the shallow semantic features of the character string into a self-attention layer, which weights the shallow semantic features of the cancer staging character string to obtain weighted primary semantic features;
inputting the weighted primary semantic features into a second character-level convolutional neural network layer, and performing feature extraction again to obtain high-level semantic features of the cancer stage character strings;
and inputting the high-level semantic features of the cancer stage character strings into a full-connection layer, and calculating the probability that the cancer stage character strings belong to each cancer stage.
5. The method for discriminating the subject cancer stage based on the electronic medical record as claimed in claim 1, wherein the step of segmenting the electronic medical record to be processed specifically comprises:
splitting the electronic medical record into sentences, splitting each sentence into words, and feeding the words into a word embedding layer to convert them into word vectors e1, e2, …, en;
inputting the word vectors of the sentence into a bidirectional recurrent neural network to obtain hidden vectors h1, h2, …, hn;
applying an attention mechanism to the hidden vectors to obtain the representation vector s of the current sentence;
obtaining the sequence of representation vectors s1, s2, …, sm of all sentences in the electronic medical record;
inputting the sentence representation vector sequence s1, s2, …, sm into a subsequent bidirectional recurrent neural network to obtain the corresponding hidden vectors h1, h2, …, hm;
and passing the hidden vectors h1, h2, …, hm through a conditional random field layer to output the predicted label sequence y1, y2, …, ym of all sentences in the electronic medical record.
6. The method for determining the subject cancer stage based on the electronic medical record as claimed in claim 1, wherein the step of constructing the cancer stage feature matrix by using the cancer stage information and the segmentation information of the electronic medical record comprises:
establishing a cancer staging feature matrix for the electronic medical record, wherein the rows of the matrix correspond to stages and the columns correspond to segments; there are M stage types and N segment types; the value in row m and column n of the matrix is the number of times the stage corresponding to row m occurs in the segment corresponding to column n;
wherein m = 1, 2, …, M and n = 1, 2, …, N.
7. The method for distinguishing the subject cancer stage based on an electronic medical record as claimed in claim 1, wherein inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into a deep learning model to obtain the probability that each cancer stage is the subject stage comprises:
converting each word in each sentence of the electronic medical record into a word vector; inputting all words of a sentence into the corresponding sentence-level LSTM network, which outputs a semantic vector for the sentence; inputting the semantic vector of each sentence into a document-level LSTM network, which outputs a semantic vector for the whole document;
performing a dimension transformation on the cancer staging features through a first fully-connected layer, turning the matrix into a cancer staging feature vector;
and concatenating the semantic vector of the whole document with the cancer staging feature vector to obtain a total feature vector, inputting the total feature vector into a second fully-connected layer, and outputting the probability that each stage is the subject stage.
8. A system for determining a subject cancer stage based on an electronic medical record, comprising:
the staging module is used for extracting the cancer staging information in the electronic medical record to be processed;
the segmentation module is used for segmenting the electronic medical record to be processed;
the cancer staging feature matrix construction module is used for constructing a cancer staging feature matrix from the cancer staging information and the segmentation information of the electronic medical record;
and the cancer stage determination module is used for inputting the text of the electronic medical record to be processed and the cancer staging feature matrix into the deep learning model and obtaining the probability that each cancer stage is the subject stage.
9. A computer program product, wherein instructions in the computer program product, when executed by a processor, implement a method for discriminating subject cancer stage based on electronic medical records according to any of claims 1 to 6.
10. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for discriminating subject cancer stages based on electronic medical records according to any of claims 1 to 6.
CN202011416351.2A 2020-12-04 2020-12-04 Method and system for distinguishing subject cancer stages based on electronic medical record Active CN112530534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011416351.2A CN112530534B (en) 2020-12-04 2020-12-04 Method and system for distinguishing subject cancer stages based on electronic medical record

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011416351.2A CN112530534B (en) 2020-12-04 2020-12-04 Method and system for distinguishing subject cancer stages based on electronic medical record

Publications (2)

Publication Number Publication Date
CN112530534A (en) 2021-03-19
CN112530534B CN112530534B (en) 2023-02-07

Family

ID=74997840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011416351.2A Active CN112530534B (en) 2020-12-04 2020-12-04 Method and system for distinguishing subject cancer stages based on electronic medical record

Country Status (1)

Country Link
CN (1) CN112530534B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578798A (en) * 2017-10-26 2018-01-12 北京康夫子科技有限公司 The processing method and system of electronic health record
CN110459282A (en) * 2019-07-11 2019-11-15 新华三大数据技术有限公司 Sequence labelling model training method, electronic health record processing method and relevant apparatus
CN111312392A (en) * 2020-03-13 2020-06-19 中南大学 Prostate cancer auxiliary analysis method and device based on integration method and electronic equipment
US20200342056A1 (en) * 2019-04-26 2020-10-29 Tencent America LLC Method and apparatus for natural language processing of medical text in chinese
CN111967261A (en) * 2020-10-20 2020-11-20 平安科技(深圳)有限公司 Cancer stage information processing method, device and storage medium


Also Published As

Publication number Publication date
CN112530534B (en) 2023-02-07

Similar Documents

Publication Publication Date Title
CN110032648B (en) Medical record structured analysis method based on medical field entity
CN109871545B (en) Named entity identification method and device
US10929420B2 (en) Structured report data from a medical text report
CN112256828B (en) Medical entity relation extraction method, device, computer equipment and readable storage medium
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN110335653B (en) Non-standard medical record analysis method based on openEHR medical record format
CN108959418A (en) Character relation extraction method and device, computer device and computer readable storage medium
CN112287680B (en) Entity extraction method, device and equipment of inquiry information and storage medium
CN112151183A (en) Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN111259111B (en) Medical record-based decision-making assisting method and device, electronic equipment and storage medium
CN114564959A (en) Method and system for identifying fine-grained named entities of Chinese clinical phenotype
CN111191456A (en) Method for identifying text segmentation by using sequence label
CN110991185A (en) Method and device for extracting attributes of entities in article
CN107122582B (en) diagnosis and treatment entity identification method and device facing multiple data sources
CN114913953A (en) Medical entity relationship identification method and device, electronic equipment and storage medium
Adduru et al. Towards Dataset Creation And Establishing Baselines for Sentence-level Neural Clinical Paraphrase Generation and Simplification.
CN117217233A (en) Text correction and text correction model training method and device
CN114662477A (en) Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium
CN113111660A (en) Data processing method, device, equipment and storage medium
CN112530534B (en) Method and system for distinguishing subject cancer stages based on electronic medical record
CN116910251A (en) Text classification method, device, equipment and medium based on BERT model
CN116719840A (en) Medical information pushing method based on post-medical-record structured processing
CN115455969A (en) Medical text named entity recognition method, device, equipment and storage medium
CN115114437A (en) Gastroscope text classification system based on BERT and double-branch network
US20230140480A1 (en) Utterance generation apparatus, utterance generation method, and program

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant