CN117874235B - Data processing system for acquiring disease name identification of electronic medical record text - Google Patents

Info

Publication number
CN117874235B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410108326.XA
Other languages
Chinese (zh)
Other versions
CN117874235A (en)
Inventor
王志鹏
王军江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qidian Zhibao Beijing Technology Co ltd
Original Assignee
Qidian Zhibao Beijing Technology Co ltd
Application filed by Qidian Zhibao Beijing Technology Co ltd
Priority to CN202410108326.XA
Publication of CN117874235A
Application granted
Publication of CN117874235B

Landscapes

  • Medical Treatment And Welfare Office Work (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of data processing, and in particular to a data processing system for acquiring disease name identifiers of electronic medical record texts. When the computer program is executed by a processor, the following steps are implemented: based on the similarity between the word vectors of the initial keywords and the word vectors of the main keywords corresponding to the main disease names, a first candidate disease name list is screened out and de-duplicated to obtain a first intermediate disease name list; a first disease name identifier list is then obtained by combining the similarity between the initial keywords and the main keywords with the occurrence frequency of the initial keywords in the medical record text, which improves the accuracy of the first disease name identifiers. The word vectors of the initial keywords, the first disease name identifier list, the first secondary disease name information list, and the second secondary disease name information list are then combined to acquire the target medical record name identifier corresponding to the electronic medical record text, which improves the accuracy of the target medical record name identifier.

Description

Data processing system for acquiring disease name identification of electronic medical record text
Technical Field
The invention relates to the field of data processing, in particular to a data processing system for acquiring disease name identification of an electronic medical record text.
Background
DRG (Diagnosis Related Groups) is a classification and coding standard used for medical insurance prepayment systems. Patients are classified into diagnosis related groups according to age, sex, length of hospital stay, clinical diagnosis, symptoms, surgery, disease severity, complications, and outcomes; each group is then measured and costed, and paid under a prepaid quota. DRG can therefore help hospitals improve lean operations management and performance management, and has broad application prospects in the medical field.
In the medical field, the DRG catalog can be divided into three levels: primary orders, sub-orders, and details. The primary order, sub-order, and detail corresponding to a patient's medical record text are each obtained from the similarity between the medical record text and the primary order, sub-order, and detail names, and the detailed disease name identifier is then obtained by combining them. Because the DRG catalog contains many primary orders, each primary order contains many sub-orders, and each sub-order contains many details, identifying the disease name is complex, and existing methods that rely only on the similarity between the medical record text and the primary order, sub-order, and detail names obtain the disease name identifier with low accuracy.
Therefore, how to improve the accuracy of acquiring the disease name identifier of an electronic medical record text has become an urgent problem to be solved.
Disclosure of Invention
To address the above technical problem, the technical solution adopted by the invention is a data processing system for acquiring disease name identifiers of electronic medical record text. The system comprises a processor and a memory storing a computer program. The memory also stores an initial keyword vector set $A^0=\{A^0_1,A^0_2,\ldots,A^0_i,\ldots,A^0_m\}$ of the electronic medical record text, a main disease name information list $L^0$, a first secondary disease name information list $L^1$, and a second secondary disease name information list $L^2$, where $A^0_i$ is the word vector of the $i$-th initial keyword of the electronic medical record text; $L^0$ comprises $n$ preset main disease names, the main disease name identifier corresponding to each main disease name, and a main keyword word vector set $B^0=\{B^0_1,B^0_2,\ldots,B^0_j,\ldots,B^0_n\}$, where $B^0_j$ is the word vector of the main keyword corresponding to the $j$-th main disease name; $i=1,2,\ldots,m$, where $m$ is the total number of initial keywords of the electronic medical record text; and $j=1,2,\ldots,n$. When the computer program is executed by the processor, the following steps are implemented:
S100, according to $A^0$ and $B^0$, obtaining a main disease name similarity set $C^0=\{C^0_1,C^0_2,\ldots,C^0_i,\ldots,C^0_m\}$ corresponding to $A^0$, where $C^0_i=\{C^0_{i1},C^0_{i2},\ldots,C^0_{ij},\ldots,C^0_{in}\}$ and $C^0_{ij}$ is the similarity between $A^0_i$ and $B^0_j$.
S200, taking the main disease name corresponding to each $C^0_{ij}$ satisfying $C^0_{ij}>\Delta C^0$ as a first candidate disease name, and obtaining a first candidate disease name list $D^0$, where $\Delta C^0$ is a first preset threshold.
S300, de-duplicating $D^0$ to obtain a first intermediate disease name list $D^1=\{D^1_1,D^1_2,\ldots,D^1_k,\ldots,D^1_t\}$, where $D^1_k$ is the $k$-th first intermediate disease name, $k=1,2,\ldots,t$, and $t$ is the total number of first intermediate disease names.
S400, according to $C^0$ and $D^1$, obtaining a main disease name intermediate similarity set $C^1=\{C^1_1,C^1_2,\ldots,C^1_k,\ldots,C^1_t\}$ corresponding to $D^1$, where $C^1_k=\{C^1_{k1},C^1_{k2},\ldots,C^1_{kx},\ldots,C^1_{kr(k)}\}$, $C^1_{kx}$ is the $x$-th main disease name similarity in $C^0$ that corresponds to the main disease name of $D^1_k$ and is greater than $\Delta C^0$, $x=1,2,\ldots,r(k)$, and $r(k)$ is the total number of such similarities.
S500, according to $C^1$ and $A^0$, obtaining a first occurrence frequency set $E^1=\{E^1_1,E^1_2,\ldots,E^1_k,\ldots,E^1_t\}$ corresponding to $D^1$, where $E^1_k=\{E^1_{k1},E^1_{k2},\ldots,E^1_{kx},\ldots,E^1_{kr(k)}\}$, $E^1_{kx}=Q^1_{kx}/m$, and $Q^1_{kx}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^1_{kx}$.
S600, according to $C^1$ and $E^1$, obtaining a first selection probability set $S^1=\{S^1_1,S^1_2,\ldots,S^1_k,\ldots,S^1_t\}$ corresponding to $D^1$, where the first selection probability corresponding to $D^1_k$ is $S^1_k=\left(\sum_{x=1}^{r(k)}E^1_{kx}\times C^1_{kx}\right)/r(k)$.
S700, taking the main disease name identifier corresponding to each first intermediate disease name whose $S^1_k$ satisfies $S^1_k>\Delta S^1$ as a first disease name identifier, and obtaining a first disease name identifier list $W^1$, where $\Delta S^1$ is a second preset threshold.
S800, according to $A^0$, $W^1$, $L^1$, and $L^2$, obtaining a target medical record name identifier corresponding to the electronic medical record text.
Compared with the prior art, the data processing system for acquiring disease name identifiers of electronic medical record text provided by the invention achieves considerable technical progress and practicality, has broad industrial application value, and has at least the following beneficial effects. A first candidate disease name list is screened out based on the similarity between the word vectors of the initial keywords and the word vectors of the main keywords corresponding to the main disease names, and is de-duplicated to obtain a first intermediate disease name list. The main disease name intermediate similarity set and the first occurrence frequency set corresponding to the first intermediate disease name list are then obtained, so that the probability that each first intermediate disease name is selected as a first disease name identifier is characterized by combining the similarity between the initial keywords and the main keywords with the occurrence frequency of the initial keywords in the medical record text. The first disease name identifier list is obtained on this basis, which improves the accuracy of the first disease name identifiers. The word vectors of the initial keywords, the first disease name identifier list, the first secondary disease name information list, and the second secondary disease name information list are then combined to acquire the target medical record name identifier corresponding to the electronic medical record text, which improves the accuracy of the target medical record name identifier.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an executing computer program of a data processing system for obtaining a disease name identifier of an electronic medical record text according to an embodiment of the present invention;
FIG. 2 is another flowchart of an executing computer program of a data processing system for obtaining a disease name identifier of an electronic medical record text according to an embodiment of the present invention;
FIG. 3 is another flowchart of an executing computer program of a data processing system for obtaining a disease name identifier of an electronic medical record text according to an embodiment of the present invention;
FIG. 4 is another flowchart of an execution computer program of a data processing system for obtaining a disease name identifier of an electronic medical record text according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The first embodiment provides a data processing system for acquiring disease name identifiers of electronic medical record text. The system includes a processor and a memory storing a computer program. The memory further stores an initial keyword vector set $A^0=\{A^0_1,A^0_2,\ldots,A^0_i,\ldots,A^0_m\}$ of the electronic medical record text, a main disease name information list $L^0$, a first secondary disease name information list $L^1$, and a second secondary disease name information list $L^2$, where $A^0_i$ is the word vector of the $i$-th initial keyword of the electronic medical record text; $L^0$ includes $n$ preset main disease names, the main disease name identifier corresponding to each main disease name, and a main keyword word vector set $B^0=\{B^0_1,B^0_2,\ldots,B^0_j,\ldots,B^0_n\}$, where $B^0_j$ is the word vector of the main keyword corresponding to the $j$-th main disease name; $i=1,2,\ldots,m$, where $m$ is the total number of initial keywords of the electronic medical record text; and $j=1,2,\ldots,n$. When the computer program is executed by the processor, the following steps are implemented, as shown in FIG. 1:
S100, according to $A^0$ and $B^0$, obtaining a main disease name similarity set $C^0=\{C^0_1,C^0_2,\ldots,C^0_i,\ldots,C^0_m\}$ corresponding to $A^0$, where $C^0_i=\{C^0_{i1},C^0_{i2},\ldots,C^0_{ij},\ldots,C^0_{in}\}$ and $C^0_{ij}$ is the similarity between $A^0_i$ and $B^0_j$.
The initial keywords may be obtained by applying a keyword extraction algorithm to the electronic medical record text. A main disease name may refer to a disease name corresponding to a primary order, a first secondary disease name to a disease name corresponding to a sub-order, and a second secondary disease name to a disease name corresponding to a detail. The main disease name identifier is the unique identifier of a main disease name, and the main keyword corresponding to a main disease name may be obtained by applying a keyword extraction algorithm to that main disease name. Those skilled in the art will appreciate that any keyword extraction algorithm in the prior art falls within the protection scope of the present invention, so the details are not repeated here.
The higher the similarity between an initial keyword and a main keyword, the higher the probability that the main disease name identifier corresponding to that main keyword is a first disease name identifier of the electronic medical record text. Therefore, the similarity between each initial keyword and each main keyword is obtained first, yielding the main disease name similarity set $C^0$ as a basis for determining the first disease name identifier. Those skilled in the art will appreciate that any similarity calculation method in the prior art falls within the protection scope of the present invention, so the details are not repeated here.
The similarity between the initial keywords and the main keywords thus characterizes how likely each main disease name identifier is to be a first disease name identifier of the electronic medical record text, which provides a reliable data basis for determining the first disease name identifier.
In a specific embodiment, the memory further stores the electronic medical record text and the main keyword corresponding to each main disease name, and $A^0$ and $B^0$ are obtained through the following steps:
S10, acquiring a target keyword set $A=\{A_1,A_2,\ldots,A_i,\ldots,A_m\}$ of the electronic medical record text, where $A_i$ is the $i$-th target keyword;
S20, inputting $A$ into a word vector model to acquire the target keyword vector set $A^0=\{A^0_1,A^0_2,\ldots,A^0_i,\ldots,A^0_m\}$ corresponding to $A$, where $A^0_i$ is the target keyword vector corresponding to $A_i$;
S30, inputting the main keyword corresponding to each preset main disease name into the word vector model to obtain the preset main disease name vector set $B^0=\{B^0_1,B^0_2,\ldots,B^0_j,\ldots,B^0_n\}$.
Those skilled in the art will appreciate that any word vector model in the prior art falls within the protection scope of the present invention, so the details are not repeated here.
In a specific embodiment, $C^0_{ij}=(A^0_i\cdot B^0_j)/(\|A^0_i\|\times\|B^0_j\|)$, where $\|A^0_i\|$ is the modulus of $A^0_i$ and $\|B^0_j\|$ is the modulus of $B^0_j$.
The degree of similarity between two vectors can be expressed by the cosine of the angle between them, so in this embodiment the cosine similarity between $A^0_i$ and $B^0_j$ is calculated and taken as $C^0_{ij}$, the similarity between $A^0_i$ and $B^0_j$.
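The cosine similarity of S100 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy two-dimensional vectors, and the choice to return 0.0 for a zero vector are assumptions made here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)

def similarity_matrix(A0, B0):
    """Return C0, where C0[i][j] is the similarity between A0[i] and B0[j]."""
    return [[cosine_similarity(a, b) for b in B0] for a in A0]

# Toy word vectors (illustrative values only).
A0 = [[1.0, 0.0], [1.0, 1.0]]   # m = 2 initial keyword vectors
B0 = [[1.0, 0.0], [0.0, 1.0]]   # n = 2 main keyword vectors
C0 = similarity_matrix(A0, B0)
```

Each row of `C0` corresponds to one initial keyword and each column to one main disease name, matching the indexing of $C^0_{ij}$.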
S200, taking the main disease name corresponding to each $C^0_{ij}$ satisfying $C^0_{ij}>\Delta C^0$ as a first candidate disease name, and obtaining a first candidate disease name list $D^0$, where $\Delta C^0$ is a first preset threshold.
When $C^0_{ij}>\Delta C^0$, the electronic medical record text is relatively similar to the main disease name corresponding to $C^0_{ij}$, and the identifier of that main disease name is more likely to be a first disease name identifier of the electronic medical record text. In this embodiment, therefore, the main disease names corresponding to the $C^0_{ij}$ satisfying $C^0_{ij}>\Delta C^0$ are screened out as first candidate disease names, and the first candidate disease name list $D^0$ is obtained as a basis for determining the first disease name identifier.
The specific value of $\Delta C^0$ may be set by the practitioner according to the actual situation.
In this way, comparing each main disease name similarity with the first preset threshold selects, from all main disease names, those most similar to the electronic medical record text, which narrows the range from which the first disease name identifier is obtained and improves its accuracy.
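The threshold screening of S200 can be sketched as below. The similarity values, the disease name placeholders, and the threshold 0.6 are hypothetical; the patent leaves the threshold to the practitioner.

```python
def first_candidate_names(C0, main_names, delta_c0):
    """Return D0: one main disease name per similarity C0[i][j] above delta_c0."""
    D0 = []
    for row in C0:                      # one row per initial keyword
        for j, sim in enumerate(row):   # one column per main disease name
            if sim > delta_c0:          # strict inequality, as in S200
                D0.append(main_names[j])
    return D0

# Hypothetical similarities for 2 keywords and 2 main disease names.
C0 = [[0.9, 0.2], [0.8, 0.7]]
main_names = ["name_X", "name_Y"]
D0 = first_candidate_names(C0, main_names, delta_c0=0.6)
```

Here `name_X` appears twice in `D0` because two different keywords exceed the threshold for it, which is the repetition that the following de-duplication step removes.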
S300, de-duplicating $D^0$ to obtain a first intermediate disease name list $D^1=\{D^1_1,D^1_2,\ldots,D^1_k,\ldots,D^1_t\}$, where $D^1_k$ is the $k$-th first intermediate disease name, $k=1,2,\ldots,t$, and $t$ is the total number of first intermediate disease names.
Each initial keyword may appear one or more times in the electronic medical record text, and the main disease names whose similarities to different initial keywords exceed $\Delta C^0$ may be identical, so each first intermediate disease name may appear one or more times in $D^0$.
To avoid repeated calculation, reduce memory usage, and improve the efficiency of obtaining the first disease name identifier, $D^0$ is de-duplicated to obtain the first intermediate disease name list $D^1$, which serves as a basis for determining the first disease name identifier.
In this way, de-duplicating $D^0$ to obtain the first intermediate disease name list $D^1$ avoids repeated calculation on duplicate data, further narrows the range from which the first disease name identifier is obtained, and improves the efficiency of obtaining it.
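One simple way to de-duplicate while keeping the first-seen order is shown below. Preserving order is an assumption made here for readability; the text does not prescribe any ordering for the de-duplicated list.

```python
def deduplicate(D0):
    """Return D1: the names of D0 with duplicates removed, first-seen order kept."""
    return list(dict.fromkeys(D0))  # dict keys are unique and insertion-ordered

D0 = ["name_X", "name_X", "name_Y"]   # hypothetical candidate list
D1 = deduplicate(D0)
t = len(D1)                            # total number of first intermediate names
```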
S400, according to $C^0$ and $D^1$, obtaining a main disease name intermediate similarity set $C^1=\{C^1_1,C^1_2,\ldots,C^1_k,\ldots,C^1_t\}$ corresponding to $D^1$, where $C^1_k=\{C^1_{k1},C^1_{k2},\ldots,C^1_{kx},\ldots,C^1_{kr(k)}\}$, $C^1_{kx}$ is the $x$-th main disease name similarity in $C^0$ that corresponds to the main disease name of $D^1_k$ and is greater than $\Delta C^0$, $x=1,2,\ldots,r(k)$, and $r(k)$ is the total number of such similarities.
Among all main disease name similarities in $C^0$ that correspond to $D^1_k$, the $r(k)$ similarities greater than $\Delta C^0$ are screened out to form $C^1_k$. $C^1_k$ thus represents how similar $D^1_k$ is to the electronic medical record text, and provides a data basis for further screening out the first disease name identifier of the electronic medical record text.
It will be appreciated that the higher the main disease name intermediate similarities corresponding to $D^1_k$, the higher the probability that $D^1_k$ is selected as a first disease name identifier.
In this way, the main disease name intermediate similarities selected from $C^0$ for each first intermediate disease name characterize the probability that it is selected as a first disease name identifier, so that all first intermediate disease names can be further screened based on $C^1$ to obtain the first disease name identifier list, which improves the accuracy of the first disease name identifiers.
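A sketch of S400 under stated assumptions: each entry of `C1[k]` is stored as a pair `(i, similarity)`, where `i` is the index of the initial keyword that produced the similarity. Keeping `i` is a bookkeeping choice made here so that the keyword behind each similarity can be looked up later; the text only defines the similarity values themselves.

```python
def intermediate_similarities(C0, main_names, D1, delta_c0):
    """Return C1: for each name in D1, the (keyword index, similarity) pairs
    taken from its column of C0 that exceed delta_c0."""
    C1 = []
    for name in D1:
        j = main_names.index(name)   # column of this main disease name in C0
        C1.append([(i, row[j]) for i, row in enumerate(C0) if row[j] > delta_c0])
    return C1

# Hypothetical inputs.
C0 = [[0.9, 0.2], [0.8, 0.7]]
main_names = ["name_X", "name_Y"]
D1 = ["name_X", "name_Y"]
C1 = intermediate_similarities(C0, main_names, D1, delta_c0=0.6)
r = [len(group) for group in C1]     # r(k) for each first intermediate name
```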
S500, according to $C^1$ and $A^0$, obtaining a first occurrence frequency set $E^1=\{E^1_1,E^1_2,\ldots,E^1_k,\ldots,E^1_t\}$ corresponding to $D^1$, where $E^1_k=\{E^1_{k1},E^1_{k2},\ldots,E^1_{kx},\ldots,E^1_{kr(k)}\}$, $E^1_{kx}=Q^1_{kx}/m$, and $Q^1_{kx}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^1_{kx}$.
The more often an initial keyword occurs in the medical record text, the more important it is for obtaining the first disease name identifier. The first occurrence frequency $E^1_{kx}$ corresponding to $C^1_{kx}$ is therefore obtained as the ratio of the number of occurrences $Q^1_{kx}$ in $A^0$ of the word vector of the initial keyword corresponding to $C^1_{kx}$ to the total number $m$ of initial keywords, so that $Q^1_{kx}$ contributes to characterizing the probability that each first intermediate disease name is selected as a first disease name identifier.
In this way, the number of occurrences of the initial keywords in the medical record text serves as a basis for the probability that each first intermediate disease name is selected as a first disease name identifier, so that all first intermediate disease names can be further screened based on $E^1$ to obtain the first disease name identifier list, which improves the accuracy of the first disease name identifiers.
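The occurrence frequency $E^1_{kx}=Q^1_{kx}/m$ can be sketched as follows, assuming (as an illustration only) that `A0` holds one vector per keyword occurrence, that identical keywords yield identical vectors, and that `C1` carries `(keyword index, similarity)` pairs.

```python
def occurrence_frequencies(A0, C1):
    """Return E1, where E1[k][x] = Q / m and Q counts how often the word vector
    of the keyword behind C1[k][x] occurs in A0."""
    m = len(A0)
    vectors = [tuple(v) for v in A0]   # tuples so vectors can be compared and counted
    return [[vectors.count(vectors[i]) / m for i, _sim in group] for group in C1]

# Hypothetical data: the vector (1.0, 0.0) occurs twice among m = 4 keywords.
A0 = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
C1 = [[(0, 0.9), (1, 0.9)], [(2, 0.7)]]
E1 = occurrence_frequencies(A0, C1)
```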
S600, according to $C^1$ and $E^1$, obtaining a first selection probability set $S^1=\{S^1_1,S^1_2,\ldots,S^1_k,\ldots,S^1_t\}$ corresponding to $D^1$, where the first selection probability corresponding to $D^1_k$ is $S^1_k=\left(\sum_{x=1}^{r(k)}E^1_{kx}\times C^1_{kx}\right)/r(k)$.
S700, taking the main disease name identifier corresponding to each first intermediate disease name whose $S^1_k$ satisfies $S^1_k>\Delta S^1$ as a first disease name identifier, and obtaining a first disease name identifier list $W^1$, where $\Delta S^1$ is a second preset threshold.
The specific value of $\Delta S^1$ may be set by the practitioner according to the actual situation.
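S600 and S700 together can be sketched as below. The formula is read here as the mean of the products $E^1_{kx}\times C^1_{kx}$ over the $r(k)$ retained similarities; the identifiers `M01` and `M02`, the numeric values, and the threshold 0.2 are hypothetical.

```python
def selection_probabilities(C1, E1):
    """Return S1, where S1[k] = sum_x(E1[k][x] * C1[k][x]) / r(k)."""
    return [sum(e * c for e, c in zip(E_k, C_k)) / len(C_k)
            for C_k, E_k in zip(C1, E1)]

def select_identifiers(D1, S1, name_to_id, delta_s1):
    """Return W1: identifiers of the names in D1 whose probability exceeds delta_s1."""
    return [name_to_id[name] for name, s in zip(D1, S1) if s > delta_s1]

# Hypothetical inputs: similarities and frequencies per first intermediate name.
D1 = ["name_X", "name_Y"]
C1 = [[0.8, 0.8], [0.5]]
E1 = [[0.5, 0.5], [0.25]]
name_to_id = {"name_X": "M01", "name_Y": "M02"}

S1 = selection_probabilities(C1, E1)
W1 = select_identifiers(D1, S1, name_to_id, delta_s1=0.2)
```

With these values `name_X` scores about 0.4 and `name_Y` scores 0.125, so only the identifier of `name_X` passes the threshold.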
S800, according to $A^0$, $W^1$, $L^1$, and $L^2$, obtaining a target medical record name identifier corresponding to the electronic medical record text.
In one embodiment, S800 specifically includes the following steps, as shown in FIG. 2:
S810, according to $A^0$, $W^1$, and $L^1$, obtaining a second disease name identifier list $W^2$ corresponding to the electronic medical record text;
S820, according to $A^0$, $W^2$, and $L^2$, obtaining a third disease name identifier $w^3$ corresponding to the electronic medical record text;
S830, according to $W^1$, $W^2$, and $w^3$, obtaining the target medical record name identifier corresponding to the electronic medical record text.
In one embodiment, $L^1$ includes $P$ preset first secondary disease names, the first secondary disease name identifier corresponding to each first secondary disease name, the main disease name corresponding to each first secondary disease name, and a first secondary disease name vector set $B^1=\{B^1_1,B^1_2,\ldots,B^1_p,\ldots,B^1_P\}$, where $B^1_p$ is the word vector of the first secondary keyword corresponding to the $p$-th first secondary disease name and $p=1,2,\ldots,P$. S810 specifically includes the following steps, as shown in FIG. 3:
S811, according to $W^1$ and $L^1$, determining the first secondary disease names corresponding to the main disease name of each first disease name identifier as first to-be-selected secondary disease names;
S812, according to the first secondary disease name vectors corresponding to all first to-be-selected secondary disease names, obtaining from $B^1$ a first to-be-selected secondary disease name vector set $B^{1d}=\{B^{1d}_1,B^{1d}_2,\ldots,B^{1d}_v,\ldots,B^{1d}_y\}$, where $B^{1d}_v$ is the word vector of the first secondary keyword corresponding to the $v$-th first to-be-selected secondary disease name, $v=1,2,\ldots,y$, and $y$ is the total number of first to-be-selected secondary disease names;
S813, according to $A^0$ and $B^{1d}$, obtaining a first secondary disease name similarity set $C^2=\{C^2_1,C^2_2,\ldots,C^2_i,\ldots,C^2_m\}$ corresponding to $A^0$, where $C^2_i=\{C^2_{i1},C^2_{i2},\ldots,C^2_{iv},\ldots,C^2_{iy}\}$ and $C^2_{iv}$ is the similarity between $A^0_i$ and $B^{1d}_v$;
S814, taking the first secondary disease name corresponding to each $C^2_{iv}$ satisfying $C^2_{iv}>\Delta C^2$ as a second candidate disease name, and obtaining a second candidate disease name list $D^2$, where $\Delta C^2$ is a third preset threshold;
S815, de-duplicating $D^2$ to obtain a second intermediate disease name list $D^3=\{D^3_1,D^3_2,\ldots,D^3_h,\ldots,D^3_H\}$, where $D^3_h$ is the $h$-th second intermediate disease name, $h=1,2,\ldots,H$, and $H$ is the total number of second intermediate disease names;
S816, according to $C^2$ and $D^3$, obtaining a first secondary disease name intermediate similarity set $C^3=\{C^3_1,C^3_2,\ldots,C^3_h,\ldots,C^3_H\}$ corresponding to $D^3$, where $C^3_h=\{C^3_{h1},C^3_{h2},\ldots,C^3_{hf},\ldots,C^3_{hr(h)}\}$, $C^3_{hf}$ is the $f$-th first secondary disease name similarity in $C^2$ that corresponds to $D^3_h$ and is greater than $\Delta C^2$, $f=1,2,\ldots,r(h)$, and $r(h)$ is the total number of such similarities;
S817, according to $C^3$ and $A^0$, obtaining a second occurrence frequency set $E^2=\{E^2_1,E^2_2,\ldots,E^2_h,\ldots,E^2_H\}$ corresponding to $D^3$, where $E^2_h=\{E^2_{h1},E^2_{h2},\ldots,E^2_{hf},\ldots,E^2_{hr(h)}\}$, $E^2_{hf}=Q^2_{hf}/m$, and $Q^2_{hf}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^3_{hf}$;
S818, according to $C^3$ and $E^2$, obtaining a second selection probability set $S^2=\{S^2_1,S^2_2,\ldots,S^2_h,\ldots,S^2_H\}$ corresponding to $D^3$, where the second selection probability corresponding to $D^3_h$ is $S^2_h=\left(\sum_{f=1}^{r(h)}E^2_{hf}\times C^3_{hf}\right)/r(h)$;
S819, taking the first secondary disease name identifier corresponding to each second intermediate disease name whose $S^2_h$ satisfies $S^2_h>\Delta S^2$ as a second disease name identifier, and obtaining a second disease name identifier list $W^2$, where $\Delta S^2$ is a fourth preset threshold.
The first secondary disease name identifier is the unique identifier of a first secondary disease name, and the first secondary keyword corresponding to a first secondary disease name may be obtained by applying a keyword extraction algorithm to that first secondary disease name. Each main disease name corresponds to a plurality of first secondary disease names.
The specific values of $\Delta C^2$ and $\Delta S^2$ may be set by the practitioner according to the actual situation.
In this way, all first secondary disease names are first screened according to $W^1$: the first secondary disease names corresponding to the main disease name of each first disease name identifier in $W^1$ are determined as first to-be-selected secondary disease names, which narrows the range from which the second disease name identifier is obtained. Then, based on the similarity between the word vectors of the initial keywords and the word vectors of the first secondary keywords corresponding to the first secondary disease names, the second candidate disease name list $D^2$ is screened out and de-duplicated to obtain the second intermediate disease name list $D^3$, and the first secondary disease name intermediate similarity set $C^3$ and the second occurrence frequency set $E^2$ corresponding to $D^3$ are obtained. The similarity between the initial keywords and the first secondary keywords can thus be combined with the number of occurrences of the initial keywords in the medical record text to characterize the probability that each second intermediate disease name is selected as a second disease name identifier, so that all second intermediate disease names are screened to obtain the second disease name identifier list, which improves the accuracy of the second disease name identifiers.
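The narrowing performed by S811 and S812 can be sketched as a filter on the parent category. The record layout (dictionaries with `name`, `id`, `main_id`, and `vector` keys) and every value below are hypothetical; the text only states that $L^1$ records, for each first secondary disease name, its identifier, its main disease name, and its keyword vector.

```python
def candidate_secondary_entries(W1, L1):
    """Keep the entries of L1 whose parent main identifier appears in W1."""
    selected = set(W1)
    return [entry for entry in L1 if entry["main_id"] in selected]

# Hypothetical secondary-level records.
L1 = [
    {"name": "sub_a", "id": "S01", "main_id": "M01", "vector": [1.0, 0.0]},
    {"name": "sub_b", "id": "S02", "main_id": "M01", "vector": [0.0, 1.0]},
    {"name": "sub_c", "id": "S03", "main_id": "M02", "vector": [1.0, 1.0]},
]
W1 = ["M01"]   # first disease name identifiers from the previous level

candidates = candidate_secondary_entries(W1, L1)
B1d = [entry["vector"] for entry in candidates]   # vectors compared in S813
```

Only the children of the selected main disease names reach the similarity step, so the second-level comparison runs over `y` vectors instead of all `P`.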
In one embodiment, $\Delta S^1>\Delta S^2$.
$\Delta S^1$ is the threshold for screening first disease name identifiers, and $\Delta S^2$ is the threshold for screening second disease name identifiers. The second disease name identifiers are obtained on the basis of the first disease name identifiers, by further combining the similarity between the initial keywords and the first secondary keywords with the number of occurrences of the initial keywords in the medical record text. If the first disease name identifiers are unreliable, the second disease name identifiers will be unreliable as well, so a larger threshold is set for the first disease name identifiers so that both the first and the second disease name identifiers are accurate.
In a specific embodiment, $L_2$ includes $G$ preset second secondary disease names, the second secondary disease name identifier corresponding to each second secondary disease name, the first secondary disease name corresponding to each second secondary disease name, and a second secondary disease name vector set $B^2=\{B^2_1,B^2_2,\ldots,B^2_g,\ldots,B^2_G\}$, where $B^2_g$ refers to the word vector of the second secondary keyword corresponding to the $g$-th second secondary disease name, $g=1,2,\ldots,G$. S820 specifically includes the following steps, as shown in FIG. 4:
S821, according to $W_2$ and $L_2$, determining each second secondary disease name corresponding to the first secondary disease name of each second disease name identifier as a second to-be-selected secondary disease name;
S822, obtaining from $B^2$, according to the second secondary disease name vectors corresponding to all second to-be-selected secondary disease names, a second to-be-selected secondary disease name vector set $B^{2f}=\{B^{2f}_1,B^{2f}_2,\ldots,B^{2f}_u,\ldots,B^{2f}_Y\}$, where $B^{2f}_u$ refers to the word vector of the second secondary keyword corresponding to the $u$-th second to-be-selected secondary disease name, $u=1,2,\ldots,Y$, and $Y$ refers to the total number of second to-be-selected secondary disease names;
S823, according to $A^0$ and $B^{2f}$, obtaining a second secondary disease name similarity set $C^4=\{C^4_1,C^4_2,\ldots,C^4_i,\ldots,C^4_m\}$ corresponding to $A^0$, where $C^4_i=\{C^4_{i1},C^4_{i2},\ldots,C^4_{iu},\ldots,C^4_{iY}\}$ and $C^4_{iu}$ refers to the similarity between $A^0_i$ and $B^{2f}_u$;
S824, taking the second secondary disease name corresponding to each $C^4_{iu}$ that satisfies $C^4_{iu}>\Delta C_4$ as a third candidate disease name, and obtaining a third candidate disease name list $D^3$, where $\Delta C_4$ is a fifth preset threshold;
S825, de-duplicating $D^3$ to obtain a third intermediate disease name list $D^4=\{D^4_1,D^4_2,\ldots,D^4_\alpha,\ldots,D^4_\beta\}$, where $D^4_\alpha$ refers to the $\alpha$-th third intermediate disease name, $\alpha=1,2,\ldots,\beta$, and $\beta$ is the total number of third intermediate disease names;
S826, according to $C^4$ and $D^4$, obtaining a second secondary disease name intermediate similarity set $C^5=\{C^5_1,C^5_2,\ldots,C^5_\alpha,\ldots,C^5_\beta\}$ corresponding to $D^4$, where $C^5_\alpha=\{C^5_{\alpha 1},C^5_{\alpha 2},\ldots,C^5_{\alpha\gamma},\ldots,C^5_{\alpha r(\alpha)}\}$, $C^5_{\alpha\gamma}$ refers to the $\gamma$-th second secondary disease name similarity in $C^4$ that corresponds to $D^4_\alpha$ and is greater than $\Delta C_4$, $\gamma=1,2,\ldots,r(\alpha)$, and $r(\alpha)$ refers to the total number of second secondary disease name similarities in $C^4$ that correspond to $D^4_\alpha$ and are greater than $\Delta C_4$;
S827, according to $C^5$ and $A^0$, obtaining a third occurrence frequency set $E^3=\{E^3_1,E^3_2,\ldots,E^3_\alpha,\ldots,E^3_\beta\}$ corresponding to $D^4$, where $E^3_\alpha=\{E^3_{\alpha 1},E^3_{\alpha 2},\ldots,E^3_{\alpha\gamma},\ldots,E^3_{\alpha r(\alpha)}\}$, $E^3_{\alpha\gamma}=Q^3_{\alpha\gamma}/m$, and $Q^3_{\alpha\gamma}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^5_{\alpha\gamma}$;
S828, according to $C^5$ and $E^3$, obtaining a third selection probability set $S^3=\{S^3_1,S^3_2,\ldots,S^3_\alpha,\ldots,S^3_\beta\}$ corresponding to $D^4$, where the third selection probability corresponding to $D^4_\alpha$ is $S^3_\alpha=\left(\sum_{\gamma=1}^{r(\alpha)}E^3_{\alpha\gamma}\times C^5_{\alpha\gamma}\right)/r(\alpha)$;
S829, taking the second secondary disease name identifier corresponding to the third intermediate disease name corresponding to $\max(S^3)$ as the third disease name identifier $w_3$.
The second secondary disease name identifier is the unique identifier of a second secondary disease name, and the second secondary keyword corresponding to a second secondary disease name may be obtained by applying a keyword extraction algorithm to that disease name. Each first secondary disease name corresponds to a plurality of second secondary disease names.
The specific value of $\Delta C_4$ may be set by the practitioner according to the actual situation.
Since the second secondary disease name is the name of the last level, once the third selection probability set $S^3$ corresponding to $D^4$ has been obtained, the second secondary disease name identifier corresponding to the third intermediate disease name corresponding to $\max(S^3)$ is used directly as the third disease name identifier $w_3$.
All first secondary disease names are first screened according to $W_2$: the second secondary disease names corresponding to the first secondary disease name of each second disease name identifier in $W_2$ are determined as the second to-be-selected secondary disease names, which narrows the range from which the third disease name identifier is obtained. A third candidate disease name list $D^3$ is then screened out based on the similarity between the word vectors of the initial keywords and the word vectors of the second secondary keywords corresponding to the second to-be-selected secondary disease names, and $D^3$ is de-duplicated to obtain a third intermediate disease name list $D^4$. The second secondary disease name intermediate similarity set $C^5$ corresponding to $D^4$ and the corresponding third occurrence frequency set $E^3$ are then obtained, so that the similarity between the initial keywords and the second secondary keywords can be combined with the occurrence frequency of the initial keywords in the medical record text to represent the probability that each third intermediate disease name is selected as the third disease name identifier. All third intermediate disease names are screened on this basis to obtain the third disease name identifier, which improves the accuracy of the third disease name identifier.
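The selection of the third disease name identifier in S821 to S829 can be sketched as follows. This is an illustrative reading of the steps, not the patentee's implementation: the function names, the dictionary keys `ident`, `parent` and `vector`, the use of cosine similarity at this level, and the representation of word vectors as tuples are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def third_identifier(a0, w2, l2, delta_c4):
    """Return the second secondary identifier with the largest selection probability."""
    m = len(a0)
    counts = Counter(a0)  # Q: occurrences of each keyword vector in A0
    # S821-S822: keep only entries whose parent is a selected first secondary identifier.
    candidates = [rec for rec in l2 if rec["parent"] in w2]
    scores = {}
    for rec in candidates:
        # S823-S826: similarities to every initial keyword, filtered by the threshold.
        # Grouping by candidate performs the de-duplication of S825 implicitly.
        hits = [(vec, cosine(vec, rec["vector"])) for vec in a0]
        hits = [(vec, c) for vec, c in hits if c > delta_c4]
        if not hits:
            continue
        # S827-S828: mean of (occurrence frequency * similarity) over the r hits.
        scores[rec["ident"]] = sum(counts[vec] / m * c for vec, c in hits) / len(hits)
    # S829: the candidate with the maximum selection probability, if any.
    return max(scores, key=scores.get) if scores else None


# Hypothetical data: three initial keywords, one of which occurs twice.
a0 = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
l2 = [
    {"ident": "A0101", "parent": "A01", "vector": (1.0, 0.0)},
    {"ident": "A0102", "parent": "A01", "vector": (0.0, 1.0)},
    {"ident": "B0101", "parent": "B01", "vector": (1.0, 0.0)},
]
print(third_identifier(a0, ["A01"], l2, 0.8))  # A0101
```

In this example `A0101` wins because its matching keyword occurs twice in the record, which doubles its occurrence frequency relative to `A0102`; `B0101` is never scored because its parent is not in the second disease name identifier list.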
In one embodiment, S830 specifically includes the following steps:
S831, taking the first secondary disease name identifier corresponding to $w_3$ as the second disease name identifier $w_2$;
S832, taking the main disease name identifier corresponding to $w_2$ as the first disease name identifier $w_1$;
S833, splicing $w_1$, $w_2$ and $w_3$ in a preset order to obtain the target medical record name identifier $w_1w_2w_3$ corresponding to the electronic medical record text.
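Steps S831 to S833 walk back up the hierarchy from the third identifier and concatenate the three levels. The sketch below assumes the parent relationships are available as two lookup tables; the table names, the function name and the sample codes are hypothetical.

```python
def splice_identifier(w3, second_to_first_secondary, first_secondary_to_main,
                      order=("w1", "w2", "w3")):
    """Derive w2 and w1 from w3 and join the three identifiers in a preset order."""
    w2 = second_to_first_secondary[w3]   # S831: parent of the third identifier
    w1 = first_secondary_to_main[w2]     # S832: parent of the second identifier
    parts = {"w1": w1, "w2": w2, "w3": w3}
    return "".join(parts[key] for key in order)  # S833: splice


# Hypothetical three-level codes used only to show the mechanics.
second_to_first_secondary = {"8": "35"}
first_secondary_to_main = {"35": "K"}
print(splice_identifier("8", second_to_first_secondary, first_secondary_to_main))  # K358
```

Deriving $w_2$ and $w_1$ from $w_3$, rather than choosing each level independently, guarantees that the three parts of the spliced identifier belong to the same branch of the hierarchy.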
In this embodiment, a first candidate disease name list $D^0$ is screened out based on the similarity between the word vectors of the initial keywords and the word vectors of the main keywords corresponding to the main disease names. $D^0$ is de-duplicated to obtain a first intermediate disease name list $D^1$, and the main disease name intermediate similarity set $C^1$ corresponding to $D^1$ and the corresponding first occurrence frequency set $E^1$ are then obtained. The similarity between the initial keywords and the main keywords can thus be combined with the occurrence frequency of the initial keywords in the medical record text to represent the probability that each first intermediate disease name is selected as a first disease name identifier. All first intermediate disease names are screened on this basis to obtain the first disease name identifier list, which improves the accuracy of the first disease name identifiers. The word vectors of the initial keywords, the first disease name identifier list, the first secondary disease name information list $L_1$ and the second secondary disease name information list $L_2$ are then combined to obtain the target medical record name identifier corresponding to the electronic medical record text, which improves the accuracy of that identifier.
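The first-level screening summarized above (S100 to S700 of claim 1, with the cosine similarity of claim 3) can be sketched as follows. The function names, the dictionary keys `ident` and `vector`, the tuple representation of word vectors, and the numeric thresholds are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """C_ij = (A_i . B_j) / (|A_i| * |B_j|), with 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_identifiers(a0, l0, delta_c0, delta_s1):
    """Return the main disease name identifiers whose selection probability exceeds delta_s1."""
    m = len(a0)
    counts = Counter(a0)  # number of occurrences of each keyword vector in A0
    selected = []
    for rec in l0:
        # S100: similarity between every initial keyword and this main keyword.
        sims = [(vec, cosine(vec, rec["vector"])) for vec in a0]
        # S200-S400: keep similarities above the first preset threshold.
        kept = [(vec, c) for vec, c in sims if c > delta_c0]
        if not kept:
            continue
        # S500-S600: average of (occurrence frequency * similarity) over the kept entries.
        probability = sum(counts[vec] / m * c for vec, c in kept) / len(kept)
        # S700: compare with the second preset threshold.
        if probability > delta_s1:
            selected.append(rec["ident"])
    return selected


# Hypothetical data: four initial keywords and two main disease names.
a0 = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
l0 = [
    {"ident": "A", "vector": (1.0, 0.0)},
    {"ident": "B", "vector": (0.0, 1.0)},
]
print(first_identifiers(a0, l0, delta_c0=0.5, delta_s1=0.3))  # ['A']
```

Here name `A` scores about 0.39 and name `B` about 0.21, so only `A` passes the second threshold. Unlike the last level, which takes a single maximum, this level returns every identifier above the threshold.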
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (6)

1. A data processing system for acquiring a disease name identifier of an electronic medical record text, characterized by comprising a processor and a memory storing a computer program, wherein the memory further stores an initial keyword vector set $A^0=\{A^0_1,A^0_2,\ldots,A^0_i,\ldots,A^0_m\}$ of the electronic medical record text, a main disease name information list $L_0$, a first secondary disease name information list $L_1$ and a second secondary disease name information list $L_2$; $A^0_i$ refers to the word vector of the $i$-th initial keyword of the electronic medical record text, $i=1,2,\ldots,m$, and $m$ refers to the total number of initial keywords of the electronic medical record text; $L_0$ comprises $n$ preset main disease names, the main disease name identifier corresponding to each main disease name, and a main disease name vector set $B^0=\{B^0_1,B^0_2,\ldots,B^0_j,\ldots,B^0_n\}$, where $B^0_j$ refers to the word vector of the main keyword corresponding to the $j$-th main disease name, $j=1,2,\ldots,n$; when the computer program is executed by the processor, the following steps are implemented:
S100, according to $A^0$ and $B^0$, obtaining a main disease name similarity set $C^0=\{C^0_1,C^0_2,\ldots,C^0_i,\ldots,C^0_m\}$ corresponding to $A^0$, where $C^0_i=\{C^0_{i1},C^0_{i2},\ldots,C^0_{ij},\ldots,C^0_{in}\}$ and $C^0_{ij}$ refers to the similarity between $A^0_i$ and $B^0_j$;
S200, taking the main disease name corresponding to each $C^0_{ij}$ that satisfies $C^0_{ij}>\Delta C_0$ as a first candidate disease name, and obtaining a first candidate disease name list $D^0$, where $\Delta C_0$ is a first preset threshold;
S300, de-duplicating $D^0$ to obtain a first intermediate disease name list $D^1=\{D^1_1,D^1_2,\ldots,D^1_k,\ldots,D^1_t\}$, where $D^1_k$ refers to the $k$-th first intermediate disease name, $k=1,2,\ldots,t$, and $t$ is the total number of first intermediate disease names;
S400, according to $C^0$ and $D^1$, obtaining a main disease name intermediate similarity set $C^1=\{C^1_1,C^1_2,\ldots,C^1_k,\ldots,C^1_t\}$ corresponding to $D^1$, where $C^1_k=\{C^1_{k1},C^1_{k2},\ldots,C^1_{kx},\ldots,C^1_{kr(k)}\}$, $C^1_{kx}$ refers to the $x$-th main disease name similarity in $C^0$ that corresponds to the main disease name of $D^1_k$ and is greater than $\Delta C_0$, $x=1,2,\ldots,r(k)$, and $r(k)$ refers to the total number of main disease name similarities in $C^0$ that correspond to the main disease name of $D^1_k$ and are greater than $\Delta C_0$;
S500, according to $C^1$ and $A^0$, obtaining a first occurrence frequency set $E^1=\{E^1_1,E^1_2,\ldots,E^1_k,\ldots,E^1_t\}$ corresponding to $D^1$, where $E^1_k=\{E^1_{k1},E^1_{k2},\ldots,E^1_{kx},\ldots,E^1_{kr(k)}\}$, $E^1_{kx}=Q^1_{kx}/m$, and $Q^1_{kx}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^1_{kx}$;
S600, according to $C^1$ and $E^1$, obtaining a first selection probability set $S^1=\{S^1_1,S^1_2,\ldots,S^1_k,\ldots,S^1_t\}$ corresponding to $D^1$, where the first selection probability corresponding to $D^1_k$ is $S^1_k=\left(\sum_{x=1}^{r(k)}E^1_{kx}\times C^1_{kx}\right)/r(k)$;
S700, taking the main disease name identifier corresponding to the first intermediate disease name corresponding to each $S^1_k$ that satisfies $S^1_k>\Delta S_1$ as a first disease name identifier, and obtaining a first disease name identifier list $W_1$, where $\Delta S_1$ is a second preset threshold;
S800, according to $A^0$, $W_1$, $L_1$ and $L_2$, obtaining a target medical record name identifier corresponding to the electronic medical record text, wherein S800 specifically comprises the following steps:
S810, according to $A^0$, $W_1$ and $L_1$, obtaining a second disease name identifier list $W_2$ corresponding to the electronic medical record text;
S820, according to $A^0$, $W_2$ and $L_2$, obtaining a third disease name identifier $w_3$ corresponding to the electronic medical record text;
S830, according to $W_1$, $W_2$ and $w_3$, obtaining the target medical record name identifier corresponding to the electronic medical record text;
S830 specifically includes the following steps:
S831, taking the first secondary disease name identifier corresponding to $w_3$ as the second disease name identifier $w_2$;
S832, taking the main disease name identifier corresponding to $w_2$ as the first disease name identifier $w_1$;
S833, splicing $w_1$, $w_2$ and $w_3$ in a preset order to obtain the target medical record name identifier $w_1w_2w_3$ corresponding to the electronic medical record text.
2. The data processing system of claim 1, wherein the memory further stores the electronic medical record text and the main keyword corresponding to each main disease name, and wherein $A^0$ and $B^0$ are obtained by the following steps:
S10, acquiring a target keyword set $A=\{A_1,A_2,\ldots,A_i,\ldots,A_m\}$ of the target electronic medical record text, where $A_i$ refers to the $i$-th target keyword;
S20, inputting $A$ into a word vector model to obtain the target keyword vector set $A^0=\{A^0_1,A^0_2,\ldots,A^0_i,\ldots,A^0_m\}$ corresponding to $A$, where $A^0_i$ is the target keyword vector corresponding to $A_i$;
S30, inputting the main keyword corresponding to each preset main disease name into the word vector model to obtain the preset main disease name vector set $B^0=\{B^0_1,B^0_2,\ldots,B^0_j,\ldots,B^0_n\}$.
3. The data processing system of claim 1, wherein $C^0_{ij}=(A^0_i\cdot B^0_j)/(\lVert A^0_i\rVert\times\lVert B^0_j\rVert)$, where $\lVert A^0_i\rVert$ is the modulus of $A^0_i$ and $\lVert B^0_j\rVert$ is the modulus of $B^0_j$.
4. The data processing system according to claim 1, wherein $L_1$ includes $P$ preset first secondary disease names, the first secondary disease name identifier corresponding to each first secondary disease name, the main disease name corresponding to each first secondary disease name, and a first secondary disease name vector set $B^1=\{B^1_1,B^1_2,\ldots,B^1_p,\ldots,B^1_P\}$, where $B^1_p$ refers to the word vector of the first secondary keyword corresponding to the $p$-th first secondary disease name, $p=1,2,\ldots,P$, and S810 specifically includes the following steps:
S811, according to $W_1$ and $L_1$, determining each first secondary disease name corresponding to the main disease name of each first disease name identifier as a first to-be-selected secondary disease name;
S812, obtaining from $B^1$, according to the first secondary disease name vectors corresponding to all first to-be-selected secondary disease names, a first to-be-selected secondary disease name vector set $B^{1d}=\{B^{1d}_1,B^{1d}_2,\ldots,B^{1d}_v,\ldots,B^{1d}_y\}$, where $B^{1d}_v$ refers to the word vector of the first secondary keyword corresponding to the $v$-th first to-be-selected secondary disease name, $v=1,2,\ldots,y$, and $y$ refers to the total number of first to-be-selected secondary disease names;
S813, according to $A^0$ and $B^{1d}$, obtaining a first secondary disease name similarity set $C^2=\{C^2_1,C^2_2,\ldots,C^2_i,\ldots,C^2_m\}$ corresponding to $A^0$, where $C^2_i=\{C^2_{i1},C^2_{i2},\ldots,C^2_{iv},\ldots,C^2_{iy}\}$ and $C^2_{iv}$ refers to the similarity between $A^0_i$ and $B^{1d}_v$;
S814, taking the first secondary disease name corresponding to each $C^2_{iv}$ that satisfies $C^2_{iv}>\Delta C_2$ as a second candidate disease name, and obtaining a second candidate disease name list $D^2$, where $\Delta C_2$ is a third preset threshold;
S815, de-duplicating $D^2$ to obtain a second intermediate disease name list $D^3=\{D^3_1,D^3_2,\ldots,D^3_h,\ldots,D^3_H\}$, where $D^3_h$ refers to the $h$-th second intermediate disease name, $h=1,2,\ldots,H$, and $H$ is the total number of second intermediate disease names;
S816, according to $C^2$ and $D^3$, obtaining a first secondary disease name intermediate similarity set $C^3=\{C^3_1,C^3_2,\ldots,C^3_h,\ldots,C^3_H\}$ corresponding to $D^3$, where $C^3_h=\{C^3_{h1},C^3_{h2},\ldots,C^3_{hf},\ldots,C^3_{hr(h)}\}$, $C^3_{hf}$ refers to the $f$-th first secondary disease name similarity in $C^2$ that corresponds to $D^3_h$ and is greater than $\Delta C_2$, $f=1,2,\ldots,r(h)$, and $r(h)$ refers to the total number of first secondary disease name similarities in $C^2$ that correspond to $D^3_h$ and are greater than $\Delta C_2$;
S817, according to $C^3$ and $A^0$, obtaining a second occurrence frequency set $E^2=\{E^2_1,E^2_2,\ldots,E^2_h,\ldots,E^2_H\}$ corresponding to $D^3$, where $E^2_h=\{E^2_{h1},E^2_{h2},\ldots,E^2_{hf},\ldots,E^2_{hr(h)}\}$, $E^2_{hf}=Q^2_{hf}/m$, and $Q^2_{hf}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^3_{hf}$;
S818, according to $C^3$ and $E^2$, obtaining a second selection probability set $S^2=\{S^2_1,S^2_2,\ldots,S^2_h,\ldots,S^2_H\}$ corresponding to $D^3$, where the second selection probability corresponding to $D^3_h$ is $S^2_h=\left(\sum_{f=1}^{r(h)}E^2_{hf}\times C^3_{hf}\right)/r(h)$;
S819, taking the first secondary disease name identifier corresponding to the second intermediate disease name corresponding to each $S^2_h$ that satisfies $S^2_h>\Delta S_2$ as a second disease name identifier, and obtaining a second disease name identifier list $W_2$, where $\Delta S_2$ is a fourth preset threshold.
5. The data processing system of claim 4, wherein $\Delta S_1>\Delta S_2$.
6. The data processing system according to claim 4, wherein $L_2$ includes $G$ preset second secondary disease names, the second secondary disease name identifier corresponding to each second secondary disease name, the first secondary disease name corresponding to each second secondary disease name, and a second secondary disease name vector set $B^2=\{B^2_1,B^2_2,\ldots,B^2_g,\ldots,B^2_G\}$, where $B^2_g$ refers to the word vector of the second secondary keyword corresponding to the $g$-th second secondary disease name, $g=1,2,\ldots,G$, and S820 specifically includes the following steps:
S821, according to $W_2$ and $L_2$, determining each second secondary disease name corresponding to the first secondary disease name of each second disease name identifier as a second to-be-selected secondary disease name;
S822, obtaining from $B^2$, according to the second secondary disease name vectors corresponding to all second to-be-selected secondary disease names, a second to-be-selected secondary disease name vector set $B^{2f}=\{B^{2f}_1,B^{2f}_2,\ldots,B^{2f}_u,\ldots,B^{2f}_Y\}$, where $B^{2f}_u$ refers to the word vector of the second secondary keyword corresponding to the $u$-th second to-be-selected secondary disease name, $u=1,2,\ldots,Y$, and $Y$ refers to the total number of second to-be-selected secondary disease names;
S823, according to $A^0$ and $B^{2f}$, obtaining a second secondary disease name similarity set $C^4=\{C^4_1,C^4_2,\ldots,C^4_i,\ldots,C^4_m\}$ corresponding to $A^0$, where $C^4_i=\{C^4_{i1},C^4_{i2},\ldots,C^4_{iu},\ldots,C^4_{iY}\}$ and $C^4_{iu}$ refers to the similarity between $A^0_i$ and $B^{2f}_u$;
S824, taking the second secondary disease name corresponding to each $C^4_{iu}$ that satisfies $C^4_{iu}>\Delta C_4$ as a third candidate disease name, and obtaining a third candidate disease name list $D^3$, where $\Delta C_4$ is a fifth preset threshold;
S825, de-duplicating $D^3$ to obtain a third intermediate disease name list $D^4=\{D^4_1,D^4_2,\ldots,D^4_\alpha,\ldots,D^4_\beta\}$, where $D^4_\alpha$ refers to the $\alpha$-th third intermediate disease name, $\alpha=1,2,\ldots,\beta$, and $\beta$ is the total number of third intermediate disease names;
S826, according to $C^4$ and $D^4$, obtaining a second secondary disease name intermediate similarity set $C^5=\{C^5_1,C^5_2,\ldots,C^5_\alpha,\ldots,C^5_\beta\}$ corresponding to $D^4$, where $C^5_\alpha=\{C^5_{\alpha 1},C^5_{\alpha 2},\ldots,C^5_{\alpha\gamma},\ldots,C^5_{\alpha r(\alpha)}\}$, $C^5_{\alpha\gamma}$ refers to the $\gamma$-th second secondary disease name similarity in $C^4$ that corresponds to $D^4_\alpha$ and is greater than $\Delta C_4$, $\gamma=1,2,\ldots,r(\alpha)$, and $r(\alpha)$ refers to the total number of second secondary disease name similarities in $C^4$ that correspond to $D^4_\alpha$ and are greater than $\Delta C_4$;
S827, according to $C^5$ and $A^0$, obtaining a third occurrence frequency set $E^3=\{E^3_1,E^3_2,\ldots,E^3_\alpha,\ldots,E^3_\beta\}$ corresponding to $D^4$, where $E^3_\alpha=\{E^3_{\alpha 1},E^3_{\alpha 2},\ldots,E^3_{\alpha\gamma},\ldots,E^3_{\alpha r(\alpha)}\}$, $E^3_{\alpha\gamma}=Q^3_{\alpha\gamma}/m$, and $Q^3_{\alpha\gamma}$ is the number of occurrences in $A^0$ of the word vector of the initial keyword corresponding to $C^5_{\alpha\gamma}$;
S828, according to $C^5$ and $E^3$, obtaining a third selection probability set $S^3=\{S^3_1,S^3_2,\ldots,S^3_\alpha,\ldots,S^3_\beta\}$ corresponding to $D^4$, where the third selection probability corresponding to $D^4_\alpha$ is $S^3_\alpha=\left(\sum_{\gamma=1}^{r(\alpha)}E^3_{\alpha\gamma}\times C^5_{\alpha\gamma}\right)/r(\alpha)$;
S829, taking the second secondary disease name identifier corresponding to the third intermediate disease name corresponding to $\max(S^3)$ as the third disease name identifier $w_3$.
CN202410108326.XA 2024-01-25 2024-01-25 Data processing system for acquiring disease name identification of electronic medical record text Active CN117874235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410108326.XA CN117874235B (en) 2024-01-25 2024-01-25 Data processing system for acquiring disease name identification of electronic medical record text

Publications (2)

Publication Number Publication Date
CN117874235A (en) 2024-04-12
CN117874235B (en) 2024-06-21

Family

ID=90577367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410108326.XA Active CN117874235B (en) 2024-01-25 2024-01-25 Data processing system for acquiring disease name identification of electronic medical record text

Country Status (1)

Country Link
CN (1) CN117874235B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant